Add write every 10K messages #12

Open
wants to merge 1 commit into master

Conversation

christianbundy
Member

Resolves #11

This could probably be much more sophisticated, but given the fact that I have 700 million messages I don't think a few more save points would be too terrible.
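For reference, a minimal sketch of the counter-based checkpoint this describes; the names (`makeCheckpointer`, `writeState`) are illustrative, not the actual diff or flumeview-reduce's API:

```js
// Hypothetical sketch only: count messages since the last save and persist
// the reduce state every SAVE_EVERY of them.
const SAVE_EVERY = 10000 // the threshold discussed in this PR

function makeCheckpointer (writeState) {
  let sinceLastWrite = 0
  return function onMessage (state) {
    sinceLastWrite += 1
    if (sinceLastWrite >= SAVE_EVERY) {
      sinceLastWrite = 0
      writeState(state) // save an intermediate checkpoint mid-index
    }
  }
}

// usage: const onMessage = makeCheckpointer(function (state) { console.log('saved', state) })
```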

@dominictarr
Collaborator

how did you pick ten thousand?

I think a time-based thing would make more sense, because that's what the users feel. Ten thousand messages could be a lot or not much; it's hard to say.

@christianbundy
Member Author

how did you pick ten thousand?

Totally arbitrary, I wanted something between 1 and log.since.value.

I think a time-based thing would make more sense, because that's what the users feel.

Any thoughts on how you'd like to see this implemented?

@arj03
Contributor

arj03 commented Sep 18, 2019

Would be interesting to see a performance test of this.

@dominictarr
Collaborator

the fact that I have 700 million messages

hmm, you must mean 700 megabytes of messages?

@christianbundy hmm, I guess you'd have a timeout (unref it so it doesn't keep the node process going) that causes the write, and clear it if you do a write because it's in sync. Hmm, funny feeling that sometimes I've seen flumeview-reduce write many times, because the reduce is so light that, if you are streaming the records in, it will write many times. Let's ignore that for now, since it isn't doing that currently anyway.

Hmm, also: maybe it would be best to fix it in async-single? and then call that on every write? Then flumeview-reduce can focus on just doing database stuff. If I recall correctly, the way I'm doing clearTimeout and setTimeout is gonna be a bit slow there...
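A rough sketch of that timeout idea, with purely illustrative names (`getState`, `writeState`) rather than flumeview-reduce's or async-single's actual internals:

```js
// Schedule a write on a timer, unref the timer so it doesn't keep the node
// process alive, and cancel it when an in-sync write persists the state anyway.
function makeTimedFlush (getState, writeState, intervalMs) {
  let timer = null

  function schedule () {
    if (timer) return // a flush is already pending
    timer = setTimeout(function () {
      timer = null
      writeState(getState())
    }, intervalMs)
    if (timer.unref) timer.unref() // don't hold the process open just for this
  }

  function cancel () {
    // call this when a write already happened because the view is in sync
    if (timer) {
      clearTimeout(timer)
      timer = null
    }
  }

  return { schedule: schedule, cancel: cancel }
}
```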

@mixmix
Member

mixmix commented Sep 29, 2019

This code looks good to me; seems like a smart improvement.
10k seems high, but for 700MB of 8k messages that would be only ~9 write calls. Many messages will be smaller than that... feels about right, though. I wonder what the mean message size is (backs carefully away from nerd-snipe trap).

@dominictarr
Collaborator

@mixmix mean message size is about 700 bytes, because the most popular message types, contacts (follows) and votes (likes), are small.

@mixmix
Member

mixmix commented Sep 29, 2019 via email

@dominictarr
Collaborator

@mixmix flumedb is meant to be generic; it's not just for ssb. That's why there is no "ssb-" at the start.
Also, to get the correct number of writes, multiply by the number of reduces.

@mixmix
Member

mixmix commented Oct 1, 2019 via email

@dominictarr
Collaborator

@mixmix see the organizational patterns doc https://hackmd.io/IM5_tWIfSFuNoe3jtrjrtQ#well-thought-out-good-ideas

@christianbundy
Member Author

Been thinking about this and I should probably write it down.

Are there any simple gradient descent patterns we could use here? I'm sure there are much better algorithms, but I'm imagining an implementation where you do batches of:

  • 16: if the batch of 16 finishes first, you do 8, 16, 32.
  • 32: if the batch of 32 finishes first, you do 16, 32, 64 again.
  • 64: if the batch of 64 finishes first, you do 32, 64, 128.

I'm sure we have better tools at our disposal, but I'd imagine that any solution that worked here would also work in flumedb/flumeview-level#15.
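A loose sketch of that batch-size search, purely illustrative: time each candidate size, pick the one with the best per-message rate, and re-centre the candidates around it for the next round.

```js
// sizes: candidate batch sizes, e.g. [16, 32, 64]
// timingsMs: how long each batch took to process, in milliseconds
function nextSizes (sizes, timingsMs) {
  let best = sizes[0]
  let bestRate = 0
  for (let i = 0; i < sizes.length; i++) {
    const rate = sizes[i] / timingsMs[i] // messages per millisecond
    if (rate > bestRate) {
      bestRate = rate
      best = sizes[i]
    }
  }
  // halve and double around the winner, never dropping below a batch of 1
  return [Math.max(1, Math.floor(best / 2)), best, best * 2]
}

// e.g. nextSizes([16, 32, 64], [10, 12, 40]) -> [16, 32, 64], centred on 32
```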

@dominictarr
Collaborator

that's a good idea.

hmm, we've had this type of problem in a bunch of places...
maybe the question is how do we model it?
Okay, there is a tradeoff between batch size, latency, and throughput.
But what is the tradeoff? How does it work? Hmm, on TCP each packet has overhead,
so many small packets mean many headers; throughput includes the headers, so the proportion of throughput that is actually payload is smaller.

You could approach this by setting a relative value for latency and throughput. (That value will be hard to define of course)

But also, you could just try to maximize throughput: you'd send small packets, but not if there are many of them... oh hang on, let's say that each packet has a 10-byte overhead. How do you decide it's okay to send one packet a second, but not one packet a millisecond? Hmm, if you had a time box, then you'd maximize throughput within that time... but wouldn't that just be delaying the write until the end of that box? (Not ideal.)

If you know the parameters of the system, say, in TCP the maximum transmission unit is 1500 bytes, there is no reason to buffer bigger than that... but sometimes you don't know what the upper layers in the system will do. (Perhaps there is an encryption layer or other framing that adds more overhead...)

Okay, maybe the right idea is to look at prior art: https://en.wikipedia.org/wiki/Nagle%27s_algorithm

@dominictarr
Collaborator

sorry, brain splurge there

@dominictarr
Collaborator

Turns out that Nagle's algorithm is really simple: we do a write, and if an unacknowledged packet has already been sent, buffer that write until you get the ack.

I already used that pattern in https://www.npmjs.com/package/pull-write: if the write hasn't called back yet, buffer until it does. (pull-write also has a fixed max.)

The nice thing about Nagle is that it doesn't actually require a model of the overhead, but that's also the problem.

I found a recent comment from Nagle himself:
https://developers.slashdot.org/comments.pl?sid=174457&threshold=1&commentsort=0&mode=thread&cid=14515105

The right answer is that TCP should keep track of whether delayed ACKs are "winning" or "losing". A "win" is when, before the 500ms timer runs out, the application replies. Any needed ACK is then coalesced with the next outgoing data packet. A "lose" is when the 500ms timer runs out and the delayed ACK has to be sent anyway. There should be a counter in TCP, incremented on "wins", and reset to 0 on "loses". Only when the counter exceeds some number (5 or so), should ACKs be delayed. That would eliminate the problem automatically, and the need to turn the "Nagle algorithm" on and off.

Here is another one: https://news.ycombinator.com/item?id=9050645, but this time he just says to disable delayed ACKs.
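A minimal sketch of that coalescing pattern, in the spirit of pull-write but not its actual API: if a write is still in flight, hold only the newest state and flush it when the in-flight write calls back.

```js
// writeState(state, cb) stands in for whatever actually persists the view.
function makeCoalescedWriter (writeState) {
  let inFlight = false
  let pending = null

  function write (state) {
    if (inFlight) {
      pending = state // coalesce: only the latest reduce state matters
      return
    }
    inFlight = true
    writeState(state, function onDone () {
      inFlight = false
      if (pending !== null) {
        const next = pending
        pending = null
        write(next) // flush whatever arrived while we were writing
      }
    })
  }

  return write
}
```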

@dominictarr
Collaborator

this is also interesting: https://en.wikipedia.org/wiki/TCP_tuning

High performance networks have very large BDPs. To give a practical example, two nodes communicating over a geostationary satellite link with a round-trip delay time (or round-trip time, RTT) of 0.5 seconds and a bandwidth of 10 Gbit/s can have up to 0.5×10^10 bits, i.e., 5 Gbit = 625 MB of unacknowledged data in flight. Despite having much lower latencies than satellite links, even terrestrial fiber links can have very high BDPs because their link capacity is so large. Operating systems and protocols designed as recently as a few years ago when networks were slower were tuned for BDPs of orders of magnitude smaller, with implications for limited achievable performance.

With a half-second latency, Nagle would cause transmission to go very slowly, of course...

@dominictarr
Collaborator

https://en.wikipedia.org/wiki/Bufferbloat

Some modern network devices have large buffers, but that means TCP doesn't know the link is saturated: because of the extra buffering, delivery stays reliable and TCP doesn't slow down, so it just sends more data, and the buffer takes even longer to drain.

How to test for bufferbloat: https://www.bufferbloat.net/projects/bloat/wiki/Tests_for_Bufferbloat/

Run ping google.com in a terminal, then run a speed test. If the ping gets longer while the speed test runs, then you have bufferbloat. That happened to me! My ping went from 75ms to 300-ish.

@dominictarr
Collaborator

super interesting rabbit hole here: http://blog.cerowrt.org/post/net_neutrality_customers/

@black-puppydog
Contributor

Folks, can we just merge this? You all seem to agree that, for any practical purpose, this works and is a net improvement for whoever is using the library.
And it would solve ssbc/patchwork#935, which can be a real pain for long syncs.
If someone wants to implement a fancy algorithm to replace this, why not? But I'd want to see them feel that pain first.

The only real argument against this that I have seen is from @dominictarr: that this might impact non-ssb applications because the value chosen is too low and causes tons of writes.
If we choose the write threshold big enough, this PR essentially becomes a no-op: most applications will finish indexing before the threshold is reached. So the question is: do we think 10K is close enough to that point?
