redistribute RDY in high throughput, idle producer situations #277

mscso · 2019-12-05T21:06:30Z

When several producer nsqds are registered with an nsqlookupd but only one of them (or at least not all of them) is currently producing messages, the current flat max-in-flight distribution leaves the consumer with fewer messages in flight than we might want.

Consider a situation of 4 hosts / nsqds being used to produce messages on a topic - but only one of them is used at a time (various reasons). Consider a single consumer setting max-in-flight of 8. These are equally spread so each nsqd connection will have a RDY count of 2. Since at any point in time 3 of the 4 nsqds are idle / not producing on the topic, we effectively only ever have 2 messages in flight.

One workaround is to increase the max-in-flight drastically (multiply by nsqd count) but then we might have more messages in flight than our consumer wants if suddenly more than one nsqd is producing messages.
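The arithmetic behind the scenario and the workaround can be sketched as follows. This is a simplified model and not the go-nsq implementation: `perConnRDY` and `effectiveInFlight` are hypothetical helper names, the split is assumed to be plain integer division, and the max-in-flight < len(conns) case is ignored.

```go
package main

import "fmt"

// perConnRDY models a flat distribution: max-in-flight split evenly across
// all connections (integer division; any remainder is ignored here).
func perConnRDY(maxInFlight, conns int) int {
	if conns <= 0 {
		return 0
	}
	return maxInFlight / conns
}

// effectiveInFlight returns how many messages can actually be in flight
// when only `producing` of the `conns` connections have messages to send.
func effectiveInFlight(maxInFlight, conns, producing int) int {
	if producing > conns {
		producing = conns
	}
	return perConnRDY(maxInFlight, conns) * producing
}

func main() {
	// Scenario from the description: 4 nsqds, max-in-flight 8, 1 active producer.
	fmt.Println(perConnRDY(8, 4))           // 2
	fmt.Println(effectiveInFlight(8, 4, 1)) // 2

	// Workaround: multiply max-in-flight by the nsqd count (8 * 4 = 32).
	fmt.Println(effectiveInFlight(32, 4, 1)) // 8, the desired value
	fmt.Println(effectiveInFlight(32, 4, 4)) // 32, far more than the consumer wants
}
```

The last line shows the downside of the workaround: as soon as all four nsqds produce, the consumer is exposed to four times its intended max-in-flight.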

Because we constantly deal with this situation (automatically scheduled producer containers that move between hosts), we implemented a second RDY redistribution function that trades RDY count from an unused nsqd connection to a "busy" nsqd connection.

Since this might not be useful or wanted in every use case, the feature is only enabled with the config flag RDYTrading.

The code is similar to the normal code in redistributeRDY for the max-in-flight < len(conns) situation but here it essentially deals with max-in-flight > len(producing_conns).
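To illustrate the idea of trading, here is a minimal standalone sketch. It is not the code from this PR: the `conn` type, the `tradeRDY` function, and the rule that every idle connection keeps a small reserve of RDY are all assumptions made for illustration. It also does not cover trading RDY back when an idle connection becomes busy again.

```go
package main

import "fmt"

// conn is a hypothetical stand-in for one consumer connection to an nsqd.
type conn struct {
	addr string
	rdy  int  // RDY count currently assigned to this connection
	busy bool // true if this connection has recently delivered messages
}

// tradeRDY moves surplus RDY from idle connections to busy ones.
// Each idle connection keeps `keep` RDY (an assumption of this sketch) so it
// can still deliver a message once its nsqd starts producing again.
// The total RDY across all connections is unchanged.
func tradeRDY(conns []*conn, keep int) {
	var busy []*conn
	for _, c := range conns {
		if c.busy {
			busy = append(busy, c)
		}
	}
	if len(busy) == 0 {
		return // nothing to trade to
	}
	next := 0
	for _, c := range conns {
		if c.busy || c.rdy <= keep {
			continue
		}
		surplus := c.rdy - keep
		target := busy[next%len(busy)] // round-robin over busy connections
		next++
		fmt.Printf("moving %d RDY from %s to %s\n", surplus, c.addr, target.addr)
		c.rdy -= surplus
		target.rdy += surplus
	}
}

func main() {
	// max-in-flight 8 spread flat over 4 connections; only nsqd-c is producing.
	conns := []*conn{
		{addr: "nsqd-a:4150", rdy: 2},
		{addr: "nsqd-b:4150", rdy: 2},
		{addr: "nsqd-c:4150", rdy: 2, busy: true},
		{addr: "nsqd-d:4150", rdy: 2},
	}
	tradeRDY(conns, 1)
	for _, c := range conns {
		fmt.Println(c.addr, c.rdy)
	}
}
```

In this sketch the busy connection ends up with a RDY count of 5 and each idle connection keeps 1, so the sum stays at the configured max-in-flight of 8.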

Let me know what you think and whether this could be useful for others and thus whether you think it could be merged upstream.

NSQ2019/12/05 14:26:34 DBG 1 [foo/bar] looking for RDY trade possibilities...
NSQ2019/12/05 14:26:34 DBG 1 [foo/bar] - moving 3 RDY from 10.13.2.51:4150 to 10.13.2.85:4150
NSQ2019/12/05 14:26:34 DBG 1 [foo/bar] - moving 3 RDY from 10.13.2.39:4150 to 10.13.2.85:4150

mscso · 2019-12-05T21:28:51Z

And right after I open a PR I realize that I don't correctly trade back RDY when nsqd conns go from idle / unused to busy / used. Will amend.

ploxiln · 2019-12-05T23:24:12Z

I think this kind of improvement would be welcome ... but historically the ready-count distribution has been the trickiest logic, when you add in backoff and such. If we can be sure that this new logic doesn't get "stuck" in an on or off state when it shouldn't, then we probably would not want a new config option. Ideally, the code works well, and there would be no need to "turn it off if it causes problems".

Although I'm currently one of the nsq maintainers, I'm not the biggest contributor to this particular repo, and don't have a lot of familiarity with this implementation of the ready-count logic 😅 so I can't promise very prompt review.

jehiah · 2020-01-16T19:02:35Z

@mscso thanks for starting this discussion/effort; I appreciate many of the challenges you described around uneven message distribution, as I've experienced them as well.

As @ploxiln mentioned, there is good appetite for improving this aspect of nsq / go-nsq, but an equal dose of caution, because a one-size-fits-all solution has been elusive so far despite being desired.

I have a few high-level thoughts on how we might think of changing the paradigm to resolve this at a higher level:

In the case of an uneven distribution would pruning the presence of 'unused' topics help? I.e. when you have 3 of 4 nsqds that aren't getting messages on a topic, safely 'delete' the topic on 2 of those. Sort of a cluster "tidy" option. (it would be great if nsqd exposed how long it's been since the last message was received on a topic for this, and had a flag so that delete only applied if the topic and all channels were empty to make this easy/safe)
A related area that has me thinking about RDY distribution is making it easier to influence a zone/region priority in a cloud environment; go-nsq gives you the ability to turn that flow on/off with nsq.DiscoveryFilter, but perhaps we need to expose some interface for overriding the RDY distribution w/ other strategies. Perhaps this could be facilitated w/ a concept of priority for connections: if a connection provided a priority, nsqd could prefer higher priority connections when it has a message to send out, and fall back to lower priority connections only when needed. (I.e. when connecting to a same-zone source a consumer would set priority=20, when connecting to a same-region source it might set priority=10, and when connecting to another region it might set priority=0; then nsqd would prefer same-zone first, then same-region, then any available consumer, all w/o having specific knowledge of what might influence priority.)
I wonder what information would make this logic easier; if nsqd had a way to signal its queue depth to consumers, the consumer could more effectively start at a max-in-flight of 1 and throttle up specific connections to a concurrency limit based on where messages are actually backed up.
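As a rough illustration of the connection-priority idea, here is a small sketch of how a sender could choose among consumers. Nothing here exists in nsqd or go-nsq today: the `consumer` type, its `priority` field, and the `pick` function are hypothetical, and the priority values are the ones from the example above.

```go
package main

import "fmt"

// consumer is a hypothetical view of one client connection as seen by nsqd.
type consumer struct {
	name     string
	priority int // higher is preferred; hypothetical, set by the client
	rdy      int // how many more messages this client is ready to receive
}

// pick returns the consumer with the highest priority that still has RDY
// available, or nil if no consumer is ready.
func pick(cs []*consumer) *consumer {
	var best *consumer
	for _, c := range cs {
		if c.rdy <= 0 {
			continue
		}
		if best == nil || c.priority > best.priority {
			best = c
		}
	}
	return best
}

func main() {
	cs := []*consumer{
		{name: "other-region", priority: 0, rdy: 1},
		{name: "same-region", priority: 10, rdy: 1},
		{name: "same-zone", priority: 20, rdy: 1},
	}
	// Send four messages; each delivery consumes one RDY on the chosen consumer.
	for i := 0; i < 4; i++ {
		c := pick(cs)
		if c == nil {
			fmt.Println("no consumer ready")
			break
		}
		fmt.Println(c.name)
		c.rdy--
	}
}
```

The loop prefers same-zone, falls back to same-region and then other-region as RDY runs out, and stops once no consumer is ready, without the sender knowing what the priorities mean.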

Do you feel any of these would better fit your needs and reduce complexities around max-in-flight settings?

cc: @mreiferson @ploxiln

redistribute RDY in high throughput, idle producer situations

97efe67

reworking RDY trading again

b16e388

jehiah added the enhancement label Jan 16, 2020

ploxiln mentioned this pull request Mar 1, 2020

reduce duplicate rdy update requests #282

Merged

heipei mentioned this pull request Mar 1, 2022

Redistribute RDY among NSQD connections dudleycarr/nsqjs#380

Open
