Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HubEvent subscription responses are out of order #2143

Open
BlinkyStitt opened this issue Jul 9, 2024 · 1 comment
Open

HubEvent subscription responses are out of order #2143

BlinkyStitt opened this issue Jul 9, 2024 · 1 comment
Labels
s-ready Ready to be picked up

Comments

@BlinkyStitt
Copy link
Contributor

BlinkyStitt commented Jul 9, 2024

What is the bug?

There is something wrong with how hubble is sending HubEvents to a subscription. They are supposed to be ordered by id, but sometimes they are being sent out of order.

I mad

How can it be reproduced? (optional)

I made these two simple apps that only subscribe and then log when messages are out of order: https://gist.github.com/BlinkyStitt/619706df5aac39e601ff0b5e6a85e88b. I can move it into a proper repo with the protobuf files if you need me to.

The node one runs fine, but the rust one throws errors very quickly. At first we thought this meant there was a bug in the rust library. But when I captured some packets and printed them out with wireshark, I can see the events out of order on the wire.

  • packet 30 has a message with id 452947924307972
  • packet 31 has a message with id 452947924307971
Screenshot 2024-07-08 at 1 57 54 PM Screenshot 2024-07-08 at 1 58 01 PM

capture.pcap.zip

Even though this node gist doesn't see the bug, our production node code does. In fact, one time it saw an event id out of order by more than 1 million.

@github-actions github-actions bot added the s-triage Needs to be reviewed, designed and prioritized label Jul 9, 2024
@sds sds added s-ready Ready to be picked up and removed s-triage Needs to be reviewed, designed and prioritized labels Jul 9, 2024
@sds
Copy link
Member

sds commented Jul 13, 2024

Thanks for the report!

For anyone else running into this: the underlying cause is due to a quirk of how async events are handled in the combination of Rust + TypeScript that Hubble uses. It is more likely to be hit in hubs under high load.

The fix is likely not quick/easy—it would require porting the entire Hubble gRPC server to Rust as well—a large undertaking.

For that reason, we recommend anyone who processes events should:

  • Not reject an event just because it has an earlier event ID than the latest event ID you've seen so far
  • Use a "recently seen" cache to avoid double processing events, instead of relying on the event ID

FWIW, Warpcast uses the cache technique to avoid double processing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
s-ready Ready to be picked up
Projects
None yet
Development

No branches or pull requests

2 participants