AlterNAT at scale Questions #108
Thanks for the question. We have observed non-zero values in the ENA metrics as well. These most likely happen because of microbursts in traffic, which are pretty hard to avoid. You can keep upsizing instances until you get to the max to see if that resolves it. It will definitely help, as you observed and as AWS states, such as in this article. If it doesn't, fixing microbursts can be challenging. This article mentions some advanced strategies for mitigating bursts if you cannot scale horizontally (e.g. in your case, since you have the constraint of a single subnet/route table), but those approaches are not going to work with Lambdas. If you haven't already seen it, this article discusses how you can measure PPS limits if you want to test different instance types. You may also be able to set up packet captures and look for retransmits to see how widespread the problem actually is. I opened #107 to make it easier to expose these metrics with Alternat.
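In case it's useful, here is a minimal sketch of reading those counters directly on the instance. It assumes the ENA driver is in use and that the primary interface is `eth0`; adjust the interface name for your AMI.

```python
# Sketch: read the ENA *_allowance_exceeded counters on the NAT instance.
# Assumes the ENA driver and that the primary interface is eth0; adjust as needed.
import re
import subprocess


def ena_allowance_counters(interface: str = "eth0") -> dict[str, int]:
    """Return the *_allowance_exceeded counters reported by `ethtool -S`."""
    output = subprocess.run(
        ["ethtool", "-S", interface], capture_output=True, text=True, check=True
    ).stdout
    counters = {}
    for line in output.splitlines():
        match = re.match(r"\s*(\w*allowance_exceeded):\s+(\d+)", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


if __name__ == "__main__":
    # A non-zero pps_allowance_exceeded means the instance exceeded its PPS
    # allowance and packets were queued or dropped.
    for name, value in ena_allowance_counters().items():
        print(f"{name}: {value}")
```

These counters are cumulative since the interface came up, so sampling them periodically and diffing is what tells you when the bursts actually happen.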
@thedoomcrewinc Does this help at all? Are you going to try some larger instances or anything as a next step?
@bwhaley Apologies for the delay in updating. We're in a push for our Back to School effort, and I won't be able to test this until after Aug 1st. I'll update shortly thereafter.
Follow-up as promised: After testing various instance class and size combos up to a c7gn.16xlarge, we determined that the sweet spot was a c6gn.8xlarge instance. We too observed non-zero values; however, as you observed and commented, we can safely say we are not worried about the impact. Microbursts do occur, but in general we don't worry about them. I'll report back in September after the vast majority of schools are back in session and our traffic levels have stabilized at the new "normal".
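For anyone comparing instance sizes the same way, here is a rough sketch of estimating observed packets per second from the standard EC2 `NetworkPacketsOut` metric. The instance ID and region below are placeholders.

```python
# Sketch: estimate packets/sec on a NAT instance from CloudWatch.
# The instance ID and region are placeholders; AWS/EC2 NetworkPacketsOut
# reports a packet count per collection period.
from datetime import datetime, timedelta, timezone

import boto3

PERIOD = 300  # seconds; use 60 if detailed monitoring is enabled


def peak_period_pps(instance_id: str, region: str = "us-east-1") -> float:
    """Peak per-period average packets/sec over the last hour."""
    cloudwatch = boto3.client("cloudwatch", region_name=region)
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="NetworkPacketsOut",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(hours=1),
        EndTime=end,
        Period=PERIOD,
        Statistics=["Sum"],
    )
    points = resp["Datapoints"]
    if not points:
        return 0.0
    # Each datapoint's Sum is the packet count for one period.
    return max(p["Sum"] for p in points) / PERIOD


if __name__ == "__main__":
    print(f"peak avg pps over the last hour: {peak_period_pps('i-0123456789abcdef0'):.0f}")
```

Period averages won't show the microbursts mentioned above, so treat this as a floor and rely on the ENA allowance-exceeded counters to confirm actual drops.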
Thanks, I appreciate that you're following up here!
Our Dev/Staging environments quite frankly suck: low traffic at best, and in testing we encountered no issues at all (this is good).
When we tried to simulate a production environment with AlterNAT, we noticed the instance would start to drop traffic once we reached higher Lambda execution volumes. We hit the instance's PPS limit and packets were dropped.
Increasing the NAT instance class seemed to help.
Our Production environment is a different beast.
The vast majority of our NAT traffic is from Lambda executions (occasionally bursting past 300,000 executions per minute).
I'm concerned about hitting a PPS limit and having drops in production.
Since you stated you send several PB of traffic, I'm going to guess your traffic is a lot more than ours (it would make sense).
Our short-lived Lambdas (for SSO, DynamoDB lookups, small API requests) are all quick in and out, but our long-running Lambdas can run upwards of 6 minutes (data continues to flow to/from the browser during this, so the PPS does not stop).
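For a rough sense of scale, a back-of-envelope sketch follows; the packets-per-invocation numbers and the traffic mix are purely illustrative assumptions, not measurements from our workload.

```python
# Back-of-envelope: translate Lambda execution volume into an approximate
# steady-state NAT PPS figure. The per-invocation packet counts and the
# short/long split are illustrative guesses; plug in numbers from your own
# packet captures before drawing conclusions.
EXECUTIONS_PER_MINUTE = 300_000        # burst rate mentioned above
SHORT_LIVED_SHARE = 0.9                # assumed share of quick in-and-out Lambdas
PACKETS_PER_SHORT_INVOCATION = 40      # assumed: handshake plus small request/response
PACKETS_PER_LONG_INVOCATION = 4_000    # assumed: ~6 min of steady browser traffic

executions_per_second = EXECUTIONS_PER_MINUTE / 60
pps = executions_per_second * (
    SHORT_LIVED_SHARE * PACKETS_PER_SHORT_INVOCATION
    + (1 - SHORT_LIVED_SHARE) * PACKETS_PER_LONG_INVOCATION
)
print(f"~{pps:,.0f} packets/sec through the NAT instance")
# With these guesses: 5,000/s * (0.9*40 + 0.1*4000) ≈ 2.18M pps, a number to
# compare against whatever limit you measure for a given instance type.
```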
Without going into specifics:
All of our Lambdas use a single subnet with a NAT Gateway in that subnet. I unfortunately cannot change that, as re-engineering the architecture is not feasible until winter 2025.
(This has been transferred from an email with @bwhaley for public visibility and comments)