"Reset" to NAT instances after failover #90

Open
dan-greene-brivo opened this issue Apr 19, 2024 · 2 comments


dan-greene-brivo commented Apr 19, 2024

I'm putting this here to see if there's any interest in adding in the ability to "fall back" to the NAT instances after a failover due to curl failure. Or am I missing something that will set it back automatically?

I'm working on the code anyway, so I'm happy to make a PR if you think it's useful.

Right now, my first thought is to update the connectivity check Lambdas so that, the first time through, they check the route table and, if it's set to a NAT Gateway, change it to the NAT instance just before the first check. That way, if the instance is still down, the route will immediately be changed back. Effective, but it will cause a connectivity blip every minute while failed over to the NAT Gateway.

Option 2 is to have a separate Lambda on a separate schedule (maybe every 15 minutes by default, or only on demand?) that, if the route tables are using NAT Gateways, runs an "Instance Refresh" on the ASG, forcing it to re-create the instances. In theory, we could also just terminate the instances, and the ASG would do its thing.
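To make Option 2 concrete, here is a minimal sketch of what such a reset Lambda might look like. It is not code from this project: the function names and the `ROUTE_TABLE_IDS` and `ASG_NAME` environment variables are hypothetical, and it assumes the failover is visible as a `0.0.0.0/0` route that targets a NAT Gateway. The boto3 clients are passed in so the logic can be exercised without AWS access.

```python
import os


def routes_via_nat_gateway(ec2, route_table_ids):
    """Return the IDs of route tables whose default route targets a NAT Gateway."""
    failed_over = []
    resp = ec2.describe_route_tables(RouteTableIds=list(route_table_ids))
    for table in resp["RouteTables"]:
        for route in table.get("Routes", []):
            is_default = route.get("DestinationCidrBlock") == "0.0.0.0/0"
            if is_default and route.get("NatGatewayId"):
                failed_over.append(table["RouteTableId"])
                break
    return failed_over


def reset_if_failed_over(ec2, autoscaling, route_table_ids, asg_name):
    """Start an ASG instance refresh if any route table is failed over."""
    failed_over = routes_via_nat_gateway(ec2, route_table_ids)
    if failed_over:
        autoscaling.start_instance_refresh(AutoScalingGroupName=asg_name)
    return failed_over


def handler(event, context):
    # Hypothetical Lambda entry point; boto3 is imported lazily so the
    # helpers above can be tested with stand-in clients.
    import boto3

    ids = os.environ["ROUTE_TABLE_IDS"].split(",")  # assumed variable name
    asg_name = os.environ["ASG_NAME"]               # assumed variable name
    reset = reset_if_failed_over(
        boto3.client("ec2"), boto3.client("autoscaling"), ids, asg_name
    )
    return {"reset_route_tables": reset}
```

The sketch only triggers the refresh; it relies on the replacement instances claiming the route themselves when they boot.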

Thoughts?

bwhaley (Member) commented Apr 20, 2024

If I understand correctly, what you're proposing is the following:

  1. NAT instance fails connectivity checks for some reason.
  2. Connectivity checker Lambda notices the failure and replaces the route to go through the NAT gateway.
  3. Now the NAT instance is sitting around doing nothing.
  4. Some time later, the NAT instance is able to connect again.
  5. There should be a process to automatically switch back to the NAT instance.

Did I understand correctly?

If so, the first challenge is how to know that the NAT instance has connectivity again. The route table now points to the NAT gateway. You'd need either:

  1. Another, different route table that points to the NAT instance. Put a Lambda in a subnet that uses this route table and have it check connectivity. If connectivity succeeds, update the route to point to the instance again.
  2. Or, have the NAT instance itself check its connection and update the route once connectivity is working.
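As an illustration of the second approach, the following is a rough sketch of a script the NAT instance could run on a timer. The function names, the probe URL, and the idea of passing in the route table ID and the instance's network interface ID are all assumptions made for the example, not part of this project. Note that a check run on the instance itself only exercises the instance's own egress path, not the forwarding path used by the private subnets.

```python
import urllib.request


def has_connectivity(url, timeout=5.0):
    """Return True if an HTTP request to `url` succeeds within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def reclaim_route(ec2, route_table_id, eni_id,
                  check=has_connectivity, url="https://www.example.com"):
    """Point the default route back at this instance's ENI if the check passes.

    `ec2` is expected to behave like a boto3 EC2 client; the probe URL is a
    placeholder. Returns True when the route was replaced.
    """
    if not check(url):
        return False
    ec2.replace_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        NetworkInterfaceId=eni_id,
    )
    return True
```

On a real instance this might be called as `reclaim_route(boto3.client("ec2"), "rtb-...", "eni-...")`, with the IDs taken from whatever configuration the instance already uses to claim the route at boot.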

I don't think we can use a solution like your first proposal because we do not want a "connectivity blip" - remaining connected is our highest priority. Remember that the connectivity checker runs every minute (by default) so you'd be interrupting the connection quite a lot, potentially, if the NAT instance is still broken.

Option (2) could have sorta the same problem. It could trigger an instance replacement, and the new instances would automatically claim the route at boot, as usual. But if the new instance still can't connect because the problem is somewhere else (e.g. the connectivity failure is not due to the NAT instance itself, but some AWS networking issue), then you'd end up in a loop where the new Lambda runs again, finds the NAT gateway as the route, terminates the instance, rinse & repeat.
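One possible way to bound that loop is to cap how many replacements are attempted within a time window. The sketch below is only an illustration of that idea: the function name and default limits are made up, and where the attempt timestamps would be stored (an instance tag, a parameter store entry, and so on) is left open.

```python
def should_replace(previous_attempts, now, max_attempts=3, window_seconds=3600):
    """Decide whether another instance replacement is allowed.

    `previous_attempts` is a list of timestamps (seconds) of earlier
    replacements; `now` is the current timestamp in the same units.
    Returns False once `max_attempts` replacements fall inside the window.
    """
    recent = [t for t in previous_attempts if now - t < window_seconds]
    return len(recent) < max_attempts
```

With the defaults, a fourth replacement inside the same hour is refused, so a failure that is outside the instance's control stops triggering terminations until the window has passed.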

I like the idea of a self-healing NAT instance, just need to find a practical approach.

dan-greene-brivo (Author) commented

I’ll start with just a lambda that resets the system while we figure out the least impactful time/mechanism to call it.
