Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pulse-go doesn't recover from dropped amqp #7

Open
escapewindow opened this issue Jan 12, 2022 · 9 comments
Open

pulse-go doesn't recover from dropped amqp #7

escapewindow opened this issue Jan 12, 2022 · 9 comments

Comments

@escapewindow
Copy link

escapewindow commented Jan 12, 2022

It looks like we drop a message in the logs:

fmt.Println("AMQP channel closed - has the connection dropped?")

This resulted in two instances of cloudops-jenkins not responding to hg.m.o events, which halts rolling out changes to the FirefoxCI tc cluster.

We were wondering if we could either

  • add louder notifications: a slack alert, email, ?
  • auto-recover, whether that's killing pulse-go for a restart, killing the container for a restart, reconnecting to amqp, ? I'm not sure if this would be on the first failure or after t time or n failed attempts or what.

or both.

@petemoore any thoughts?

@petemoore
Copy link
Member

petemoore commented Jan 13, 2022

Oh wow, I had no idea this is used anywhere in production. It wasn't really intended for production use, but that should have probably been written explicitly in the README. Presumably, somebody found it and started using it. Apologies about that. I wrote it back in 2015 before we even had generic-worker mostly as a command line utility during development for sniffing pulse messages to help troubleshoot issues.

My memory is a little hazy about the internals, but yes, it seems sensible to try to reconnect if the connections drops, or to exit the process, rather than let it remain running but not working.

@petemoore
Copy link
Member

petemoore commented Jan 13, 2022

By the way, if you are using it to listen to taskcluster exchanges, there are bindings available in the taskcluster go client, e.g. see this test example.

In any case, if this really is to be used in production, we should certainly harden it, and make sure it is tested for robustness.

@jbuck
Copy link

jbuck commented Jan 13, 2022

Heh, it's being used by https://github.com/mozilla-services/cloudops-deployment-proxy for listening for changes to ci-admin/ci-configuration and then triggering Jenkins builds

@escapewindow
Copy link
Author

escapewindow commented Jan 13, 2022

@jbuck if we have pulse-go exit on failure, can we then reattach with another process easily? If so, that seems like a fairly straightforward solution.

Other options: pulse-go reattaches; we switch cloudops-deployment-proxy to the taskcluster go client. Between those two I'd probably look at taskcluster go client first, but I defer to the two of you :)

@petemoore
Copy link
Member

I think I'd agree - explicitly exiting in pulse-go if the connection drops, together with a script like:

#!/bin/bash

# if pulse-go dies, restart it - can happen if amqp connection drops
while true; do
  pulse-go ..... || true
done

@jbuck
Copy link

jbuck commented Jan 13, 2022

it looks like we're using the go integration directly, not a separate process. If the go library killed the process though that'd be fine, we can just tell systemd to restart infinitely

@escapewindow
Copy link
Author

Do we replace the fmt.Println() with a panic() ? Would it make sense to have a counter+sleep and die on attempt 5 or something, so we avoid going in an infinite crash/restart loop?

@petemoore
Copy link
Member

petemoore commented Jan 14, 2022

I think the cleanest would be for the library to signal that the connection dropped (e.g. by closing a channel), and the calling code to receive the signal and then it choose to panic or e.g. call os.Exit. The issue with having a panic directly inside the library code (where the issue is detected) is that other apps that import the library may not want their process to be terminated if the connection drops.

@jbuck Can you link to the go code that imports the library?

@jcristau
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants