Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection seems to hang/be stuck until connection is closed #140

Closed
Wouter0100 opened this issue Nov 7, 2023 · 8 comments · Fixed by #153
Closed

Connection seems to hang/be stuck until connection is closed #140

Wouter0100 opened this issue Nov 7, 2023 · 8 comments · Fixed by #153

Comments

@Wouter0100
Copy link

We have been debugging an issue that we're stuck with debugging, and we'd love to ask for your help to continue debugging this issue.

Unfortunately, it seems that we have connection issues while using AMQProxy. Sometimes, at seemingly random intervals, it seems our workers (that "work" on messages in queues) get stuck. They hang, with literally no CPU usage (0m). As this was at random intervals (sometimes weeks in between these issues), we were unable to reproduce it easily.

Because of the above, we started to implement a message in each queue that would send a GET request to Healthchecks.io. Then, we would know when a worker was hanging. Now, we found out that this scheduler (that dispatches these messages in the queues and does not listen on any queues) experiences seemingly the exact same issue: it succeeds to dispatch a message to the RabbitMQ queue a couple of times, and then - constantly - hangs at rhougly the same point after 60 seconds.

When the scheduler hangs, we're unable to send any signals to it and the CPU usage is also stuck at 0m. My thinking here is that it waits for a socket to respond something. When I kill the AMQProxy (at any time, even 3 days later), it all of a sudden logs this error:

[2023-11-07T13:07:02.280204+00:00] messenger.CRITICAL: Error thrown while handling message App\Message\Healthcheck\QueuePing. Removing from transport after 0 retries. Error: "Handling "App\Message\Healthcheck\QueuePing" failed: Library error: connection closed unexpectedly" {"class":"App\\Message\\Healthcheck\\QueuePing","retryCount":0,"error":"Handling \"App\\Message\\Healthcheck\\QueuePing\" failed: Library error: connection closed unexpectedly","exception":"[object] (Symfony\\Component\\Messenger\\Exception\\HandlerFailedException(code: 0): Handling \"App\\Message\\Healthcheck\\QueuePing\" failed: Library error: connection closed unexpectedly at /app/vendor/symfony/doctrine-bridge/Messenger/DoctrineTransactionMiddleware.php:42)\n[previous exception] [object] (Symfony\\Component\\Messenger\\Exception\\TransportException(code: 0): Library error: connection closed unexpectedly at /app/vendor/symfony/amqp-messenger/Transport/AmqpSender.php:74)\n[previous exception] [object] (AMQPException(code: 0): Library error: connection closed unexpectedly at /app/vendor/symfony/amqp-messenger/Transport/Connection.php:341)"} []

Interestingly enough, it seems to always happen after 60 seconds. When we configure the scheduler to dispatch 3 messages every 1 minute (to 3 different queues), the problem appeared after 60 seconds. When we configured it to do that every 10 or 1 seconds, the problems also appeared after 60 seconds. However, it is very interesting that connecting directly to RabbitMQ works and does not hang after 60 seconds.

After we enabled --debug, it did not show more useful information. The only thing we saw was:

2023-11-07 13:02:51 UTC: Client connected: 100.64.1.94:60728
2023-11-07 13:02:52 UTC: Client connected: 100.64.1.94:60730
2023-11-07 13:02:52 UTC: Client connected: 100.64.1.94:60732

With no disconnects, errors or other issues in the AMQProxy logs after 60 seconds.

What we tried is:

  • Connecting to RabbitMQ via AMQProxy (both hosted in Kubernetes): reproduceable
  • Directly connecting to RabbitMQ (hosted in Kubernetes): no problems
  • Connecting to CloudAMQP via AMQProxy (AMQProxy hosted in Kubernetes): reproduceable
  • Directly connecting to CloudAMQP: no problems

We are using the PHP amqp extension V1.11.0.

@dentarg
Copy link
Member

dentarg commented Nov 7, 2023

@Wouter0100 can you share which version of amqproxy you are using?

@Wouter0100
Copy link
Author

Ah, sure! I should've added that to begin with.

We first tested with 0.8.2, but upgraded towards the latest 0.8.14 to ensure there wasn't a release that would fix this. The same for RabbitMQ, where we upgraded from 3.10.2 to 3.12.8 to ensure RabbitMQ wasn't the problem.

@dentarg
Copy link
Member

dentarg commented Nov 7, 2023

Can you share some code that demonstrates the problem? It would help us in investigating the issue.

This repo actually have some testing with the PHP extension. Maybe it can be extended or modified to show the problem.

@carlhoerberg
Copy link
Member

What's the heartbeat value the php client negotiate? Is it 60s? Might be that we don't send a heartbeat frame as the client expects...

@Wouter0100
Copy link
Author

I've been trying to reproduce this issue by the way, and create a proper test for it - but thus far I have been unable to reproduce it within a test, interestingly enough. I will be putting some additional time into it rather soon.

@syntaxerro
Copy link

We have the same problem, last night our scheduler hang. At first, he dispatched some messages correctly, but they did not reach the rabbit server. Then he stopped dispatching completely.

An additional entry then appeared in the logs
2024-02-28T19:10:25.450706Z INFO amq_proxy.channel_pool[remote_address: "[::1]:59742"] Adding upstream connection

I'm not sure it's a problem with amqproxy, or the client that does disptach, or the rabbit cluster itself. We will also debug and look for the cause.

@dentarg
Copy link
Member

dentarg commented Feb 29, 2024

Looks like you are using v2.0.0-rc.1 @syntaxerro from the logs you shared, which was tagged very recently (2024-02-20). You might want to try v1.0.0 it this issue becomes a blocker for you, but any debugging / more info you can provide would be appreciated.

@dentarg
Copy link
Member

dentarg commented Feb 29, 2024

What's the heartbeat value the php client negotiate? Is it 60s? Might be that we don't send a heartbeat frame as the client expects...

@carlhoerberg https://github.com/php-amqp/php-amqp is using the C library and the default seems to be heartbeats disabled

There's this comment in stubs/AMQPConnection.php saying

Note, librabbitmq has limited heartbeat support, which means heartbeats checked only during blocking calls

There's also this not that old issue alanxz/rabbitmq-c#788 discussing a bug and fix (alanxz/rabbitmq-c@08fb168) that hasn't been released yet

carlhoerberg added a commit that referenced this issue Mar 9, 2024
Server should send heartbeat frames to client depending on the
negotiated heartbeat interval.

Probably fixes #140
carlhoerberg added a commit that referenced this issue Mar 9, 2024
Server should send heartbeat frames to client depending on the
negotiated heartbeat interval.

Probably fixes #140
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants