Possible race condition with Boost.ASIO cancel() in timeout handling #10139

Open
julianbrost opened this issue Aug 28, 2024 · 1 comment
Labels
bug Something isn't working ref/IP

Comments

@julianbrost
Contributor

A customer reported an issue where the master would randomly stop attempting to connect to some agents. The relevant log messages for an affected agent:

14:24:55 critical/ApiListener: Timeout while reconnecting to endpoint 'agent-123' via host 'agent-123.example.com' and port '5665', cancelling attempt
14:24:55 information/ApiListener: New client connection for identity 'agent-123.example' to [10.20.30.40]:5665

However, there were no subsequent log messages for that agent with an "Operation canceled" error, as there should be in case of a timeout, which suggests a problem with the timeout mechanism.

My current theory is that cancel() doesn't really do what we'd need it to do here:

Timeout::Ptr timeout(new Timeout(strand->context(), *strand, boost::posix_time::microseconds(int64_t(GetConnectTimeout() * 1e6)),
    [sslConn, endpoint, host, port](asio::yield_context yc) {
        Log(LogCritical, "ApiListener")
            << "Timeout while reconnecting to endpoint '" << endpoint->GetName() << "' via host '" << host
            << "' and port '" << port << "', cancelling attempt";
        boost::system::error_code ec;
        sslConn->lowest_layer().cancel(ec);
    }
));
Defer cancelTimeout([&timeout]() { timeout->Cancel(); });

Note that the documentation for cancel() says (emphasis by me):

This function causes all outstanding asynchronous connect, send and receive operations to finish immediately, and the handlers for cancelled operations will be passed the boost::asio::error::operation_aborted error.

What happens if the connection is making progress right when the timeout fires? Might there be no outstanding operation that could be cancelled, rendering the timeout ineffective?
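To make the suspected window concrete, here is a toy model of the semantics quoted from the documentation, using only the standard library. It is not Boost.ASIO and not Icinga 2 code: `ToySocket`, `Outcome` and `RunAttempt` are hypothetical names made up for this illustration. It only shows that a cancel arriving in the gap between two operations finds nothing to abort, so the next operation starts normally and is no longer covered by any timeout.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy model (hypothetical, NOT Boost.ASIO): cancelling only affects
// operations that are outstanding at the moment Cancel() is called.
class ToySocket {
public:
    using Handler = std::function<void(const std::string&)>;

    // Start an asynchronous operation; its handler runs when it finishes.
    void Start(Handler handler) { m_Pending.push_back(std::move(handler)); }

    // All outstanding operations finish successfully.
    void Complete() { Finish("success"); }

    // Abort whatever is outstanding right now and return how many operations
    // were aborted. Nothing is remembered for operations started later.
    std::size_t Cancel() { return Finish("operation_aborted"); }

    std::size_t Outstanding() const { return m_Pending.size(); }

private:
    std::size_t Finish(const std::string& result) {
        std::vector<Handler> handlers;
        handlers.swap(m_Pending);
        for (auto& handler : handlers) handler(result);
        return handlers.size();
    }

    std::vector<Handler> m_Pending;
};

struct Outcome {
    std::vector<std::string> results; // what each finished handler saw
    std::size_t stillOutstanding;     // operations left without any timeout
};

// Simulates a connection attempt made of two consecutive steps. The timeout
// fires exactly once: either while step 1 is outstanding, or in the gap
// after step 1 has finished and before step 2 has been started.
Outcome RunAttempt(bool timeoutFiresBetweenSteps) {
    ToySocket socket;
    Outcome outcome{{}, 0};
    auto record = [&outcome](const std::string& r) { outcome.results.push_back(r); };

    socket.Start(record);                      // step 1, e.g. the TCP connect
    if (timeoutFiresBetweenSteps) {
        socket.Complete();                     // step 1 finishes first
        socket.Cancel();                       // timeout: nothing outstanding, no-op
    } else {
        socket.Cancel();                       // timeout: step 1 is aborted
    }

    if (outcome.results.back() == "success") { // continue only if step 1 succeeded
        socket.Start(record);                  // step 2, e.g. the TLS handshake
    }
    outcome.stillOutstanding = socket.Outstanding();
    return outcome;
}
```

In this model `RunAttempt(false)` ends with an aborted step and nothing left pending, while `RunAttempt(true)` ends with step 2 still outstanding although the timeout has already fired. That would be consistent with the log above, where the connection is established right after the timeout message and no "Operation canceled" error follows.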

The customer has reported no further issues since increasing the related timeout, which supports the theory that the problem is related to the timeout handling.

Possible fix: call shutdown() on the TCP layer instead.

ref/IP/44784

@julianbrost julianbrost added bug Something isn't working ref/IP labels Aug 28, 2024