Impossible to trigger error "Failed to start coordinator client" #131
Comments
I agree; at least on the surface, that seems eminently reasonable. It's been a little while since I poked at the rpc layer, so, out of interest, have you tried making that change? Did you run into any immediate issues?
This commit addresses GitHub Issue mit-dci#131. That issue describes how it's impossible to trigger the error "Failed to start coordinator client" in sentinel_2pc::controller::init (src/uhs/twophase/sentinel_2pc/controller.cpp) because the call to cluster_connect in rpc::tcp_client::init always returns true. This commit changes the call to allow cluster_connect to return either true or false. Signed-off-by: Michael L. Szulczewski <[email protected]>
A previous commit fixed the bug described in GitHub Issue mit-dci#131. That change caused the unit test tcp_rpc_test.send_fail_test in tcp_test.cpp to fail. This commit enables the test to pass by changing an assertion to test for new behavior resulting from the bug fix. Signed-off-by: Michael L. Szulczewski <[email protected]>
GitHub Issue mit-dci#131 identified a bug that made it impossible to trigger the error "Failed to start coordinator client" in sentinel_2pc::controller::init. Since a previous commit fixed the bug, the error can now be triggered. This commit adds a unit test that specifically triggers the error. Signed-off-by: Michael L. Szulczewski <[email protected]>
Unfortunately, changing false to true causes a unit test to fail. The good news is that, as written, I think it should fail. Here’s the test up to the failing assertion (tests/unit/rpc/tcp_test.cpp):
Before I changed false to true,
I made the proposed change in Pull Request #135. In addition to the bug fix itself, the PR modifies the unit test that failed, so that it accounts for the new behavior expected after the bug fix. Finally, the PR adds a new unit test to trigger the error that was previously impossible to trigger. Right now the PR is a draft. It has passed all the checks, so if you agree with the changes we're discussing, I'll click "Ready for review".
sentinel_2pc::controller::init contains a bug that makes it impossible to trigger the error "Failed to start coordinator client" (see GitHub Issue mit-dci#131). This commit fixes the bug using the method described in Issue mit-dci#131. It adds a new unit test to test triggering the error, and also modifies a different unit test to account for behavior that's expected after the bug fix. Signed-off-by: Michael L. Szulczewski <[email protected]>
Affected Branch
trunk
Basic Diagnostics
I've pulled the latest changes on the affected branch and the issue is still present.
The issue is reproducible in docker
Description
Consider the following branch in sentinel_2pc::controller::init (src/uhs/twophase/sentinel_2pc/controller.cpp, lines 38-41):

The top line checks if the coordinator client fails to start. If it does fail, the next lines should log the error "Failed to start coordinator client" and should return from the method with a value of false. Surprisingly, it is impossible to execute this branch because m_coordinator_client.init always returns true.

The culprit seems to be rpc::tcp_client::init (src/util/rpc/tcp_client.hpp, lines 54-68), which is in the call stack of m_coordinator_client.init. According to its description, rpc::tcp_client::init “connects to … server endpoints”. Its body indicates that if the connection fails, it should return false:

However, it can never return false. This is due to the fact that cluster_connect is called with its 2nd argument equal to false. Since cluster_connect(m_server_endpoints, false) always returns true, its caller rpc::tcp_client::init always returns true, and ultimately m_coordinator_client.init always returns true. Here's a summary of the call stack:

Because sentinel_2pc::controller::init never raises an error if the coordinator client fails to start, an infinite loop can be triggered. The loop is in the body of sentinel_2pc::controller::send_compact_tx (src/uhs/twophase/sentinel_2pc/controller.cpp, lines 194-203):

When the coordinator client fails to start but an error is not raised, calls to m_coordinator_client.execute_transaction will return false and the loop will execute indefinitely. The author apparently recognized this possibility, but there’s currently no code to log error or debug messages and gracefully exit the loop.

Triggering the infinite loop is easy: just provide a junk coordinator IP address (e.g. “abcdefg”) for the unit tests in controller_test.cpp. Specifically, replace lines 24-25 in SetUp in controller_test.cpp with the following:

How should this be addressed? It seems reasonable to just replace the call to m_net.cluster_connect(m_server_endpoints, false) with m_net.cluster_connect(m_server_endpoints, true), which would allow it to return false if the cluster connection fails. This would allow sentinel_2pc::controller::init to fail if the coordinator client fails to start and would preclude triggering the infinite loop.
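To make the reasoning above easier to follow, here is a minimal, self-contained C++ sketch of the behavior described in this issue. It is not the project's code. The function bodies, the parameter name error_fatal, the try_connect helper, and the bounded send_with_retry loop are all assumptions made for illustration. The sketch only models the claim that a connection failure is reported to the caller when the 2nd argument is true and swallowed when it is false.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-in for one connection attempt (assumption): only the
// loopback address is treated as reachable in this model.
bool try_connect(const std::string& host) {
    return host == "127.0.0.1";
}

// Simplified model of the behavior the issue attributes to cluster_connect:
// a failed connection is reported only when error_fatal is true.
// The parameter name error_fatal is an assumption.
bool cluster_connect(const std::vector<std::string>& hosts, bool error_fatal) {
    for(const auto& host : hosts) {
        if(!try_connect(host) && error_fatal) {
            return false;
        }
    }
    return true;
}

// Model of the client init; it forwards the flag so that the current
// behavior (false) and the proposed behavior (true) can be compared.
bool client_init(const std::vector<std::string>& hosts, bool error_fatal) {
    return cluster_connect(hosts, error_fatal);
}

// Model of the controller init: log and fail if the client cannot start.
bool controller_init(const std::vector<std::string>& hosts, bool error_fatal) {
    if(!client_init(hosts, error_fatal)) {
        std::cerr << "Failed to start coordinator client" << std::endl;
        return false;
    }
    return true;
}

// Hypothetical bounded alternative to retrying forever: give up and log
// after max_attempts failed calls to send.
bool send_with_retry(const std::function<bool()>& send,
                     std::size_t max_attempts) {
    assert(max_attempts > 0);
    for(std::size_t attempt = 0; attempt < max_attempts; ++attempt) {
        if(send()) {
            return true;
        }
    }
    std::cerr << "Giving up after " << max_attempts << " attempts"
              << std::endl;
    return false;
}
```

In this model, controller_init({"abcdefg"}, false) returns true even though nothing is reachable, which mirrors the unreachable error branch, while controller_init({"abcdefg"}, true) logs the error and returns false. The bounded loop shows one possible way to exit gracefully when every send attempt fails; whether a retry limit is appropriate for the real sentinel is a design question for the maintainers.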