Impossible to trigger error "Failed to start coordinator client" #131

Closed
2 of 3 tasks
mszulcz-mitre opened this issue Jun 13, 2022 · 3 comments · Fixed by #135
Labels
fix/bug Fixes errant behavior

Comments

@mszulcz-mitre
Contributor

Affected Branch

trunk

Basic Diagnostics

  • I've pulled the latest changes on the affected branch and the issue is still present.

  • The issue is reproducible in Docker.

Description

Consider the following branch in sentinel_2pc::controller::init (src/uhs/twophase/sentinel_2pc/controller.cpp, lines 38-41):

        if(!m_coordinator_client.init()) {
            m_logger->error("Failed to start coordinator client");
            return false;
        }

The first line checks whether the coordinator client failed to start. If it did, the next lines should log the error "Failed to start coordinator client" and return false from the method. Surprisingly, it is impossible to execute this branch because m_coordinator_client.init always returns true.

The culprit seems to be rpc::tcp_client::init (src/util/rpc/tcp_client.hpp, lines 54-68), which is in the call stack of m_coordinator_client.init. According to its description, rpc::tcp_client::init “connects to … server endpoints”. Its body indicates that if the connection fails, it should return false:

    [[nodiscard]] auto init() -> bool {
        if(!m_net.cluster_connect(m_server_endpoints, false)) {
            return false;
        }
        …
    }

However, it can never return false, because cluster_connect is called with its second argument set to false. Since cluster_connect(m_server_endpoints, false) always returns true, its caller rpc::tcp_client::init always returns true, and ultimately m_coordinator_client.init always returns true. Here's a summary of the call stack, outermost caller first, with a small sketch of the flag's effect after the list:

  • m_coordinator_client.init (always true)
    • coordinator::rpc::init (always true)
      • rpc::tcp_client::init (always true)
        • m_net.cluster_connect(m_server_endpoints, false) (always true)
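
To make the behavior concrete, here is a minimal, self-contained sketch of the pattern; the parameter name error_fatal and the connection logic are assumptions for illustration, not the actual cbdc::network implementation:

    #include <cassert>
    #include <string>
    #include <utility>
    #include <vector>

    // Illustrative stand-in for the endpoint type used by the network layer.
    using endpoint_t = std::pair<std::string, unsigned short>;

    // Sketch of a cluster_connect-style function whose second flag decides
    // whether a failed connection is fatal. When the flag is false, failures
    // are tolerated and the function reports success regardless.
    auto cluster_connect(const std::vector<endpoint_t>& endpoints,
                         bool error_fatal) -> bool {
        for(const auto& ep : endpoints) {
            (void)ep; // a real implementation would try to connect to ep here
            auto connected = false; // pretend every connection attempt fails
            if(!connected && error_fatal) {
                return false; // reachable only when error_fatal is true
            }
        }
        return true; // with error_fatal == false, this is the only outcome
    }

    int main() {
        const auto eps = std::vector<endpoint_t>{{"abcdefg", 5555}};
        assert(cluster_connect(eps, false)); // the failure is swallowed
        assert(!cluster_connect(eps, true)); // the failure propagates
        return 0;
    }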

Because sentinel_2pc::controller::init never raises an error if the coordinator client fails to start, an infinite loop can be triggered. The loop is in the body of sentinel_2pc::controller::send_compact_tx (src/uhs/twophase/sentinel_2pc/controller.cpp, lines 194-203):

    // TODO: add a "retry" error response to offload sentinels from this
    //       infinite retry responsibility.
    while(!m_coordinator_client.execute_transaction(ctx, cb)) {
        // TODO: the network currently doesn't provide a callback for
        //       reconnection events so we have to sleep here to
        //       prevent a needless spin. Instead, add such a callback
        //       or queue to the network to remove this sleep.
        static constexpr auto retry_delay = std::chrono::milliseconds(100);
        std::this_thread::sleep_for(retry_delay);
    };

When the coordinator client fails to start but no error is raised, calls to m_coordinator_client.execute_transaction return false and the loop executes indefinitely. The TODO comments show the author recognized this possibility, but there is currently no code to log an error or debug message and exit the loop gracefully.
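
Purely as an illustration of what "gracefully exit the loop" could look like (this helper is hypothetical and not part of the repository), a bounded-retry wrapper would let the caller log an error and bail out after a fixed number of attempts:

    #include <cassert>
    #include <chrono>
    #include <cstddef>
    #include <functional>
    #include <thread>

    // Hypothetical helper: retry an operation a bounded number of times,
    // sleeping between attempts, instead of looping forever.
    auto retry_with_limit(const std::function<bool()>& op,
                          std::size_t max_attempts,
                          std::chrono::milliseconds delay) -> bool {
        for(std::size_t i = 0; i < max_attempts; i++) {
            if(op()) {
                return true;
            }
            std::this_thread::sleep_for(delay);
        }
        return false; // the caller can now log an error and exit gracefully
    }

    int main() {
        // An operation that never succeeds (as when the coordinator client
        // failed to start) makes the helper give up after the limit rather
        // than spinning indefinitely.
        const auto ok = retry_with_limit([] { return false; },
                                         5,
                                         std::chrono::milliseconds(1));
        assert(!ok);
        return 0;
    }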

Triggering the infinite loop is easy: just provide a junk coordinator IP address (e.g. “abcdefg”) for the unit tests in controller_test.cpp. Specifically, replace lines 24-25 in SetUp in controller_test.cpp with the following:

        const auto coordinator_ep
            = std::make_pair("abcdefghijklmnopqrstuvwxyz", m_coordinator_port);

How should this be addressed? It seems reasonable to just replace the call to m_net.cluster_connect(m_server_endpoints, false) with m_net.cluster_connect(m_server_endpoints, true), which would let it return false if the cluster connection fails. sentinel_2pc::controller::init could then fail when the coordinator client fails to start, which would preclude triggering the infinite loop.
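
Concretely, the proposed change is a one-line flip of the flag in rpc::tcp_client::init; the sketch below simply mirrors the block quoted above with true in place of false:

    [[nodiscard]] auto init() -> bool {
        // Passing true lets a failed cluster connection propagate as false,
        // so sentinel_2pc::controller::init can report "Failed to start
        // coordinator client".
        if(!m_net.cluster_connect(m_server_endpoints, true)) {
            return false;
        }
        …
    }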

Code of Conduct

  • I agree to follow this project's Code of Conduct
mszulcz-mitre added the fix/bug (Fixes errant behavior) label Jun 13, 2022
@HalosGhost
Collaborator

It seems reasonable to just replace [false] with [true]…

I agree; at least on the surface, that seems eminently reasonable. It's been a little while since I poked the rpc layer; so, out of interest, have you given a shot to changing that? Did you run into any immediate issues?

HalosGhost removed their assignment Jun 17, 2022
mszulcz-mitre added a commit to mszulcz-mitre/opencbdc-tx that referenced this issue Jun 19, 2022
This commit addresses GitHub Issue mit-dci#131.  That issue describes how it's
impossible to trigger the error "Failed to start coordinator client" in
sentinel_2pc::controller::init
(src/uhs/twophase/sentinel_2pc/controller.cpp) because the call to
cluster_connect in rpc::tcp_client::init always returns true.  This
commit changes the call to allow cluster_connect to return either true
or false.

Signed-off-by: Michael L. Szulczewski <[email protected]>
mszulcz-mitre added a commit to mszulcz-mitre/opencbdc-tx that referenced this issue Jun 19, 2022
A previous commit fixed the bug described in GitHub Issue mit-dci#131.
That change caused the unit test tcp_rpc_test.send_fail_test
in tcp_test.cpp to fail.  This commit enables the test to pass by
changing an assertion to test for new behavior resulting from the bug
fix.

Signed-off-by: Michael L. Szulczewski <[email protected]>
mszulcz-mitre added a commit to mszulcz-mitre/opencbdc-tx that referenced this issue Jun 19, 2022
GitHub Issue mit-dci#131 identified a bug that made it impossible to trigger
the error "Failed to start coordinator client"
in sentinel_2pc::controller::init.  Since a previous commit fixed the bug,
the error can now be triggered.  This commit adds a unit test that
specifically triggers the error.

Signed-off-by: Michael L. Szulczewski <[email protected]>
@mszulcz-mitre
Contributor Author

Unfortunately, changing false to true causes a unit test to fail. The good news is that, as written, I think it should fail. Here’s the test up to the failing assertion (tests/unit/rpc/tcp_test.cpp):

TEST(tcp_rpc_test, send_fail_test) {
    using request = int64_t;
    using response = int64_t;

    auto client = cbdc::rpc::tcp_client<request, response>(
        {{cbdc::network::localhost, 55555},
         {cbdc::network::localhost, 55556}});
    ASSERT_TRUE(client.init());

Before I changed false to true, client.init() was guaranteed to return true and the assertion always passed. After the change, it became possible for it to return false, and the assertion failed. In this test, I think it actually should fail because the test doesn't instantiate a server for the client to connect to. Without a server, the client initialization fails. This connection failure was already happening before the bug fix, but it was suppressed because cluster_connect in rpc::tcp_client::init was forced to always return true.
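
Assuming the fix is applied, the natural adjustment (the exact change made in the PR may differ) is for send_fail_test to expect initialization to fail, since no server is listening on those ports:

    auto client = cbdc::rpc::tcp_client<request, response>(
        {{cbdc::network::localhost, 55555},
         {cbdc::network::localhost, 55556}});
    // With the fix, the connection failure now propagates out of init():
    ASSERT_FALSE(client.init());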

@mszulcz-mitre
Contributor Author

mszulcz-mitre commented Jun 19, 2022

I made the proposed change in Pull Request #135. In addition to the bug fix itself, the PR modifies the unit test that failed so that it accounts for the new behavior expected after the fix. Finally, the PR adds a new unit test that triggers the error that was previously impossible to reach.

Right now the PR is a draft. It's passed all the checks, so if you agree with the changes we're discussing, I'll click "Ready for review".
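
For context, the new test could look roughly like the sketch below. The fixture and member names are hypothetical stand-ins, but the idea follows the SetUp change described earlier: point the sentinel at an unresolvable coordinator host and expect init() to fail.

    // Sketch only: the fixture name and members are hypothetical; assume
    // SetUp() configured the controller with the unresolvable coordinator
    // endpoint shown earlier.
    TEST_F(sentinel_2pc_test, init_fails_when_coordinator_client_fails) {
        // The coordinator client cannot connect, so with the fix the
        // previously unreachable error branch is now taken.
        ASSERT_FALSE(m_ctl.init());
    }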

HalosGhost linked a pull request Aug 23, 2022 that will close this issue
HalosGhost pushed a commit that referenced this issue Aug 23, 2022
sentinel_2pc::controller::init contains a bug that makes it impossible to
trigger the error "Failed to start coordinator client" (see GitHub Issue #131).
This commit fixes the bug using the method described in Issue #131.  It
adds a new unit test to test triggering the error, and also modifies a
different unit test to account for behavior that's expected after the bug fix.

Signed-off-by: Michael L. Szulczewski <[email protected]>