Impossible to trigger error "Failed to start coordinator client" #131

Closed
2 of 3 tasks
mszulcz-mitre opened this issue Jun 13, 2022 · 3 comments · Fixed by #135
Labels
fix/bug Fixes errant behavior

Comments

@mszulcz-mitre
Contributor

Affected Branch

trunk

Basic Diagnostics

  • I've pulled the latest changes on the affected branch and the issue is still present.

  • The issue is reproducible in Docker.

Description

Consider the following branch in sentinel_2pc::controller::init (src/uhs/twophase/sentinel_2pc/controller.cpp, lines 38-41):

        if(!m_coordinator_client.init()) {
            m_logger->error("Failed to start coordinator client");
            return false;
        }

The first line checks whether the coordinator client failed to start. If it did, the next lines should log the error "Failed to start coordinator client" and return false from the method. Surprisingly, it is impossible to execute this branch because m_coordinator_client.init always returns true.

The culprit seems to be rpc::tcp_client::init (src/util/rpc/tcp_client.hpp, lines 54-68), which is in the call stack of m_coordinator_client.init. According to its description, rpc::tcp_client::init “connects to … server endpoints”. Its body indicates that if the connection fails, it should return false:

    [[nodiscard]] auto init() -> bool {
        if(!m_net.cluster_connect(m_server_endpoints, false)) {
            return false;
        }
        …
    }

However, it can never return false, because cluster_connect is called with its second argument set to false. Since cluster_connect(m_server_endpoints, false) always returns true, its caller rpc::tcp_client::init always returns true, and ultimately m_coordinator_client.init always returns true. Here's a summary of the call stack, outermost caller first, with a small sketch of the flag's effect after the list:

  • m_coordinator_client.init (always true)
    • coordinator::rpc::init (always true)
      • rpc::tcp_client::init (always true)
        • m_net.cluster_connect(m_server_endpoints, false) (always true)
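
To make the behavior concrete, here is a minimal, self-contained sketch of the pattern; the parameter name error_fatal and the connection logic are assumptions for illustration, not the actual cbdc::network implementation:

    #include <cassert>
    #include <string>
    #include <utility>
    #include <vector>

    // Illustrative stand-in for the endpoint type used by the network layer.
    using endpoint_t = std::pair<std::string, unsigned short>;

    // Sketch of a cluster_connect-style function whose second flag decides
    // whether a failed connection is fatal. When the flag is false, failures
    // are tolerated and the function reports success regardless.
    auto cluster_connect(const std::vector<endpoint_t>& endpoints,
                         bool error_fatal) -> bool {
        for(const auto& ep : endpoints) {
            (void)ep; // a real implementation would try to connect to ep here
            auto connected = false; // pretend every connection attempt fails
            if(!connected && error_fatal) {
                return false; // reachable only when error_fatal is true
            }
        }
        return true; // with error_fatal == false, this is the only outcome
    }

    int main() {
        const auto eps = std::vector<endpoint_t>{{"abcdefg", 5555}};
        assert(cluster_connect(eps, false)); // the failure is swallowed
        assert(!cluster_connect(eps, true)); // the failure propagates
        return 0;
    }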

Because sentinel_2pc::controller::init never raises an error if the coordinator client fails to start, an infinite loop can be triggered. The loop is in the body of sentinel_2pc::controller::send_compact_tx (src/uhs/twophase/sentinel_2pc/controller.cpp, lines 194-203):

    // TODO: add a "retry" error response to offload sentinels from this
    //       infinite retry responsibility.
    while(!m_coordinator_client.execute_transaction(ctx, cb)) {
        // TODO: the network currently doesn't provide a callback for
        //       reconnection events so we have to sleep here to
        //       prevent a needless spin. Instead, add such a callback
        //       or queue to the network to remove this sleep.
        static constexpr auto retry_delay = std::chrono::milliseconds(100);
        std::this_thread::sleep_for(retry_delay);
    };

When the coordinator client fails to start but no error is raised, calls to m_coordinator_client.execute_transaction return false and the loop executes indefinitely. The TODO comments show the author recognized this possibility, but there is currently no code to log an error or debug message and exit the loop gracefully.
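
Purely as an illustration of what "gracefully exit the loop" could look like (this helper is hypothetical and not part of the repository), a bounded-retry wrapper would let the caller log an error and bail out after a fixed number of attempts:

    #include <cassert>
    #include <chrono>
    #include <cstddef>
    #include <functional>
    #include <thread>

    // Hypothetical helper: retry an operation a bounded number of times,
    // sleeping between attempts, instead of looping forever.
    auto retry_with_limit(const std::function<bool()>& op,
                          std::size_t max_attempts,
                          std::chrono::milliseconds delay) -> bool {
        for(std::size_t i = 0; i < max_attempts; i++) {
            if(op()) {
                return true;
            }
            std::this_thread::sleep_for(delay);
        }
        return false; // the caller can now log an error and exit gracefully
    }

    int main() {
        // An operation that never succeeds (as when the coordinator client
        // failed to start) makes the helper give up after the limit rather
        // than spinning indefinitely.
        const auto ok = retry_with_limit([] { return false; },
                                         5,
                                         std::chrono::milliseconds(1));
        assert(!ok);
        return 0;
    }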

Triggering the infinite loop is easy: just provide a junk coordinator IP address (e.g. “abcdefg”) for the unit tests in controller_test.cpp. Specifically, replace lines 24-25 in SetUp in controller_test.cpp with the following:

        const auto coordinator_ep
            = std::make_pair("abcdefghijklmnopqrstuvwxyz", m_coordinator_port);

How should this be addressed? It seems reasonable to just replace the call to m_net.cluster_connect(m_server_endpoints, false) with m_net.cluster_connect(m_server_endpoints, true), which would let it return false if the cluster connection fails. sentinel_2pc::controller::init could then fail when the coordinator client fails to start, which would preclude triggering the infinite loop.
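
Concretely, the proposed change is a one-line flip of the flag in rpc::tcp_client::init; the sketch below simply mirrors the block quoted above with true in place of false:

    [[nodiscard]] auto init() -> bool {
        // Passing true lets a failed cluster connection propagate as false,
        // so sentinel_2pc::controller::init can report "Failed to start
        // coordinator client".
        if(!m_net.cluster_connect(m_server_endpoints, true)) {
            return false;
        }
        …
    }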

Code of Conduct

  • I agree to follow this project's Code of Conduct
mszulcz-mitre added the fix/bug (Fixes errant behavior) label Jun 13, 2022
@HalosGhost
Collaborator

It seems reasonable to just replace [false] with [true]…

I agree; at least on the surface, that seems eminently reasonable. It's been a little while since I poked the rpc layer; so, out of interest, have you given a shot to changing that? Did you run into any immediate issues?

HalosGhost removed their assignment Jun 17, 2022
mszulcz-mitre added a commit to mszulcz-mitre/opencbdc-tx that referenced this issue Jun 19, 2022
This commit addresses GitHub Issue mit-dci#131.  That issue describes how it's
impossible to trigger the error "Failed to start coordinator client" in
sentinel_2pc::controller::init
(src/uhs/twophase/sentinel_2pc/controller.cpp) because the call to
cluster_connect in rpc::tcp_client::init always returns true.  This
commit changes the call to allow cluster_connect to return either true
or false.

Signed-off-by: Michael L. Szulczewski <[email protected]>
mszulcz-mitre added a commit to mszulcz-mitre/opencbdc-tx that referenced this issue Jun 19, 2022
A previous commit fixed the bug described in GitHub Issue mit-dci#131.
That change caused the unit test tcp_rpc_test.send_fail_test
in tcp_test.cpp to fail.  This commit enables the test to pass by
changing an assertion to test for new behavior resulting from the bug
fix.

Signed-off-by: Michael L. Szulczewski <[email protected]>
mszulcz-mitre added a commit to mszulcz-mitre/opencbdc-tx that referenced this issue Jun 19, 2022
GitHub Issue mit-dci#131 identified a bug that made it impossible to trigger
the error "Failed to start coordinator client"
in sentinel_2pc::controller::init.  Since a previous commit fixed the bug,
the error can now be triggered.  This commit adds a unit test that
specifically triggers the error.

Signed-off-by: Michael L. Szulczewski <[email protected]>
@mszulcz-mitre
Contributor Author

Unfortunately, changing false to true causes a unit test to fail. The good news is that, as written, I think it should fail. Here’s the test up to the failing assertion (tests/unit/rpc/tcp_test.cpp):

TEST(tcp_rpc_test, send_fail_test) {
    using request = int64_t;
    using response = int64_t;

    auto client = cbdc::rpc::tcp_client<request, response>(
        {{cbdc::network::localhost, 55555},
         {cbdc::network::localhost, 55556}});
    ASSERT_TRUE(client.init());

Before I changed false to true, client.init() was guaranteed to return true and the assertion always passed. After the change, it became possible for it to return false, and the assertion failed. In this test, I think it actually should fail because the test doesn't instantiate a server for the client to connect to. Without a server, the client initialization fails. This connection failure was already happening before the bug fix, but it was suppressed because cluster_connect in rpc::tcp_client::init was forced to always return true.
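
Assuming the fix is applied, the natural adjustment (the exact change made in the PR may differ) is for send_fail_test to expect initialization to fail, since no server is listening on those ports:

    auto client = cbdc::rpc::tcp_client<request, response>(
        {{cbdc::network::localhost, 55555},
         {cbdc::network::localhost, 55556}});
    // With the fix, the connection failure now propagates out of init():
    ASSERT_FALSE(client.init());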

@mszulcz-mitre
Contributor Author

mszulcz-mitre commented Jun 19, 2022

I made the proposed change in Pull Request #135. In addition to the bug fix itself, the PR modifies the unit test that failed so that it accounts for the new behavior expected after the fix. Finally, the PR adds a new unit test that triggers the error that was previously impossible to reach.

Right now the PR is a draft. It's passed all the checks, so if you agree with the changes we're discussing, I'll click "Ready for review".
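
For context, the new test could look roughly like the sketch below. The fixture and member names are hypothetical stand-ins, but the idea follows the SetUp change described earlier: point the sentinel at an unresolvable coordinator host and expect init() to fail.

    // Sketch only: the fixture name and members are hypothetical; assume
    // SetUp() configured the controller with the unresolvable coordinator
    // endpoint shown earlier.
    TEST_F(sentinel_2pc_test, init_fails_when_coordinator_client_fails) {
        // The coordinator client cannot connect, so with the fix the
        // previously unreachable error branch is now taken.
        ASSERT_FALSE(m_ctl.init());
    }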

HalosGhost linked a pull request Aug 23, 2022 that will close this issue
HalosGhost pushed a commit that referenced this issue Aug 23, 2022
sentinel_2pc::controller::init contains a bug that makes it impossible to
trigger the error "Failed to start coordinator client" (see GitHub Issue #131).
This commit fixes the bug using the method described in Issue #131.  It
adds a new unit test to test triggering the error, and also modifies a
different unit test to account for behavior that's expected after the bug fix.

Signed-off-by: Michael L. Szulczewski <[email protected]>