Part of "Run the Code" section in README.md appears broken #186

mszulcz-mitre · 2022-09-27T18:25:38Z

Affected Branch

trunk

Basic Diagnostics

I've pulled the latest changes on the affected branch and the issue is still present.
The issue is reproducible in docker

Description

In the "Run the Code" section of README.md, the following step hangs:

Send currency from the first wallet to the second wallet created in the previous step (e.g., 30 atomic units of currency)
# ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat send 30 <wallet address>

where <wallet address> is the address returned from running the client-cli with the newaddress keyword.

I followed the instructions in the sections "Launch the System" and "Setup test wallets and test them" for the 2PC architecture. The instructions are:

Run the System
# docker compose --file docker-compose-2pc.yml up --build
Launch a container in which to run wallet commands (use --network atomizer-network instead of --network 2pc-network if using the atomizer architecture)
# docker run --network 2pc-network -ti opencbdc-tx /bin/bash
Mint new coins (e.g., 10 new UTXOs each with a value of 5 atomic units of currency)
# ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat mint 10 5
[2021-08-17 15:11:57.686] [WARN ] Existing wallet file not found
[2021-08-17 15:11:57.686] [WARN ] Existing mempool not found
4bc23da407c3a8110145c5b6c38199c8ec3b0e35ea66bbfd78f0ed65304ce6fa
Inspect the balance of a wallet
# ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat info
Balance: $0.50, UTXOs: 10, pending TXs: 0
Make a new wallet
# ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool1.dat wallet1.dat newaddress
[2021-08-17 15:13:16.148] [WARN ] Existing wallet file not found
[2021-08-17 15:13:16.148] [WARN ] Existing mempool not found
usd1qrw038lx5n4wxx3yvuwdndpr7gnm347d6pn37uywgudzq90w7fsuk52kd5u
Send currency from the first wallet to the second wallet created in the previous step (e.g., 30 atomic units of currency)
# ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat send 30 usd1qrw038lx5n4wxx3yvuwdndpr7gnm347d6pn37uywgudzq90w7fsuk52kd5u

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

mszulcz-mitre · 2022-09-27T18:26:51Z

@ykaravas repeated the steps above and also observed that the send-currency step hangs.

kylecrawshaw · 2022-09-27T18:41:58Z

I've also observed this recently when running in a non-replicated configuration. This issue does not seem to occur when running RAFT components in a replicated configuration.

HalosGhost · 2022-09-27T20:09:45Z

I can also confirm running into this. @kylecrawshaw it's interesting that it doesn't occur in replicated configurations. Does anyone have any log output that would suggest where the issue is? At least for me, it seems to result in the components hanging as well (so no logs following the hang); is that true for everyone else?

mszulcz-mitre · 2022-09-27T20:46:27Z

This commit seems to have caused the break:
b9a2597

This previous commit is the last one for which the send-currency step succeeds:
cb0e78d

HalosGhost · 2022-09-29T16:26:32Z

CC @metalicjames and @wadagso-gertjaap.

The commit that appears to have broken this process (b9a2597) really only has two effects: changing the level at which a lot of init-failure messages are logged, and removing early-returns which caused components to shutdown.

It seems fair to rule out the log-level being related. So, the most likely issue is that one of the components is failing to init properly, but doesn't recover (in a non-replicated environment), resulting in the send step hanging.

In a strange development, this issue is also not reproducible for me using the pre-built docker images from ghcr.io, only locally-built images.

Digging in a bit deeper, I was finally able to see the specific failure:

tx-sentinel0-1     | [2022-09-29 16:10:29.923] [WARN ] Failed to start coordinator client

Namely, the sentinel is failing to connect to the coordinator and at this point prints nothing further. Because the early-returns were removed, this failure leads to the system no longer being able to handle transactions at all because the sentinel (despite not fatally-erroring out) is non-functional.

This is also supported by the fact that the mint transaction still completes successfully (which, at the moment, bypasses the sentinel layer entirely). The specific failure appears to be this conditional (given the error message). However, reverting just that conditional does not result in a working environment (because now the sentinel just fatally-errors, which isn't surprising).

Should this be another place where we put an explicit init-retry to give the sentinel multiple chances to setup its coordinator client correctly?

mszulcz-mitre · 2022-09-29T17:29:37Z

@HalosGhost Allowing the sentinel controller to retry initializing the coordinator client makes the send command work as expected. Here's the code I used to replace the conditional you pointed out:

        static constexpr auto retry_delay = std::chrono::seconds(1);
        while(!m_coordinator_client.init()) {
            m_logger->warn("Failed to start coordinator client.  Retrying...");
            std::this_thread::sleep_for(retry_delay);
        }

Here's the result:

root@da310df40f5a:/opt/tx-processor# ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat mint 10 5 
[2022-09-29 17:16:57.011] [WARN ] Existing wallet file not found
[2022-09-29 17:16:57.011] [WARN ] Existing client file not found
4d5375705720de249b700cacb28fc6217141a77ca85d728f79754590d2a46b29
root@da310df40f5a:/opt/tx-processor# ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat info
Balance: $0.50, UTXOs: 10, pending TXs: 0
root@da310df40f5a:/opt/tx-processor# ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool1.dat wallet1.dat newaddress
[2022-09-29 17:17:43.320] [WARN ] Existing wallet file not found
[2022-09-29 17:17:43.321] [WARN ] Existing client file not found
usd1qz3mm8x3rgzdstcpuvmlh22270qvtg2laxxd2pxn2rwyvnjs9s9tknj33nl
root@da310df40f5a:/opt/tx-processor# ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat send 30 usd1qz3mm8x3rgzdstcpuvmlh22270qvtg2laxxd2pxn2rwyvnjs9s9tknj33nl
tx_id:
de7bff2a620231a313f9c2ac940f721d6ecdc8ad65e3c16dcd2ca030c2a8d29a
Data for recipient importinput:
de7bff2a620231a313f9c2ac940f721d6ecdc8ad65e3c16dcd2ca030c2a8d29a00000000000000006cb2356962cbae73edfcb137e7fef19ced8f67342f0fb09b6ae970681702ee1e1e00000000000000
Sentinel responded: Confirmed

HalosGhost · 2022-09-29T18:19:06Z

We probably don't want infinite-retry… though maybe that's actually the right choice. In other places, we tend to put a delayed-retry loop (just like you have above), but with a retry-threshold (after which it would probably fatally-error out).

@metalicjames thoughts?

mszulcz-mitre · 2022-09-30T23:32:09Z

This issue seems related to Issue #127, Issue #158, and Pull Request #157. All involve the question, "What should be done when a controller fails to initialize a component: log a warning and keep going or retry the initialization some number of times?". Retrying seems to solve the problem here and in Issue #158--and potentially at least one aspect of #127. But I don't know enough about all the code to know whether it's a reasonable approach for most or all initialization attempts, or if the approach has to be decided case by case. Since those Issues have been active for a while and are relevant, maybe it's useful to have a broader discussion here.

HalosGhost · 2022-10-19T16:30:12Z

I'm also unsure about this. I think, for system stability, it probably makes sense for most initialization to use a delayed-retry-with-threshold (potentially using exponential back-off for the delay). But, we also don't want to spend too much time on this since it's really only useful for us to spend cycles on to get the system stable for testing (production-grade use-cases aren't necessarily the focus).

karlovskiy · 2022-12-05T16:13:19Z

I had same issue and have to change controller code to support retries as @mszulcz-mitre proposed earlier.

HalosGhost · 2022-12-05T16:20:18Z

Yeah, this has sat for longer than I'm comfortable with. I'll open a PR sometime in the next day or so (@mszulcz-mitre you're welcome to open it if you'd like, but you'll be credited as a co-author if not) to get this merged in.

mszulcz-mitre added the fix/bug Fixes errant behavior label Sep 27, 2022

mszulcz-mitre assigned HalosGhost Sep 27, 2022

HalosGhost mentioned this issue Dec 5, 2022

fix: Retry sentinel startup on failure #200

Merged

HalosGhost closed this as completed in #200 Dec 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Part of "Run the Code" section in README.md appears broken #186

Part of "Run the Code" section in README.md appears broken #186

mszulcz-mitre commented Sep 27, 2022

mszulcz-mitre commented Sep 27, 2022 •

edited

Loading

kylecrawshaw commented Sep 27, 2022

HalosGhost commented Sep 27, 2022

mszulcz-mitre commented Sep 27, 2022

HalosGhost commented Sep 29, 2022

mszulcz-mitre commented Sep 29, 2022 •

edited

Loading

HalosGhost commented Sep 29, 2022

mszulcz-mitre commented Sep 30, 2022

HalosGhost commented Oct 19, 2022

karlovskiy commented Dec 5, 2022

HalosGhost commented Dec 5, 2022

Part of "Run the Code" section in README.md appears broken #186

Part of "Run the Code" section in README.md appears broken #186

Comments

mszulcz-mitre commented Sep 27, 2022

Affected Branch

Basic Diagnostics

Description

Code of Conduct

mszulcz-mitre commented Sep 27, 2022 • edited Loading

kylecrawshaw commented Sep 27, 2022

HalosGhost commented Sep 27, 2022

mszulcz-mitre commented Sep 27, 2022

HalosGhost commented Sep 29, 2022

mszulcz-mitre commented Sep 29, 2022 • edited Loading

HalosGhost commented Sep 29, 2022

mszulcz-mitre commented Sep 30, 2022

HalosGhost commented Oct 19, 2022

karlovskiy commented Dec 5, 2022

HalosGhost commented Dec 5, 2022

mszulcz-mitre commented Sep 27, 2022 •

edited

Loading

mszulcz-mitre commented Sep 29, 2022 •

edited

Loading