Part of "Run the Code" section in README.md appears broken #186

Closed
mszulcz-mitre opened this issue Sep 27, 2022 · 11 comments · Fixed by #200
Labels
fix/bug Fixes errant behavior

Comments

@mszulcz-mitre
Contributor

Affected Branch

trunk

Basic Diagnostics

  • I've pulled the latest changes on the affected branch and the issue is still present.

  • The issue is reproducible in Docker.

Description

In the "Run the Code" section of README.md, the following step hangs:

  • Send currency from the first wallet to the second wallet created in the previous step (e.g., 30 atomic units of currency)
    # ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat send 30 <wallet address>
    

where <wallet address> is the address returned from running the client-cli with the newaddress keyword.

I followed the instructions in the sections "Launch the System" and "Setup test wallets and test them" for the 2PC architecture. The instructions are:

  1. Run the System
    # docker compose --file docker-compose-2pc.yml up --build
    
  2. Launch a container in which to run wallet commands (use --network atomizer-network instead of --network 2pc-network if using the atomizer architecture)
    # docker run --network 2pc-network -ti opencbdc-tx /bin/bash
    
  • Mint new coins (e.g., 10 new UTXOs each with a value of 5 atomic units of currency)

    # ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat mint 10 5
    [2021-08-17 15:11:57.686] [WARN ] Existing wallet file not found
    [2021-08-17 15:11:57.686] [WARN ] Existing mempool not found
    4bc23da407c3a8110145c5b6c38199c8ec3b0e35ea66bbfd78f0ed65304ce6fa
    
  • Inspect the balance of a wallet

    # ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat info
    Balance: $0.50, UTXOs: 10, pending TXs: 0
    
  • Make a new wallet

    # ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool1.dat wallet1.dat newaddress
    [2021-08-17 15:13:16.148] [WARN ] Existing wallet file not found
    [2021-08-17 15:13:16.148] [WARN ] Existing mempool not found
    usd1qrw038lx5n4wxx3yvuwdndpr7gnm347d6pn37uywgudzq90w7fsuk52kd5u
    
  • Send currency from the first wallet to the second wallet created in the previous step (e.g., 30 atomic units of currency)

    # ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat send 30 usd1qrw038lx5n4wxx3yvuwdndpr7gnm347d6pn37uywgudzq90w7fsuk52kd5u
    

Code of Conduct

  • I agree to follow this project's Code of Conduct
mszulcz-mitre added the fix/bug Fixes errant behavior label Sep 27, 2022
@mszulcz-mitre
Contributor Author

mszulcz-mitre commented Sep 27, 2022

@ykaravas repeated the steps above and also observed that the send-currency step hangs.

@kylecrawshaw
Contributor

I've also observed this recently when running in a non-replicated configuration. This issue does not seem to occur when running RAFT components in a replicated configuration.

@HalosGhost
Collaborator

I can also confirm running into this. @kylecrawshaw it's interesting that it doesn't occur in replicated configurations. Does anyone have any log output that would suggest where the issue is? At least for me, it seems to result in the components hanging as well (so no logs following the hang); is that true for everyone else?

@mszulcz-mitre
Contributor Author

This commit seems to have caused the break:
b9a2597

This previous commit is the last one for which the send-currency step succeeds:
cb0e78d

@HalosGhost
Collaborator

CC @metalicjames and @wadagso-gertjaap.

The commit that appears to have broken this process (b9a2597) really only has two effects: changing the level at which a lot of init-failure messages are logged, and removing early-returns which caused components to shut down.

It seems fair to rule out the log-level being related. So, the most likely issue is that one of the components is failing to init properly, but doesn't recover (in a non-replicated environment), resulting in the send step hanging.

In a strange development, this issue is also not reproducible for me using the pre-built docker images from ghcr.io, only locally-built images.

Digging in a bit deeper, I was finally able to see the specific failure:

tx-sentinel0-1     | [2022-09-29 16:10:29.923] [WARN ] Failed to start coordinator client

Namely, the sentinel is failing to connect to the coordinator and prints nothing further after this point. Because the early-returns were removed, this failure leaves the sentinel non-functional (despite not fatally-erroring out), so the system can no longer handle transactions at all.

This is also supported by the fact that the mint transaction still completes successfully (which, at the moment, bypasses the sentinel layer entirely). The specific failure appears to be this conditional (given the error message). However, reverting just that conditional does not result in a working environment (because now the sentinel just fatally-errors, which isn't surprising).
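
For reference, the pattern in question probably looks roughly like the sketch below; this is reconstructed from the log message and the commit description, not copied from the project source:

    // Rough sketch of the sentinel controller's init path after b9a2597:
    // the early return on a failed coordinator-client init was removed, so
    // the failure only logs a warning and initialization keeps going with a
    // non-functional client.
    if(!m_coordinator_client.init()) {
        m_logger->warn("Failed to start coordinator client");
        // previously: return false;  (early return removed in b9a2597)
    }
    // ...remaining initialization proceeds regardless...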

Should this be another place where we put an explicit init-retry to give the sentinel multiple chances to setup its coordinator client correctly?

@mszulcz-mitre
Contributor Author

mszulcz-mitre commented Sep 29, 2022

@HalosGhost Allowing the sentinel controller to retry initializing the coordinator client makes the send command work as expected. Here's the code I used to replace the conditional you pointed out:

        // Retry coordinator client init until it succeeds, waiting one
        // second between attempts (uses <chrono> and <thread>).
        static constexpr auto retry_delay = std::chrono::seconds(1);
        while(!m_coordinator_client.init()) {
            m_logger->warn("Failed to start coordinator client.  Retrying...");
            std::this_thread::sleep_for(retry_delay);
        }

Here's the result:

root@da310df40f5a:/opt/tx-processor# ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat mint 10 5 
[2022-09-29 17:16:57.011] [WARN ] Existing wallet file not found
[2022-09-29 17:16:57.011] [WARN ] Existing client file not found
4d5375705720de249b700cacb28fc6217141a77ca85d728f79754590d2a46b29
root@da310df40f5a:/opt/tx-processor# ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat info
Balance: $0.50, UTXOs: 10, pending TXs: 0
root@da310df40f5a:/opt/tx-processor# ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool1.dat wallet1.dat newaddress
[2022-09-29 17:17:43.320] [WARN ] Existing wallet file not found
[2022-09-29 17:17:43.321] [WARN ] Existing client file not found
usd1qz3mm8x3rgzdstcpuvmlh22270qvtg2laxxd2pxn2rwyvnjs9s9tknj33nl
root@da310df40f5a:/opt/tx-processor# ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat send 30 usd1qz3mm8x3rgzdstcpuvmlh22270qvtg2laxxd2pxn2rwyvnjs9s9tknj33nl
tx_id:
de7bff2a620231a313f9c2ac940f721d6ecdc8ad65e3c16dcd2ca030c2a8d29a
Data for recipient importinput:
de7bff2a620231a313f9c2ac940f721d6ecdc8ad65e3c16dcd2ca030c2a8d29a00000000000000006cb2356962cbae73edfcb137e7fef19ced8f67342f0fb09b6ae970681702ee1e1e00000000000000
Sentinel responded: Confirmed

@HalosGhost
Collaborator

We probably don't want infinite-retry… though maybe that's actually the right choice. In other places, we tend to put a delayed-retry loop (just like you have above), but with a retry-threshold (after which it would probably fatally-error out).
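
A minimal sketch of that shape (the constants and the error-out path are illustrative placeholders, not values or behavior taken from the codebase):

    // Delayed retry with a threshold; retry_delay and max_retries are
    // placeholders. Assumes <chrono> and <thread> are available and that the
    // surrounding controller init routine returns a bool.
    static constexpr auto retry_delay = std::chrono::seconds(1);
    static constexpr size_t max_retries = 30;
    size_t attempts{0};
    while(!m_coordinator_client.init()) {
        if(++attempts >= max_retries) {
            m_logger->error("Failed to start coordinator client; giving up.");
            return false; // fatally error out of controller init
        }
        m_logger->warn("Failed to start coordinator client. Retrying...");
        std::this_thread::sleep_for(retry_delay);
    }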

@metalicjames thoughts?

@mszulcz-mitre
Contributor Author

This issue seems related to Issue #127, Issue #158, and Pull Request #157. All involve the question, "What should be done when a controller fails to initialize a component: log a warning and keep going or retry the initialization some number of times?". Retrying seems to solve the problem here and in Issue #158--and potentially at least one aspect of #127. But I don't know enough about all the code to know whether it's a reasonable approach for most or all initialization attempts, or if the approach has to be decided case by case. Since those Issues have been active for a while and are relevant, maybe it's useful to have a broader discussion here.

@HalosGhost
Collaborator

I'm also unsure about this. I think, for system stability, it probably makes sense for most initialization to use a delayed-retry-with-threshold (potentially using exponential back-off for the delay). But we also don't want to spend too much time on this, since the main value of the effort is getting the system stable for testing (production-grade use-cases aren't necessarily the focus).
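
As an illustration, an exponential back-off variant of the same retry loop might look like this (all constants are placeholders, and it assumes the same surroundings as the snippet above):

    // Exponential back-off with a capped delay and an attempt limit;
    // constants are illustrative only. Assumes <algorithm>, <chrono>,
    // and <thread>.
    auto delay = std::chrono::milliseconds(100);
    constexpr auto max_delay = std::chrono::milliseconds(10000);
    constexpr size_t max_attempts = 20;
    for(size_t attempt = 1; !m_coordinator_client.init(); attempt++) {
        if(attempt >= max_attempts) {
            m_logger->error("Coordinator client init failed; giving up.");
            return false;
        }
        m_logger->warn("Coordinator client init failed. Retrying...");
        std::this_thread::sleep_for(delay);
        delay = std::min(delay * 2, max_delay);
    }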

@karlovskiy

I had the same issue and had to change the controller code to support retries, as @mszulcz-mitre proposed earlier.

@HalosGhost
Collaborator

Yeah, this has sat for longer than I'm comfortable with. I'll open a PR sometime in the next day or so (@mszulcz-mitre you're welcome to open it if you'd like, but you'll be credited as a co-author if not) to get this merged in.
