Part of "Run the Code" section in README.md appears broken #186
Comments
@ykaravas repeated the steps above and also observed that the send-currency step hangs.
I've also observed this recently when running in a non-replicated configuration. This issue does not seem to occur when running RAFT components in a replicated configuration.
I can also confirm running into this. @kylecrawshaw it's interesting that it doesn't occur in replicated configurations. Does anyone have any log output that would suggest where the issue is? At least for me, it seems to result in the components hanging as well (so no logs following the hang); is that true for everyone else?
CC @metalicjames and @wadagso-gertjaap. The commit that appears to have broken this process (b9a2597) really only has two effects: changing the level at which a lot of init-failure messages are logged, and removing early- It seems fair to rule out the log-level being related. So, the most likely issue is that one of the components is failing to init properly, but doesn't recover (in a non-replicated environment), resulting in the send step hanging. In a strange development, this issue is also not reproducible for me using the pre-built docker images from ghcr.io, only locally-built images. Digging in a bit deeper, I was finally able to see the specific failure:
Namely, the sentinel is failing to connect to the coordinator and at this point prints nothing further. Because the early- This is also supported by the fact that the Should this be another place where we put an explicit init-retry to give the sentinel multiple chances to set up its coordinator client correctly?
@HalosGhost Allowing the sentinel controller to retry initializing the coordinator client resolves the hang:

    static constexpr auto retry_delay = std::chrono::seconds(1);
    // Keep retrying until the coordinator client initializes successfully.
    while(!m_coordinator_client.init()) {
        m_logger->warn("Failed to start coordinator client. Retrying...");
        std::this_thread::sleep_for(retry_delay);
    }

Here's the result:
We probably don't want infinite-retry… though maybe that's actually the right choice. In other places, we tend to put a delayed-retry loop (just like you have above), but with a retry-threshold (after which it would probably fatally-error out). @metalicjames thoughts?
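For illustration, a delayed-retry loop with a retry threshold could be sketched as below. This is only a sketch under stated assumptions: `retry_init`, `max_attempts`, and the `std::function` wrapper are hypothetical names chosen for the example and are not taken from the project's code.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <iostream>
#include <thread>

// Hypothetical helper (not part of the project): calls `init` until it
// succeeds or `max_attempts` calls have been made, sleeping `delay`
// between failed attempts. Returns true if any attempt succeeded.
inline bool retry_init(const std::function<bool()>& init,
                       unsigned max_attempts,
                       std::chrono::milliseconds delay) {
    assert(max_attempts > 0);
    for(unsigned attempt = 1; attempt <= max_attempts; ++attempt) {
        if(init()) {
            return true;
        }
        std::cerr << "init attempt " << attempt << " of " << max_attempts
                  << " failed\n";
        // No point sleeping after the final failed attempt.
        if(attempt < max_attempts) {
            std::this_thread::sleep_for(delay);
        }
    }
    return false;
}
```

A caller could treat a `false` return as the fatal case, for example by logging an error and exiting, which is the "fatally-error out" behaviour mentioned above.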
This issue seems related to Issue #127, Issue #158, and Pull Request #157. All involve the question, "What should be done when a controller fails to initialize a component: log a warning and keep going or retry the initialization some number of times?". Retrying seems to solve the problem here and in Issue #158--and potentially at least one aspect of #127. But I don't know enough about all the code to know whether it's a reasonable approach for most or all initialization attempts, or if the approach has to be decided case by case. Since those Issues have been active for a while and are relevant, maybe it's useful to have a broader discussion here.
I'm also unsure about this. For system stability, it probably makes sense for most initialization to use a delayed retry with a threshold (potentially with exponential back-off for the delay). But we also don't want to spend too much time on this, since its main value for us is getting the system stable for testing (production-grade use-cases aren't necessarily the focus).
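As a rough sketch of the exponential back-off variant, the delay can double after each failed attempt up to a cap. The names `backoff_delay`, `retry_with_backoff`, `base`, and `cap` are hypothetical and exist only for this example; nothing here is taken from the project's code.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: delay to wait after failed attempt number
// `attempt` (0-based). Starts at `base`, doubles each time, and never
// exceeds `cap`.
inline std::chrono::milliseconds backoff_delay(unsigned attempt,
                                               std::chrono::milliseconds base,
                                               std::chrono::milliseconds cap) {
    assert(base.count() > 0 && cap >= base);
    auto delay = base;
    // Stop doubling once the cap is reached so the value cannot grow unbounded.
    for(unsigned i = 0; i < attempt && delay < cap; ++i) {
        delay *= 2;
    }
    return std::min(delay, cap);
}

// Hypothetical helper: tries `init` at most `max_attempts` times, sleeping
// for an exponentially growing delay between failed attempts.
inline bool retry_with_backoff(const std::function<bool()>& init,
                               unsigned max_attempts,
                               std::chrono::milliseconds base,
                               std::chrono::milliseconds cap) {
    for(unsigned attempt = 0; attempt < max_attempts; ++attempt) {
        if(init()) {
            return true;
        }
        if(attempt + 1 < max_attempts) {
            std::this_thread::sleep_for(backoff_delay(attempt, base, cap));
        }
    }
    return false;
}
```

The cap keeps the worst-case wait predictable, which matters if the goal is mainly a stable test setup rather than production-grade resilience.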
I had the same issue and had to change the controller code to support retries, as @mszulcz-mitre proposed earlier.
Yeah, this has sat for longer than I'm comfortable with. I'll open a PR sometime in the next day or so (@mszulcz-mitre you're welcome to open it if you'd like, but you'll be credited as a co-author if not) to get this merged in.
Affected Branch
trunk
Basic Diagnostics
I've pulled the latest changes on the affected branch and the issue is still present.
The issue is reproducible in docker
Description
In the "Run the Code" section of README.md, the following step hangs:
where <wallet address> is the address returned from running the client-cli with the newaddress keyword.
I followed the instructions in the sections "Launch the System" and "Setup test wallets and test them" for the 2PC architecture. The instructions are: