[BFT] Investigate why consensus nodes fail DKG #6729

Open
durkmurder opened this issue Nov 15, 2024 · 1 comment

@durkmurder
Member

Context

While addressing review comments for #6632, specifically #6632 (comment), I wasn't able to resolve this without further time investment.

Leaving some details for future investigation:

Reference commit: 14e37fa

This doesn't happen every time, but I can reproduce it quite reliably simply by re-running the TestRecoverEpoch integration test.

The first DKG fails as planned because we stop a collection node. After entering the recovery epoch, we expect the DKG to succeed so that we can enter the epoch after recovery. In this test setup, epochs are numbered as follows:

  • 0 - starting epoch which enters EFM
  • 1 - recovery epoch
  • 2 - epoch after recovery.

The DKG process always starts when the setup phase for epoch 1 begins; the problem manifests itself later. Some error logs from the DKG:

This log is always reported when the DKG fails, and it is reported by every node.

{"level":"error","node_role":"consensus","node_id":"3ec567126474f024f2fff57b1e5e626e803ec56f3664c10ef8e49360dafc40ba","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-1","error":"wrong DKG instance. Got dkg-flow-localnet-2, want dkg-flow-localnet-1","time":"2024-11-15T17:13:15.06699973Z","message":"bad message"}

I have often seen such logs from all nodes; it seems the nodes are flagging each other:

{"level":"warn","node_role":"consensus","node_id":"086ee2bbea1aa36181523595acffd096ce7557aca80bde6ee2e5be1bae690b91","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-2","time":"2024-11-15T17:13:22.200603269Z","message":"participant 0 (this node) is flagging participant (index=2, node_id=4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b) because: building a complaint"}

Flagging a participant leads to the following log:

{"level":"warn","node_role":"consensus","node_id":"4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-2","time":"2024-11-15T17:13:30.513419339Z","message":"participant 2 (this node) is disqualifying participant (index=0, node_id=086ee2bbea1aa36181523595acffd096ce7557aca80bde6ee2e5be1bae690b91) because: there are 2 complaints, they exceeded the threshold 1"}

Additionally, there is extra flagging:

{"level":"warn","node_role":"consensus","node_id":"086ee2bbea1aa36181523595acffd096ce7557aca80bde6ee2e5be1bae690b91","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-2","time":"2024-11-15T17:13:33.278872521Z","message":"participant 0 (this node) is flagging participant (index=1, node_id=3ec567126474f024f2fff57b1e5e626e803ec56f3664c10ef8e49360dafc40ba) because: complaint received after the complaint timeout"}

In the end, each node reports the same error:

{"level":"warn","node_role":"consensus","node_id":"4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b","engine":"dkg_reactor","error":"Joint-Feldman failed because the disqualified participants number is high: 2 disqualified, threshold is 1, size is 3","time":"2024-11-15T17:13:38.790678258Z","message":"node 4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b with index 2 failed DKG locally"}

This leads to submitting an empty DKG result:

{"level":"warn","node_role":"consensus","node_id":"4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-2","time":"2024-11-15T17:13:38.791592325Z","message":"submitting empty dkg result because I completed the DKG unsuccessfully"}

Possible reasons for the failed DKG, based on the logs:

  • The collection node doesn't catch up after the pause
  • The previously failed DKG interferes with node state and prevents performing the DKG in the next epoch
  • Invalid test configuration

Definition of done

  • Investigate the reason for the DKG failing after a failed DKG in the previous epoch
@durkmurder durkmurder added Bug Something isn't working Preserve Stale Bot repellent S-BFT S-Consensus labels Nov 15, 2024
@durkmurder durkmurder self-assigned this Nov 15, 2024
@jordanschalm
Member

jordanschalm commented Nov 15, 2024

Of these errors, I am most concerned about:
{"level":"error","node_role":"consensus","node_id":"3ec567126474f024f2fff57b1e5e626e803ec56f3664c10ef8e49360dafc40ba","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-1","error":"wrong DKG instance. Got dkg-flow-localnet-2, want dkg-flow-localnet-1","time":"2024-11-15T17:13:15.06699973Z","message":"bad message"}

This indicates that:

  • two Consensus Nodes are simultaneously running different instances of the DKG engine (one for epoch 1, one for epoch 2)
  • a DKG message for epoch 2 was routed to an engine operating the DKG for epoch 1

I can see two possible reasons for this:

1. The receiver Consensus Node has a lagging local state.

  • This is an expected possibility in general, but unlikely/impossible to be the case here. Because there are very few Consensus Nodes in the test case (I believe 3), in practice all would need to be online to maintain liveness and a consistent block rate. So I assume that all Consensus Nodes are synchronized on the latest view.
  • (Maybe it is possible that the asynchronous event handlers are lagging on one node, despite it being synchronized on the latest view? Seems unlikely though.)

2. The receiver Consensus Node failed to properly tear down its DKG engine for epoch 1 (even after the DKG for epoch 2 had already started)

  • If true, this is dangerous because an old "hanging" DKG engine could process messages intended for a newer DKG (a minimal sketch of this failure mode follows the list):
    • The sender would think the message was successfully delivered, and would not re-send it.
    • The "hanging" engine would discard the message, because it is for the wrong DKG instance.
    • The correct DKG instance would never receive the message.
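
Here is a minimal sketch of that swallowed-message scenario, with hypothetical names (not flow-go code): each inbound message is consumed by exactly one engine, so a stale epoch-1 broker that was never shut down can swallow an epoch-2 message.

```go
package dkgsketch

import "errors"

// engine stands in for a DKG broker bound to a single instance ID.
type engine struct{ instanceID string }

func (e *engine) process(msgInstanceID string) error {
	if msgInstanceID != e.instanceID {
		// the message is rejected here and never retried by the sender
		return errors.New("wrong DKG instance")
	}
	return nil
}

// deliver hands each message to exactly one registered engine. If the epoch-1
// engine is still registered because it was never torn down, it may be the
// one that receives (and drops) a message meant for the epoch-2 engine.
func deliver(registered []*engine, msgInstanceID string) {
	if len(registered) == 0 {
		return
	}
	_ = registered[0].process(msgInstanceID) // a stale engine may sit at index 0
}
```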

Suspected Cause

After looking through the code, I suspect Controller.Shutdown() -- which tears down the DKG instance broker -- is not being called when we enter, then recover from EFM.
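
If that suspicion is correct, the fix direction would be to make the EFM entry/recovery path tear down the stale controller explicitly. A rough sketch of the idea, with hypothetical names (only Controller.Shutdown() is taken from the comment above):

```go
package dkgsketch

// controller stands in for the DKG controller; only Shutdown() is referenced
// above, the rest of these names are hypothetical.
type controller struct{ instanceID string }

func (c *controller) Shutdown() {}
func (c *controller) Run()      {}

func newController(id string) *controller { return &controller{instanceID: id} }

type reactorEngine struct{ current *controller }

// startDKG sketches the teardown that appears to be missing: before the
// recovery epoch's DKG instance starts, any controller left over from the
// epoch that entered EFM must be shut down so its broker stops consuming
// (and discarding) messages addressed to the new instance.
func (e *reactorEngine) startDKG(newInstanceID string) {
	if e.current != nil && e.current.instanceID != newInstanceID {
		e.current.Shutdown()
	}
	e.current = newController(newInstanceID)
	go e.current.Run()
}
```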
