Context
In the scope of addressing review comments for #6632, specifically #6632 (comment), I wasn't able to resolve this without further time investment. Leaving some details for future investigation.
Reference commit: 14e37fa
This doesn't happen every time, but I can reproduce it fairly reliably simply by re-running the TestRecoverEpoch integration test.
The first DKG fails as planned because we stop a collection node. After entering the recovery epoch, we expect the DKG to succeed and the network to be able to enter the epoch after the recovery. In this test setup, epochs are numbered as follows:
0 - starting epoch which enters EFM
1 - recovery epoch
2 - epoch after recovery.
The DKG process always starts when the setup phase for epoch 1 begins; the problem manifests itself later.
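For context, here is a minimal sketch of how a per-epoch DKG controller could be started on the EpochSetup phase transition; the types, hook, and wiring below are hypothetical illustrations, not the actual flow-go API:

```go
package main

import "fmt"

// Hypothetical illustration (not the actual flow-go API): the DKG reactor
// subscribes to protocol-state phase transitions and starts a fresh DKG
// controller when the setup phase for the next epoch begins.
type EpochPhase int

const (
	PhaseStaking EpochPhase = iota
	PhaseSetup
	PhaseCommitted
)

type Controller struct{ instanceID string }

func (c *Controller) Run() { fmt.Println("running DKG instance", c.instanceID) }

type Reactor struct{ current *Controller }

// OnPhaseTransition starts a new controller for the upcoming epoch as soon
// as the setup phase begins.
func (r *Reactor) OnPhaseTransition(phase EpochPhase, nextEpoch uint64) {
	if phase != PhaseSetup {
		return
	}
	r.current = &Controller{instanceID: fmt.Sprintf("dkg-flow-localnet-%d", nextEpoch)}
	r.current.Run()
}

func main() {
	r := &Reactor{}
	r.OnPhaseTransition(PhaseSetup, 1) // DKG for the recovery epoch starts here
}
```

Some error logs observed during the DKG: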
This log is always reported when the DKG fails, and it is reported by every node.
Often I have seen such logs from all nodes; it seems like the nodes are flagging each other:
```json
{"level":"warn","node_role":"consensus","node_id":"086ee2bbea1aa36181523595acffd096ce7557aca80bde6ee2e5be1bae690b91","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-2","time":"2024-11-15T17:13:22.200603269Z","message":"participant 0 (this node) is flagging participant (index=2, node_id=4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b) because: building a complaint"}
```
Flagging a participant leads to the next log:
```json
{"level":"warn","node_role":"consensus","node_id":"4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-2","time":"2024-11-15T17:13:30.513419339Z","message":"participant 2 (this node) is disqualifying participant (index=0, node_id=086ee2bbea1aa36181523595acffd096ce7557aca80bde6ee2e5be1bae690b91) because: there are 2 complaints, they exceeded the threshold 1"}
```
Additionally, there is extra flagging:
```json
{"level":"warn","node_role":"consensus","node_id":"086ee2bbea1aa36181523595acffd096ce7557aca80bde6ee2e5be1bae690b91","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-2","time":"2024-11-15T17:13:33.278872521Z","message":"participant 0 (this node) is flagging participant (index=1, node_id=3ec567126474f024f2fff57b1e5e626e803ec56f3664c10ef8e49360dafc40ba) because: complaint received after the complaint timeout"}
```
In the end, each node reports the same error:
```json
{"level":"warn","node_role":"consensus","node_id":"4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b","engine":"dkg_reactor","error":"Joint-Feldman failed because the disqualified participants number is high: 2 disqualified, threshold is 1, size is 3","time":"2024-11-15T17:13:38.790678258Z","message":"node 4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b with index 2 failed DKG locally"}
```
This leads to submitting an empty DKG result:
```json
{"level":"warn","node_role":"consensus","node_id":"4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-2","time":"2024-11-15T17:13:38.791592325Z","message":"submitting empty dkg result because I completed the DKG unsuccessfully"}
```
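Read together, these logs trace the Joint-Feldman complaint flow: flagging produces complaints, a participant with more complaints than the threshold is disqualified, and once more participants are disqualified than the threshold allows, the local DKG fails and an empty result is submitted. A minimal sketch of that accounting, with illustrative names (the real logic lives in the crypto layer, not in code like this):

```go
package main

import "fmt"

// Simplified model of the disqualification accounting visible in the logs;
// names and structure are illustrative, not the real implementation.
type dkgState struct {
	size         int         // number of participants (3 in this test)
	threshold    int         // complaint/disqualification threshold (1 here)
	complaints   map[int]int // participant index -> complaints received
	disqualified map[int]bool
}

// complain records a complaint and disqualifies the target once the count
// exceeds the threshold ("there are 2 complaints, they exceeded the threshold 1").
func (s *dkgState) complain(target int) {
	s.complaints[target]++
	if s.complaints[target] > s.threshold {
		s.disqualified[target] = true
	}
}

// failed mirrors the final check from the logs: the DKG fails locally when
// the number of disqualified participants exceeds the threshold.
func (s *dkgState) failed() bool {
	return len(s.disqualified) > s.threshold
}

func main() {
	s := &dkgState{size: 3, threshold: 1, complaints: map[int]int{}, disqualified: map[int]bool{}}
	s.complain(0)
	s.complain(0) // second complaint -> participant 0 disqualified
	s.complain(1)
	s.complain(1) // participant 1 disqualified as well
	fmt.Println(s.failed()) // true: "2 disqualified, threshold is 1, size is 3"
}
```

With size 3 and threshold 1, disqualifying any 2 participants is already fatal, which matches the "Joint-Feldman failed" error above.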
Possible reasons for the failed DKG, based on the logs:
The collection node doesn't catch up after the pause
A previously failed DKG interferes with the node's state and prevents performing the DKG in the next epoch
Invalid test configuration
Definition of done
Investigate the reason for the DKG failing after the failed DKG in the previous epoch
Of these errors, I am most concerned about:
```json
{"level":"error","node_role":"consensus","node_id":"3ec567126474f024f2fff57b1e5e626e803ec56f3664c10ef8e49360dafc40ba","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-1","error":"wrong DKG instance. Got dkg-flow-localnet-2, want dkg-flow-localnet-1","time":"2024-11-15T17:13:15.06699973Z","message":"bad message"}
```
This indicates that:
two Consensus Nodes are simultaneously running DKG engines for different epochs (one for epoch 1, one for epoch 2)
a DKG message for epoch 2 was routed to an engine operating the DKG for epoch 1
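The "wrong DKG instance" error is consistent with a broker that validates the instance ID on every inbound message and rejects mismatches. A hedged sketch of such a check (the actual flow-go broker code may differ):

```go
package main

import "fmt"

// Stand-ins for a DKG message and the per-epoch dkg_broker component; both
// types are illustrative, not the real flow-go definitions.
type msg struct{ dkgInstanceID string }

type broker struct{ dkgInstanceID string }

// onMessage rejects messages tagged for a different DKG instance, which
// would surface as the "bad message" / "wrong DKG instance" log line.
func (b *broker) onMessage(m msg) error {
	if m.dkgInstanceID != b.dkgInstanceID {
		return fmt.Errorf("wrong DKG instance. Got %s, want %s",
			m.dkgInstanceID, b.dkgInstanceID)
	}
	// ...process the message...
	return nil
}

func main() {
	stale := &broker{dkgInstanceID: "dkg-flow-localnet-1"} // engine for epoch 1
	fmt.Println(stale.onMessage(msg{dkgInstanceID: "dkg-flow-localnet-2"}))
}
```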
I can see two possible reasons for this:
1. The receiver Consensus Node has a lagging local state.
This is an expected possibility in general, but unlikely/impossible to be the case here. Because there are very few Consensus Nodes in the test case (I believe 3), in practice all would need to be online to maintain liveness and a consistent block rate. So I assume that all Consensus Nodes are synchronized on the latest view.
(Maybe it is possible that the asynchronous event handlers are lagging on one node, despite it being synchronized on the latest view? Seems unlikely though.)
2. The receiver Consensus Node failed to properly tear down its DKG engine for epoch 1 (even after the DKG for epoch 2 had already started)
If true, this is dangerous because an old "hanging" DKG engine could process messages intended for a newer DKG.
The sender would think the message was successfully delivered, and would not re-send it.
The "hanging" engine would discard the message, because it is for the wrong DKG instance
The correct DKG instance would never receive the message
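A hypothetical sketch of this silent-loss scenario, assuming delivery hands each inbound message to whichever engine is currently registered (all names are illustrative):

```go
package main

import "fmt"

// engine stands in for a per-epoch DKG engine; the call below models a node
// that still routes inbound DKG traffic to a stale engine.
type engine struct{ instanceID string }

func (e *engine) process(instanceID, payload string) {
	if instanceID != e.instanceID {
		// The stale engine discards the message. Nothing notifies the sender
		// (who already saw a successful delivery) or the correct engine, so
		// the message is simply lost.
		fmt.Printf("discarding %q: wrong DKG instance (got %s, want %s)\n",
			payload, instanceID, e.instanceID)
		return
	}
	fmt.Printf("processing %q\n", payload)
}

func main() {
	// The epoch-1 engine was never torn down and is still registered.
	registered := &engine{instanceID: "dkg-flow-localnet-1"}
	// An epoch-2 message arrives: the network layer reports success to the
	// sender, but the payload never reaches the epoch-2 engine.
	registered.process("dkg-flow-localnet-2", "epoch-2 key share")
}
```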
Suspected Cause
After looking through the code, I suspect Controller.Shutdown() -- which tears down the DKG instance broker -- is not being called when we enter, and then recover from, EFM.
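Under that suspicion, the fix would be to perform on the EFM-recovery path the same teardown that a normal epoch transition performs. A hypothetical sketch (only Controller.Shutdown() is named above; the surrounding wiring is assumed):

```go
package main

import "fmt"

// Hypothetical sketch, not the actual flow-go code: before starting the
// recovery epoch's DKG, the previous controller must be shut down, otherwise
// the old engine keeps receiving (and discarding) messages.
type Controller struct{ instanceID string }

// Shutdown tears down the DKG instance broker (named after the real
// Controller.Shutdown() referenced above).
func (c *Controller) Shutdown() { fmt.Println("shutting down", c.instanceID) }

type Reactor struct{ current *Controller }

// onEpochFallbackRecovery models entering the recovery epoch; the suspected
// bug is that this path skips the Shutdown call below.
func (r *Reactor) onEpochFallbackRecovery(recoveryInstanceID string) {
	if r.current != nil {
		r.current.Shutdown() // suspected to be missing on the EFM-recovery path
	}
	r.current = &Controller{instanceID: recoveryInstanceID}
}

func main() {
	r := &Reactor{current: &Controller{instanceID: "dkg-flow-localnet-1"}}
	r.onEpochFallbackRecovery("dkg-flow-localnet-2")
}
```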