Answering: "What happens today when we run nimbus --goerli
?"
#862
Labels: Networking, Security, Sync
This simple question was getting speculative answers, but for one reason or another, nobody had really tested it in a long time.
This meta issue is to write down what really happens when we run `nimbus --goerli`, with no other changes, to discover all the significant issues. See also related issue #687 "Sync to Goerli". Issues that come up are to be filed separately and fixed one by one, outside this meta issue. Ideally we will file individual issues and fixes, and update this meta issue to point to them.

Time and space required
It has proven useful to have a guideline for how much time and storage to expect, so a Goerli sync can be replicated without going through the tedium of trial and error, disk-full recovery efforts, etc.

(Later commits are required for the sync to complete; see issues below.)
To finish in a similar time, you will need to run `nimbus --goerli` in a loop to auto-restart it when it crashes, and with enough storage space. The time shown above does not count stops during the test where Nimbus crashed and was later restarted after analysis, time to recover from disk-full conditions, or time spent syncing which was reverted to a clean storage snapshot. (True calendar time for this test was 28 days 18 hours.)
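For replication, the restart loop can be as simple as the following shell sketch; the binary name, path, and delay are assumptions, not details from the original test setup:

```sh
# Auto-restart nimbus whenever it crashes; adjust the binary path
# and options to your build.
while true; do
  nimbus --goerli
  echo "nimbus exited with status $?; restarting in 5s" >&2
  sleep 5
done
```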
What happened

Short version: bulk sync completes, but only by running `nimbus` in a loop.

Points where bulk sync stopped
The issues in this section needed fixes before sync could reach the Goerli head block in October 2021. All are fixed now.
Note: on Goerli, Berlin starts at block 4460644 and London starts at block 5062605.
- Progress slowed dramatically after the start of the Berlin fork, 4460644. This involved `GetBlockHeaders`, but only after the Nimbus sync had reached Berlin; fixed by "…checks" and by a fix in "nim-confutils".
- Progress stopped at block 4463617. This block number was due to a consensus bug at block 4463713 (see next), and the batching logic in `blockchain_sync`, which does 192 blocks at a time and aborts the whole batch when any block fails (see the sketch after this list).
- Consensus bug at block 4463713. This was an incorrect gas calculation in the EVM, but it manifested as a mismatching `receiptRoot`, because it didn't change the state, only the receipt. See the `receiptRoot` analysis starting from "Sync to Goerli" #687 (comment).
- Progress stopped at block 4494913. This was due to a sync networking bug in `blockchain_sync`: issuing requests for N block bodies and incorrectly handling replies which contained fewer than N bodies. Fixed by #865 "Sync fix: GetBlockBodies logic preventing sync, dropping peers".
- Progress stopped at block 5080897. This block number was due to a consensus bug at block 5080941 (see next), and the same 192-block batch-abort logic in `blockchain_sync`.
- Consensus bug at block 5080941. This was incorrect return data when a `CREATE` or `CREATE2` operation tried to make a contract larger than 24 kiB and the caller looked at the `RETURNDATA` or `RETURNDATASIZE`. The correct behaviour is empty `RETURNDATA` when this happens.
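To make the batch-abort behaviour concrete, here is a minimal, self-contained Nim sketch. It is not the real `blockchain_sync` code: `processBlock`, `firstStalledBlock`, and the batch alignment starting at 4463425 are illustrative assumptions. It shows why a consensus failure at block 4463713 stalls sync at the earlier batch boundary 4463617:

```nim
const
  batchSize = 192'u64
  badBlock  = 4463713'u64   # block with the consensus bug

proc processBlock(n: uint64): bool =
  ## Stand-in for header/body validation and EVM execution; the real
  ## check is of course much more than a block-number comparison.
  n != badBlock

proc firstStalledBlock(startBlock, endBlock: uint64): uint64 =
  ## Walks batches of 192 blocks; any failure aborts the whole batch,
  ## so nothing from that batch's first block onward is persisted.
  var base = startBlock
  while base <= endBlock:
    let last = min(base + batchSize - 1, endBlock)
    for n in base .. last:
      if not processBlock(n):
        return base           # sync stalls at the batch boundary
    base = last + 1           # whole batch ok: persist and continue
  base                        # reached endBlock without stalling

# With batches aligned to a start of 4463425 (an assumption), the batch
# [4463617 .. 4463808] contains the bad block 4463713, so:
echo firstStalledBlock(4463425'u64, 4_500_000'u64)   # prints 4463617
```

The same mechanism accounts for the stall at 5080897 for the bug at 5080941: that block falls inside the 192-block batch beginning at 5080897.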
Crash types
The crashes in this section have a simple workaround: restart Nimbus-eth1. For the disk-full crash, of course, you must add more disk space to stop it happening.

Of course, the workaround is not good enough for a release, but it's useful to have this set of observations so we can address the crashes one by one. Fixes will be noted here as they are implemented. Meanwhile, for the purpose of working with Nimbus-eth1, it's good to know that just restarting is a temporary solution to these issues.
- Symptom (seen 10 times during Goerli sync): `AssertionError` with no useful stacktrace. This happens when `devp2p` accepts too many incoming connections and `libbacktrace` can't open `/proc/self/exe` to read the executable. It's been proposed to change `libbacktrace`, but that wouldn't address the real problem here, which is about sockets, not `libbacktrace`. Workaround: `ulimit -S -n 65536` (see the command sketch after this list); this will allow Nimbus-eth1 to run longer between these types of crashes. Limiting accepted connections properly is not trivial, because the `accept()` queue complicates accounting.
- Symptom (seen 1 time during Goerli sync): `Too many open files [ResultDefect]` with no useful stacktrace. Same underlying cause: `devp2p` accepts too many incoming connections.
- Symptom (seen 10 times during Goerli sync): `[Defect] ac.savePoint == sp and sp.state == Pending`. This may be the same as #582 "AssertionError when syncing goerli", but there are enough differences that it may not be.
- Symptom (seen 4 times during Goerli sync): `[RangeError] value out of range: -87 notin 0 .. 9223372036854775807`. This points to `devp2p` code being insufficiently resilient to out-of-range values from the network.
- Symptom (seen 10 times during Goerli sync): `FutureDefect` relating to `rlpx.nim`. The log line `Received msg already expired` was always found in the page of output prior to this crash, which lends weight to the idea that this is the cause. There are existing issues about `Future` handling and timeouts; the line numbers seen (in `p2p_protocol_dsl.nim` and `rlpx.nim`) are equivalent to the line numbers in those issues, just at a different point in repo history.
- Symptom (seen 2 times during earlier Goerli tests, not the latest): `[IndexError] index 33 not in 0 .. 32`. This is in `devp2p` networking code, which should not crash when it receives a faulty message.
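The file-descriptor workaround mentioned above amounts to raising the soft limit in the shell before starting Nimbus; a minimal sketch (65536 is the value quoted above, and combining this with the earlier restart loop is optional):

```sh
# Raise the soft open-file limit for this shell session, then run
# nimbus; this delays, but does not fix, the fd-exhaustion crashes.
ulimit -S -n 65536
nimbus --goerli
```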
After bulk sync is completed to the current head block of Goerli
Real-time chain updates are not followed properly. Nimbus does pick up new blocks, but only about once every 2-17 minutes.
Occasional symptom in the synchronised state: `FutureDefect` relating to `persistBlocksImpl`. The cause is that `blockchain_sync` cannot handle a chain reorg from the network. `BlockNotFound` should be handled, and the `blockchain_sync` work item aborted, instead of crashing. `BlockNotFound` occurs because the new blocks received refer to a parent block that doesn't exist locally: `blockchain_sync`'s block fetching strategy and data model are not designed for multiple simultaneously competing branches.
's block fetching strategy and data model is not designed for multiple simultaneously competing branches.The inability to handle reorgs after reaching the chain head suggests bulk sync also has a bug, one that would only rarely manifest. It suggets incorrect chain handling in the last few hundreds (thousands at a push) of blocks of bulk sync, if there are competing chains being received. Instead of switching to the winning chain, because of the data model and fetching strategy it will receive conflicting data (over time or from different peers) causing an unfetched gap in received blocks on the new winning chain.