[loader] Generation counters to keep global module cache in sync #15167
Conversation
Force-pushed from 5d432d0 to 108a23d
- Add more tests
- Refactor APIs of block AptosVM
- Enable global cache for e2e-move-tests
Force-pushed from db24883 to cbb7910
No, the job ranges are run from different processes... But I guess it's still an issue we need to address
@msmouse how? The global cache within a single replay job gets speculative updates; this is only possible because we execute out of order? I.e., a job executes txns 20-30, then 10-20, etc.?
Okay, it's this (line 1026): while it's trying to verify chunk 1000-1999, it will try to replay 1027-2026 next.
// Otherwise, this is the first time this module gets accessed in this block. We need
// to validate it is the same as in the state. We do it only once on the first access.
let is_valid = builder
    .build(key)?
This builds the whole thing from scratch? In theory we only need to check whether the hash/bytes match the raw data from storage, and skip the actual build part?
Yes
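For illustration, a minimal sketch of the shortcut being suggested, using a stand-in hash function (all names here are hypothetical, not the actual aptos-core API; the real code would compare the module's cryptographic hash):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in content hash; the real code would use the module's cryptographic
// hash (e.g., SHA3-256) rather than DefaultHasher.
fn content_hash(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

// Hypothetical sketch of the suggested shortcut: instead of rebuilding the
// module from scratch, check whether the cached entry's hash matches the raw
// bytes fetched from storage, and skip the build entirely on a match.
fn cache_entry_matches_state(cached_hash: u64, raw_bytes_from_storage: &[u8]) -> bool {
    cached_hash == content_hash(raw_bytes_from_storage)
}

fn main() {
    let bytes = b"module bytes from storage";
    let cached = content_hash(bytes);
    assert!(cache_entry_matches_state(cached, bytes));
    assert!(!cache_entry_matches_state(cached, b"upgraded module bytes"));
}
```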
impl PartialEq for AptosModuleExtension {
    fn eq(&self, other: &Self) -> bool {
        self.hash.eq(&other.hash) && self.state_value_metadata().eq(other.state_value_metadata())
    }
}
Why do we skip the bytes in the new impl? Optimization?
We already have the hash, so it should be faster.
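(Presumably the reasoning: the hash is computed over the module bytes, so equal hashes imply equal bytes up to hash collisions, and comparing two fixed-size hashes is cheaper than comparing byte blobs of arbitrary length.)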
Closing because #15192 is better and solves the problem.
Description
Previously, the global module cache assumed linear execution: blocks execute in order, adding and removing entries from the global cache. This is not true if blocks are replayed out of order (e.g., replay jobs running overlapping transaction ranges from different processes) or if block execution is retried.
This PR adds a generation counter to the module cache, which is incremented every block. For each module in the cache, we also keep track of its generation. If a module's generation differs from the block's, we check that the cached entry is the same as the one in state (comparing the hash and state value metadata). If they are not equal, we use the MV data structures; if they are equal, we set the module's generation to the block's generation. As a result, only first accesses are validated per block.
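A minimal sketch of the scheme, with simplified stand-in types (`GlobalModuleCache`, `CachedModule`, string keys, and the `u64` metadata field are illustrative, not the actual aptos-core types):

```rust
use std::collections::HashMap;

/// Illustrative cache entry: a module's hash and metadata, plus the
/// generation at which the entry was last validated against the state.
struct CachedModule {
    hash: [u8; 32],
    state_value_metadata: u64, // stand-in for the real metadata type
    generation: u64,
}

/// Illustrative global cache shared across blocks.
struct GlobalModuleCache {
    generation: u64, // incremented once per block
    modules: HashMap<String, CachedModule>,
}

impl GlobalModuleCache {
    /// Called at the start of every block.
    fn begin_block(&mut self) {
        self.generation += 1;
    }

    /// On access within a block: if the entry was already validated in this
    /// block, reuse it; otherwise compare its hash and metadata against the
    /// state, and either stamp it with the current generation or report it
    /// stale (in which case the caller falls back to the MV data structures).
    fn validate_on_access(&mut self, key: &str, state_hash: [u8; 32], state_metadata: u64) -> bool {
        let generation = self.generation;
        match self.modules.get_mut(key) {
            Some(entry) if entry.generation == generation => true,
            Some(entry)
                if entry.hash == state_hash && entry.state_value_metadata == state_metadata =>
            {
                entry.generation = generation;
                true
            },
            _ => false,
        }
    }
}

fn main() {
    let mut cache = GlobalModuleCache { generation: 0, modules: HashMap::new() };
    cache.modules.insert(
        "0x1::coin".to_string(),
        CachedModule { hash: [0u8; 32], state_value_metadata: 7, generation: 0 },
    );

    cache.begin_block();
    // First access in the block validates against the state once...
    assert!(cache.validate_on_access("0x1::coin", [0u8; 32], 7));
    // ...and later accesses in the same block skip the check.
    assert!(cache.validate_on_access("0x1::coin", [0u8; 32], 7));

    cache.begin_block();
    // In a later block, a mismatching state hash marks the entry stale.
    assert!(!cache.validate_on_access("0x1::coin", [1u8; 32], 7));
}
```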
This does have a small impact on performance (e.g., ~1k TPS on no-op), but not enough to count as a regression, and it is still better than the V1 implementation when there are many modules. For now, this is sufficient to 1) pass replay, 2) protect against retries, and 3) ensure that we always sync with the state view when executing a block. The last part is actually great: it allows us to share the cache even across multiple tests, because we invalidate entries that do not pass the check w.r.t. the state view.
Replay run: https://github.com/aptos-labs/aptos-core/actions/runs/11643658386. There are only a few mainnet failures, and no more 300+ errors (including the weird backwards-compatibility error), which confirms the hypothesis.
How Has This Been Tested?
Key Areas to Review
Type of Change
Which Components or Systems Does This Change Impact?
Checklist