Replay Verify scheduler #15103

Merged
merged 1 commit into from
Nov 12, 2024

Conversation

Contributor

@areshand areshand commented Oct 28, 2024

Description

  1. A basic scheduler that schedules tasks based on a config and collects the results for reporting.
    The next step is to run it on a real cluster and add it to a GitHub workflow.
  2. Automation for creating a snapshot from an archive node and for creating the disk, PV, and PVC for the k8s cluster.
  3. Fix a bug in replay-on-archive when there are no transactions to execute.
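
As an illustration of item 1, here is a minimal sketch of a config-driven scheduler that runs tasks concurrently and collects results for a report. It is not the code in this PR: the config keys (`start`, `end`, `range_size`), the `Task` fields, and all function names are hypothetical, and a thread pool stands in for whatever actually launches the work on the cluster.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One replay range; the field names are hypothetical."""
    name: str
    start_version: int
    end_version: int


def build_tasks(config):
    """Split the inclusive range [start, end] from a config dict into chunks."""
    start, end, size = config["start"], config["end"], config["range_size"]
    tasks = []
    for i, lo in enumerate(range(start, end + 1, size)):
        hi = min(lo + size - 1, end)
        tasks.append(Task(f"task-{i}", lo, hi))
    return tasks


def run_all(tasks, run_task, max_workers=4):
    """Run tasks concurrently and collect (ok, detail) per task name."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_task, t): t for t in tasks}
        for fut in as_completed(futures):
            task = futures[fut]
            try:
                results[task.name] = (True, fut.result())
            except Exception as exc:
                # Record the failure instead of aborting the whole run.
                results[task.name] = (False, str(exc))
    return results


def summarize(results):
    """Build a small report: total task count and the sorted failed task names."""
    failed = sorted(name for name, (ok, _) in results.items() if not ok)
    return {"total": len(results), "failed": failed}
```

Recording failures per task, rather than raising on the first one, lets the final report list every failed range in a single run.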

How Has This Been Tested?

Tested on my local kube cluster

Key Areas to Review

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Move Compiler
  • Other (specify)

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

trunk-io bot commented Oct 28, 2024

⏱️ 2h 54m total CI duration on this PR
Slowest 15 Jobs      Cumulative Duration   Recent Runs
rust-cargo-deny      25m                   🟩🟩🟩🟩 (+11 more)
check-dynamic-deps   21m                   🟩🟩🟩🟩🟩 (+12 more)
rust-move-tests      10m                   🟩
rust-move-tests      9m                    🟩
rust-move-tests      9m                    🟩
rust-move-tests      9m                    🟩
rust-move-tests      9m                    🟩
rust-move-tests      9m                    🟩
rust-move-tests      9m                    🟩
rust-move-tests      9m                    🟩
rust-move-tests      9m                    🟩
rust-move-tests      9m                    🟩
semgrep/ci           8m                    🟩🟩🟩🟩🟩 (+12 more)
general-lints        7m                    🟩🟩🟩🟩 (+11 more)
rust-move-tests      7m

🚨 1 job on the last run was significantly faster/slower than expected

Job                  Duration   vs 7d avg   Delta
check-dynamic-deps   2m         1m          +94%


@areshand areshand force-pushed the new_replay_verify branch 5 times, most recently from 2d4bcf2 to bdb718b Compare October 30, 2024 17:19
@areshand areshand changed the title New replay verify Replay Verify scheduler Oct 30, 2024
@areshand areshand marked this pull request as ready for review October 30, 2024 17:24
@areshand areshand force-pushed the new_replay_verify branch 2 times, most recently from c776074 to b40e99a Compare October 30, 2024 17:35
@areshand areshand requested a review from aluon October 30, 2024 17:40
Contributor

@msmouse msmouse left a comment

🪄

from enum import Enum


# Hyperparameters
Contributor

what? is there any machine learning involved? 😂

Contributor Author

Removed the ones that are supposed to be passed by config.
The rest should be fixed.

Comment on lines +36 to +62
# Check if the environment variable already exists
for env_var in container["env"]:
    if env_var["name"] == name:
        env_var["value"] = value
        return

# If it doesn't exist, add it
container["env"].append({"name": name, "value": value})
Contributor

what, you can't just "upsert"?
(I've forgotten Python)

Contributor Author

@areshand areshand Oct 31, 2024

This is a list (vector) of dicts, and each dict is {"name": name, "value": value}. The format comes from the k8s config.
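
To make the point concrete, here is a small self-contained sketch of an upsert over that list-of-dicts shape. The function name `set_env_var` is hypothetical and the container dict is a stand-in for a parsed k8s container spec. Because entries are keyed by their "name" field rather than by a dict key, there is no one-line upsert, only a linear scan.

```python
def set_env_var(container, name, value):
    """Insert or update one entry in a Kubernetes-style container env list."""
    # Create the env list if the container spec does not have one yet.
    env = container.setdefault("env", [])
    for env_var in env:
        if env_var["name"] == name:
            # The variable already exists: overwrite its value in place.
            env_var["value"] = value
            return
    # Not found: append a new name/value entry.
    env.append({"name": name, "value": value})


container = {"name": "replay-verify-worker"}
set_env_var(container, "A", "1")
set_env_var(container, "A", "2")  # updates the existing entry, no duplicate
```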

testsuite/replay-verify/main.py (outdated, resolved)
testsuite/replay-verify/main.py (outdated, resolved)
testsuite/replay-verify/main.py (outdated, resolved)
testsuite/replay-verify/main.py (resolved)
serviceAccountName: default
restartPolicy: Never # Pod restarts only if it fails
containers:
- name: replay-verify-worker
Contributor

probably gonna wanna generate this in code

Contributor Author

This is updated dynamically in the start() function. This file serves as a template.
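
A rough sketch of the template approach, assuming the YAML has already been parsed into a dict. `POD_TEMPLATE`, `render_pod`, and the environment variable names below are hypothetical and only illustrate the pattern; the actual start() function in this PR may differ.

```python
import copy

# Stand-in for the parsed YAML template; only a few fields are shown.
POD_TEMPLATE = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "replay-verify-worker"},
    "spec": {
        "serviceAccountName": "default",
        "restartPolicy": "Never",
        "containers": [{"name": "replay-verify-worker", "env": []}],
    },
}


def render_pod(template, pod_name, env):
    """Return a per-task copy of the template with its name and env filled in."""
    # Deep copy so the shared template is never mutated between tasks.
    pod = copy.deepcopy(template)
    pod["metadata"]["name"] = pod_name
    container = pod["spec"]["containers"][0]
    # Kubernetes expects env values to be strings.
    container["env"] = [{"name": k, "value": str(v)} for k, v in env.items()]
    return pod


pod = render_pod(POD_TEMPLATE, "worker-0", {"START_VERSION": 0, "END_VERSION": 99})
```

The deep copy is the important part: each worker gets its own manifest while the template stays reusable.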

Contributor

@perryjrandall perryjrandall left a comment

You might wanna add replay-verify as a tool in poetry so you can poetry run the tool

https://gist.github.com/perryjrandall/cc38e1a47c49d7da807a512ce7283796

@areshand areshand force-pushed the new_replay_verify branch 6 times, most recently from 26d9c75 to b49aa77 Compare November 11, 2024 22:50
@areshand areshand added the CICD:build-performance-images build performance docker image variants label Nov 11, 2024
@areshand areshand enabled auto-merge (rebase) November 12, 2024 18:34

Contributor

✅ Forge suite realistic_env_max_load success on 957ee8f25fe21c155227ecde11b746eef469e43d

two traffics test: inner traffic : committed: 14430.47 txn/s, latency: 2757.37 ms, (p50: 2700 ms, p70: 2700, p90: 2900 ms, p99: 3200 ms), latency samples: 5486780
two traffics test : committed: 100.09 txn/s, latency: 1603.13 ms, (p50: 1300 ms, p70: 1400, p90: 1600 ms, p99: 9400 ms), latency samples: 1760
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 2.038, avg: 1.584", "ConsensusProposalToOrdered: max: 0.309, avg: 0.292", "ConsensusOrderedToCommit: max: 0.363, avg: 0.357", "ConsensusProposalToCommit: max: 0.657, avg: 0.648"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 1.08s no progress at version 2682400 (avg 0.20s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 8.50s no progress at version 2682398 (avg 8.50s) [limit 15].
Test Ok

Contributor

✅ Forge suite framework_upgrade success on fcd2dedf6ca61a23f4979e944c629dcdbdae5dca ==> 957ee8f25fe21c155227ecde11b746eef469e43d

Compatibility test results for fcd2dedf6ca61a23f4979e944c629dcdbdae5dca ==> 957ee8f25fe21c155227ecde11b746eef469e43d (PR)
Upgrade the nodes to version: 957ee8f25fe21c155227ecde11b746eef469e43d
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1413.10 txn/s, submitted: 1415.63 txn/s, failed submission: 2.53 txn/s, expired: 2.53 txn/s, latency: 2247.20 ms, (p50: 2100 ms, p70: 2300, p90: 3300 ms, p99: 5200 ms), latency samples: 122900
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1288.27 txn/s, submitted: 1289.81 txn/s, failed submission: 1.54 txn/s, expired: 1.54 txn/s, latency: 2307.37 ms, (p50: 2100 ms, p70: 2400, p90: 3900 ms, p99: 5200 ms), latency samples: 117240
5. check swarm health
Compatibility test for fcd2dedf6ca61a23f4979e944c629dcdbdae5dca ==> 957ee8f25fe21c155227ecde11b746eef469e43d passed
Upgrade the remaining nodes to version: 957ee8f25fe21c155227ecde11b746eef469e43d
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1313.48 txn/s, submitted: 1317.55 txn/s, failed submission: 4.07 txn/s, expired: 4.07 txn/s, latency: 2305.59 ms, (p50: 2100 ms, p70: 2400, p90: 3900 ms, p99: 5700 ms), latency samples: 116180
Test Ok

Contributor

✅ Forge suite compat success on fcd2dedf6ca61a23f4979e944c629dcdbdae5dca ==> 957ee8f25fe21c155227ecde11b746eef469e43d

Compatibility test results for fcd2dedf6ca61a23f4979e944c629dcdbdae5dca ==> 957ee8f25fe21c155227ecde11b746eef469e43d (PR)
1. Check liveness of validators at old version: fcd2dedf6ca61a23f4979e944c629dcdbdae5dca
compatibility::simple-validator-upgrade::liveness-check : committed: 17053.15 txn/s, latency: 1973.82 ms, (p50: 1900 ms, p70: 2100, p90: 2200 ms, p99: 3300 ms), latency samples: 556620
2. Upgrading first Validator to new version: 957ee8f25fe21c155227ecde11b746eef469e43d
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7058.06 txn/s, latency: 4025.58 ms, (p50: 4400 ms, p70: 4800, p90: 4800 ms, p99: 4900 ms), latency samples: 130080
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 7183.52 txn/s, latency: 4496.61 ms, (p50: 4900 ms, p70: 5000, p90: 5700 ms, p99: 6400 ms), latency samples: 237200
3. Upgrading rest of first batch to new version: 957ee8f25fe21c155227ecde11b746eef469e43d
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 7749.55 txn/s, latency: 3735.14 ms, (p50: 4100 ms, p70: 4400, p90: 4500 ms, p99: 4600 ms), latency samples: 145320
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 7607.30 txn/s, latency: 4226.81 ms, (p50: 4500 ms, p70: 4600, p90: 5400 ms, p99: 5900 ms), latency samples: 253620
4. upgrading second batch to new version: 957ee8f25fe21c155227ecde11b746eef469e43d
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 11377.12 txn/s, latency: 2446.94 ms, (p50: 2600 ms, p70: 2900, p90: 3200 ms, p99: 3300 ms), latency samples: 197360
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 10757.14 txn/s, latency: 2905.02 ms, (p50: 2800 ms, p70: 3100, p90: 4500 ms, p99: 5400 ms), latency samples: 350140
5. check swarm health
Compatibility test for fcd2dedf6ca61a23f4979e944c629dcdbdae5dca ==> 957ee8f25fe21c155227ecde11b746eef469e43d passed
Test Ok

@areshand areshand merged commit 8cefc2e into main Nov 12, 2024
92 checks passed
@areshand areshand deleted the new_replay_verify branch November 12, 2024 19:13
Labels
CICD:build-performance-images build performance docker image variants
4 participants