Merge branch 'main' into ehp/gh-39-orch-log-rotate
ericpassmore committed Feb 26, 2024
2 parents 3d7ee75 + ed1471a commit 276e55b
Showing 45 changed files with 2,352 additions and 225 deletions.
25 changes: 17 additions & 8 deletions README.md
@@ -21,44 +21,53 @@ Select `LowEndOrchestrator` and use the default template.
![OrchTemplaceSelect](docs/images/CDOrchTemplateSelect.png)

## Updating Orchestrator Job Configuration
-By default the setup will spin up a webservice with [Production Run from Nov 2023](meta-data/full-production-run-20231130.json). To change the job configuration you need to create your own JSON configuration, and restart the service to use the new JSON.
+By default the setup will spin up a webservice with [Production Run from Nov 2023](meta-data/full-production-run-20231130.json). To change the job configuration, create your own JSON configuration and restart the service to use the new JSON. **Note**: use `nohup` when starting the python webservice so the process keeps running after the ssh shell exits.
- Create your own JSON following the example format from `test-simple-jobs.json`
- Upload the file to the orchestrator node
- Log into the orchestrator node as `ubuntu` user
- Kill the existing service named `python3 ... web_service.py`
-- Restart with your configuration `python3 $HOME/replay-test/orchestration-service/web_service.py --config my-config.json --host 0.0.0.0 --log ~/orch-complete-timings.log &`
+- Restart with your configuration `nohup python3 $HOME/replay-test/orchestration-service/web_service.py --config my-config.json --host 0.0.0.0 --log ~/orch-complete-timings.log &`

## Replay Setup
You can spin up as many replay nodes as you need. Replay nodes continuously pick up and process new jobs. Each replay host works on one job at a time before picking up the next, so a small number of replay hosts will process all the jobs given enough time. For example, if 100 replay slices are configured, anywhere from 1 to 100 replay hosts may be utilized.

-To run the replay nodes ssh into the orchestrator node and run [run-replay-instance.sh](scripts/run-replay-instance.sh). The script takes two arguments the first is the number of replay hosts to spin up. The second argument indicates this is a dry run, and don't start up the hosts.
+To run the replay nodes, ssh into the orchestrator node and run [run-replay-instance.sh](scripts/replayhost/run-replay-instance.sh). The script takes two arguments: the first is the number of replay hosts to spin up; the second, when present, indicates a dry run that does not start up the hosts.
```
ssh -i private.key -l ubuntu orchestor
cd replay-test
-scripts/run-replay-instance.sh 10 [DRY-RUN]
+scripts/replayhost/run-replay-instance.sh 10 [DRY-RUN]
```

**Note**: It is important to run this script, as it injects the IP address of the orchestrator node into the replay nodes. Without this script you would need to manually update all the replay nodes with the IP address of the orchestrator.

## Web Dashboard
-You can see the status of jobs, configuration, and summary of replay status by using the webservice on the orchestrator node. Navigate to `http://orchestor.example.com:4000/`.
+You can see the status of jobs, configuration, and a summary of replay status by using the webservice on the orchestrator node. Navigate to `http://orchestor.example.com/`.

Many HTTP calls support HTML, JSON, and Text responses. Look at [HTTP Service Calls](docs/http-service-calls.md) for other URL options and Accept encoding options.
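The content negotiation described above can be sketched as a small dispatch on the `Accept` header. This is an illustrative sketch only, not the actual `web_service.py` implementation:

```python
# Illustrative sketch of Accept-header dispatch (not the actual
# web_service.py code): map a request's Accept header to one of the
# response formats the orchestrator endpoints support.

def pick_format(accept_header):
    """Return 'html', 'json', or 'text' for an HTTP Accept header."""
    accept = (accept_header or "").lower()
    if "text/html" in accept:
        return "html"
    if "application/json" in accept:
        return "json"
    return "text"  # default: plain text

print(pick_format("text/html,application/xhtml+xml"))  # html
```

With a dispatch like this, `curl -H 'Accept: application/json' http://orchestor.example.com/summary` would receive JSON while a browser would receive HTML.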

## Termination of Replay Nodes
-Replay nodes are not automatically terminated. To save on hosting costs, it is advisable to terminate the nodes after the replay tests are completed. Termination can be accomplished using the AWS dashboard.
+Replay nodes are not automatically terminated. To save on hosting costs, it is advisable to terminate the nodes after the replay tests are completed. Termination can be accomplished using the AWS dashboard or by running the termination script.

```
ssh -i private.key -l ubuntu orchestor
cd replay-test
scripts/replayhost/terminate-replay-instance.sh ALL [DRY-RUN]
```

## Operating Details
See [Operating Details](docs/operating-details.md) for a list of scripts, logs, and data.

## Testing
For testing options see [Running Tests](docs/running-tests.md)

## Generating Manifests
-The python script `replay-test/scripts/generate_manifest_from_eosnation.py` will build a manifest off the list of eos nation snapshots. A manifest may be validated for valid JSON and a contiguous block range using the [validate_manifest.py](scripts/validate_manifest.py) script
+The python script `replay-test/scripts/manifest/generate_manifest_from_eosnation.py` will build a manifest from the list of EOS Nation snapshots. A manifest may be validated for valid JSON and a contiguous block range using the [validate_manifest.py](scripts/manifest/validate_manifest.py) script.

Redirecting stdout is recommended, to separate the manifest output from the debug messages printed on stderr:
`python3 generate_manifest_from_eosnation.py --source-net mainnet 1> ./manifest-config.json`
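A contiguity check along the lines of what `validate_manifest.py` performs might look like the following sketch. The slice field names (`start_block_id`, `end_block_id`) are assumptions for illustration, not the manifest's confirmed schema:

```python
# Hedged sketch of a manifest contiguity check: once sorted by start
# block, each slice should begin right after the previous slice ends.
# Field names are illustrative assumptions.

def is_contiguous(slices):
    """True when the slices cover one gap-free block range."""
    ordered = sorted(slices, key=lambda s: s["start_block_id"])
    return all(
        nxt["start_block_id"] == cur["end_block_id"] + 1
        for cur, nxt in zip(ordered, ordered[1:])
    )
```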

### Options
In this release `block-space-between-slices`, `max-block-height`, and `min-block-height` are experimental.

- `--source-net` Defaults to `mainnet`. Which chain to target. Options include mainnet, kylin, and jungle
- `--leap-version` Defaults to `5.0.0`. Specify the version of leap to use from the builds
34 changes: 34 additions & 0 deletions config/nginx-replay-test.conf
@@ -0,0 +1,34 @@
# Default server configuration
#
server {
listen 80 default_server;
listen [::]:80 default_server;

# SSL configuration
#
# listen 443 ssl default_server;
# listen [::]:443 ssl default_server;
#

root /var/www/html;
index progress.html;
server_name _;

# pass these URLs to app
location ~ ^/(status|config|job|summary|healthcheck|replayhost|metrics|jobtimeoutcheck) {
proxy_buffering off;
proxy_pass http://127.0.0.1:4000;
}

# everything else serve static content
location / {
try_files $uri $uri/ =404;
}

# deny access to .htaccess files, if Apache's document root
# concurs with nginx's one
#
#location ~ /\.ht {
# deny all;
#}
}
3 changes: 2 additions & 1 deletion config/readonly-config.ini
@@ -1,6 +1,6 @@
# NOTES on EVM - no private key, no eosio::txn_test_gen_plugin
# continued notes on EVM state_history_plugin requires --disable-replay-opts
-eos-vm-oc-enable=auto
+eos-vm-oc-enable=on
abi-serializer-max-time-ms = 15
chain-state-db-size-mb = 65536
# chain-threads = 2
@@ -32,6 +32,7 @@ enable-stale-production = false
resource-monitor-not-shutdown-on-threshold-exceeded=true
read-only-read-window-time-us = 150000
read-only-write-window-time-us = 50000
http-max-response-time-ms = 100000

# must have plugins
plugin = eosio::chain_api_plugin
7 changes: 6 additions & 1 deletion config/sync-config.ini
@@ -1,6 +1,6 @@
# NOTES on EVM - no private key, no eosio::txn_test_gen_plugin
# continued notes on EVM state_history_plugin requires --disable-replay-opts
-eos-vm-oc-enable=auto
+eos-vm-oc-enable=on
abi-serializer-max-time-ms = 15
chain-state-db-size-mb = 65536
# chain-threads = 2
@@ -29,6 +29,11 @@ resource-monitor-not-shutdown-on-threshold-exceeded=true
read-only-read-window-time-us = 150000
read-only-write-window-time-us = 50000

# blocks log management
blocks-log-stride = 2000000
max-retained-block-files = 512
blocks-retained-dir = retained

# must have plugins
plugin = eosio::chain_api_plugin
plugin = eosio::chain_plugin
11 changes: 6 additions & 5 deletions docs/AWS-Host-Setup.md
@@ -3,14 +3,15 @@
## Orchestrator Node
- We use Ubuntu 22.04 OS on a t2.micro instance.
- You need to setup a private key for your host to support SSH.
-- IAM access allows the node to spin up relay nodes on the command line via `aws ec2 run-instances`
-- Security group opens port 4000 to private IP from replay Nodes
-- Security group opens port 4000 and SSH to administrator IPs (Your IP)
- Elastic IP used to set a static public IP will be bound to DNS entry later
+- IAM access allows the node to spin up replay nodes on the command line via `aws ec2 run-instances`
+- Security group opens the webservice to the private IPs of the replay nodes
+- Security group opens the webservice and SSH to administrator IPs (your IP)
- The User Data setup script may be found under [`scripts/orchestrator-bootstrap.sh`](../scripts/orchestrator-bootstrap.sh)

## Replay Nodes
-- We use unbuntu 22.04 OS on a TBD instance.
-- Mount an additional 32Gb SSD EC2 Storage Instance (mounted as /data by `replay-node-bootstrap.sh`)
+- We use Ubuntu 22.04 OS on a m5a.8xlarge instance.
+- Mount an additional gen2 32Gb SSD EC2 Storage Instance (mounted as /data by `replay-node-bootstrap.sh`)
- You need to setup a private key for your host to support SSH.
- IAM access allows the node to access S3 bucket (example `aws s3 ls`)
- Security group opens port SSH to orchestrator node, and administrator IPs (Your IP)
24 changes: 21 additions & 3 deletions docs/high-level-design.md
@@ -2,6 +2,8 @@

This service is designed to run multiple hosts that download, install, and run nodeos. Each host then loads a snapshot and syncs blocks to a specific block number. Once that process is finished, the host reports back the integrity hash representing nodeos' final state.

See [Operating Details](./operating-details.md) for a list of scripts, logs, and data.

## Overview
Once replay hosts are spun up they contact the orchestration service to get the information needed to run their jobs. The replay hosts update the orchestration service with their progress and current status. The orchestration service is single threaded, and has checks to ensure there are no overwrites or race conditions. The replay nodes use increasing backoffs to avoid sending too many simultaneous requests.
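The increasing backoff mentioned above can be sketched as exponential backoff with jitter. The constants and jitter strategy here are illustrative, not the replay client's actual values:

```python
import random

# Sketch of increasing backoff between requests to the orchestration
# service: the delay ceiling grows geometrically up to a cap, and the
# actual delay is drawn uniformly below the ceiling ("full jitter") so
# many replay hosts do not retry in lockstep.

def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=6):
    """Return a list of `attempts` delay values in seconds."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * factor ** attempt)
        delays.append(random.uniform(0, ceiling))  # full jitter
    return delays
```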

@@ -32,6 +34,17 @@ UpdateRelStyle(replayD, orchestrator, $offsetY="-35", $offsetX="+30")
Rel(replayD, orchestrator, "gets jobs/sets status", "HTTP")
```

## Orchestrator Lifecycle

Currently, adding and tearing down replay hosts is a manual process that requires a human to log into the orchestrator host and run scripts.

Once the orchestrator is set up and loaded with a configuration file, it has a list of jobs to run.
- All jobs start in `WAITING_4_WORKER` status
- Replay hosts are needed to pick up and process the jobs
The orchestrator spins up replay hosts via a command line script. This script passes the private IP of the orchestrator node along to the replay nodes.

At the end of the run, a script is called to terminate the replay node instances.

## Sequence
The replay host picks up a job and updates the job's status as it works through the lifecycle. A full list and description of the [HTTP API is documented separately](./http-service-calls.md). The replay host will update the progress by updating the last block processed. The full list of statuses is found here: https://github.com/eosnetworkfoundation/replay-test/blob/main/orchestration-service/job_status.py#L8-L15

Expand Down Expand Up @@ -61,11 +74,16 @@ sequenceDiagram
1. performs file setup: create dirs, get snapshot to load
2. GET job details from orchestration service, incls. block range
3. local non-priv install of nodeos
-4. starts nodeos loads the snapshot
+4. starts nodeos, loads the snapshot, sends the integrity hash
- sends the integrity hash from the snapshot back to orchestration service
- the snapshot block height is the same as the start block height of this job
- this integrity hash is the expected integrity for another job
- the other job has the end block num matching the current job's start block num
5. replays transactions to the specified block height from blocks.log or networked peers and terminates
6. restarts nodeos in read-only mode to get the final integrity hash
7. POSTs completed status for the configured block range
8. retains blocks logs, copying them over to cloud storage
-- this is the actual integrity hash for this job
+- the actual integrity hash for this job's end block num
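The hash chaining described in the steps above can be sketched as a consistency check: the integrity hash taken after loading a job's snapshot (at its start block) should equal the expected hash recorded for the job that ends at that block. Field names here are illustrative assumptions, not the orchestration service's confirmed schema:

```python
# Hedged sketch of integrity-hash chaining across jobs: the snapshot
# hash at job N's start block must match the expected hash of the job
# whose end block equals that start block.

def chain_is_consistent(jobs):
    """True when every adjacent pair of jobs agrees on the shared hash."""
    expected_at_end = {j["end_block_num"]: j["expected_integrity_hash"] for j in jobs}
    for job in jobs:
        prev_hash = expected_at_end.get(job["start_block_num"])
        if prev_hash is not None and prev_hash != job["snapshot_integrity_hash"]:
            return False
    return True
```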

Communicates to orchestration service via HTTP

@@ -76,4 +94,4 @@ Dependency on aws client, python3, curl, and large volume under /data
Final report shows
- number of blocks processed and percentage of total blocks processed
- number of successfully completed, failed jobs, and remaining jobs
-- list of failed jobs with `Job Id`, `Configuration Slice`, and status
+- list of failed jobs with `Job Id`, `Configuration Slice`, and `Status`
18 changes: 18 additions & 0 deletions docs/http-service-calls.md
@@ -4,6 +4,7 @@
- job - gets/sets configuration data for the replay nodes
- status - gets a replay nodes progress and state
- config - get/sets the configuration data used to initialize the job
- summary - progress of current run and reports any failed jobs
- healthcheck - gets 200/OK always

## Job
@@ -49,6 +50,23 @@ For the GET request when there are no parameters return statuses for all jobs. R
### POST
When running replay tests we don't always know the expected integrity hash, for example when the state database is updated, which may come as part of a leap version update. For that reason we take the integrity hash, after loading a snapshot, as the known good integrity hash at that block height. The `/config` POST request uses the `end_block_num` in the body to look up the configuration slice. Following that, the POST updates the configuration in memory and flushes it back to disk. This persists the integrity hash as the known good, expected value at `end_block_num`.
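The lookup-and-update described above can be sketched as follows. The slice field names and the flush-to-disk format are assumptions for illustration, not the service's confirmed behavior:

```python
import json

# Hedged sketch of /config POST handling: find the slice whose
# end_block_num matches the request, record the integrity hash as the
# expected value, and optionally flush the updated slices back to disk.

def update_expected_hash(slices, end_block_num, integrity_hash, path=None):
    """Return True when a matching slice was found and updated."""
    for s in slices:
        if s["end_block_num"] == end_block_num:
            s["expected_integrity_hash"] = integrity_hash
            if path:  # persist the in-memory update
                with open(path, "w") as f:
                    json.dump(slices, f, indent=2)
            return True
    return False
```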

## Summary (Progress)

### GET
Returns the following
- number of blocks processed
- total number of blocks to process
- jobs completed
- jobs failed
- jobs remaining

In addition, lists the failed jobs with the status, links to job details, and config slice.

Content type support:
- If the `Accept` header is `text/html`, returns HTML
- If the `Accept` header is `application/json`, returns JSON
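The numbers the summary reports can be sketched as a reduction over the job list. The status and field names here are illustrative (the real status set lives in `job_status.py`):

```python
# Hedged sketch of the /summary computation: totals over a job list.
# Status strings and field names are assumptions for illustration.

def summarize(jobs, total_blocks):
    """Return the progress counters a summary endpoint might report."""
    processed = sum(j["last_block_processed"] - j["start_block_num"] for j in jobs)
    completed = sum(1 for j in jobs if j["status"] == "COMPLETE")
    failed = sum(1 for j in jobs if j["status"] == "ERROR")
    return {
        "blocks_processed": processed,
        "total_blocks": total_blocks,
        "percent_complete": round(100 * processed / total_blocks, 1),
        "jobs_completed": completed,
        "jobs_failed": failed,
        "jobs_remaining": len(jobs) - completed - failed,
    }
```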


## Healthcheck
Always returns the same value; used for healthchecks

46 changes: 46 additions & 0 deletions docs/operating-details.md
@@ -0,0 +1,46 @@
# Operating Details

Outlines the scripts that are called and where data, logs, and configuration files live.

## Orchestration

All scripts run under the `ubuntu` user. The `replay-test` github repository is cloned into this user's home directory.
### `Top Level Items`
- /home/ubuntu/orchestration-service/web_service.py : python http application
- /home/ubuntu/replay-test/scripts/replayhost/run-replay-instance.sh : script to spin up replay hosts
- /home/ubuntu/replay-test/scripts/replayhost/terminate-replay-instance.sh : script to terminate existing replay hosts
- /home/ubuntu/orchestration.log : log from orchestration service
- /home/ubuntu/aws-replay-instances.txt : instance id list of aws replay hosts, used by termination script
- /tmp/aws-run-instance-out.json : full json from `aws run-instance` command

### `Additional Items`
- /home/ubuntu/scripts/process_orchestration_log.py : parses log to produce stats on timing

## Replay hosts

All scripts run under the `enf-replay` user. The `replay-test` github repository is cloned into this user's home directory.

### `Top Level Items`
- /home/enf-replay/orchestration-ip.txt : ip address of the orchestration service
- /tmp/replay.lock : lock file with pid of job that created the lock
- /home/enf-replay/replay-test/replay-client/replay_wrapper_script.sh : script the cronjob runs
- /home/enf-replay/replay-test/replay-client/start-nodeos-run-replay.sh : the script running the job
- /home/enf-replay/replay-test/config/*.ini : nodeos configuration files
- /data/nodeos/snapshot : location of snapshot to load
- /data/nodeos/data : data directory for nodeos
- /data/nodeos/log : log directory for nodeos
  - end_integrity_hash.txt : final integrity hash
  - nodeos.log : log from the syncing run
  - nodeos-readonly.log : log from the read-only spinup of nodeos

### `Additional Items`
- /home/enf-replay/replay-test/replay-client/background_status_update.sh : background job that sends progress updates to the orchestration service
- /home/enf-replay/replay-test/replay-client/config_operations.py : python script to HTTP POST integrity hash updates
- /home/enf-replay/replay-test/replay-client/create-nodeos-dir-struct.sh : init dir structure
- /home/enf-replay/replay-test/replay-client/get_integrity_hash_from_log.sh : pull out the integrity hash from nodeos logs
- /home/enf-replay/replay-test/replay-client/head_block_num_from_log.sh : pull out the most recent block process from nodeos logs
- /home/enf-replay/replay-test/replay-client/install-nodoes.sh : pull down deb and install locally
- /home/enf-replay/replay-test/replay-client/job_operations.py : python script to HTTP POST job updates and status changes
- /home/enf-replay/replay-test/replay-client/manage_blocks_log.sh : script to retrieve blocks.log from cloud storage
- /home/enf-replay/replay-test/replay-client/parse_json.py : parses JSON to bridge access to JSON from shell scripts
- /home/enf-replay/replay-test/replay-client/replay-node-cleanup.sh : cleans out previous run, creates a blank slate
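As an illustration of the shell-to-JSON bridge role described for `parse_json.py` above (the actual script's interface may differ), a minimal version could be:

```python
# Hypothetical sketch of a parse_json.py-style bridge (the real script's
# CLI is likely different): return one field from a JSON file so shell
# scripts can capture it, e.g.
#   START=$(python3 parse_json.py job.json start_block_num)
import json

def extract(path, key):
    """Load a JSON file and return the value stored under `key`."""
    with open(path) as f:
        return json.load(f)[key]
```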
2 changes: 2 additions & 0 deletions docs/running-tests.md
@@ -21,6 +21,7 @@ sudo apt install -y unzip python3 python3-pip
pip install datetime argparse werkzeug
cd orchestration-service/test
./run-pytest.sh
echo $? # expect 0
```

### Details
@@ -42,6 +43,7 @@ sudo apt install -y unzip python3 python3-pip
pip install datetime argparse werkzeug
cd replay-client/tests
./run.sh
echo $? # expect 0
```

### Details
