Merge branch 'main' into ehp/gh-39-orch-log-rotate
ericpassmore committed Feb 26, 2024
2 parents 3d7ee75 + ed1471a commit 276e55b
Showing 45 changed files with 2,352 additions and 225 deletions.
25 changes: 17 additions & 8 deletions README.md
@@ -21,44 +21,53 @@ Select `LowEndOrchestrator` and use the default template.
![OrchTemplaceSelect](docs/images/CDOrchTemplateSelect.png)

## Updating Orchestrator Job Configuration
-By default the setup will spin up a webservice with [Production Run from Nov 2023](meta-data/full-production-run-20231130.json). To change the job configuration you need to create your own JSON configuration, and restart the service to use the new JSON.
+By default the setup will spin up a webservice with [Production Run from Nov 2023](meta-data/full-production-run-20231130.json). To change the job configuration, create your own JSON configuration and restart the service to use the new JSON. **Note**: use `nohup` when starting the python webservice so the process keeps running after the ssh shell exits.
- Create your own JSON following the example format from `test-simple-jobs.json`
- Upload the file to the orchestrator node
- Log into the orchestrator node as `ubuntu` user
- Kill the existing service named `python3 ... web_service.py`
-- Restart with your configuration `python3 $HOME/replay-test/orchestration-service/web_service.py --config my-config.json --host 0.0.0.0 --log ~/orch-complete-timings.log &`
+- Restart with your configuration `nohup python3 $HOME/replay-test/orchestration-service/web_service.py --config my-config.json --host 0.0.0.0 --log ~/orch-complete-timings.log &`

## Replay Setup
You can spin up as many replay nodes as you need. Replay nodes continuously pick up and process new jobs. Each replay host works on one job at a time before picking up the next, so a small number of replay hosts will process all the jobs given enough time. For example, if 100 replay slices are configured, anywhere from 1 to 100 replay hosts may be utilized.

-To run the replay nodes ssh into the orchestrator node and run [run-replay-instance.sh](scripts/run-replay-instance.sh). The script takes two arguments the first is the number of replay hosts to spin up. The second argument indicates this is a dry run, and don't start up the hosts.
+To run the replay nodes, ssh into the orchestrator node and run [run-replay-instance.sh](scripts/replayhost/run-replay-instance.sh). The script takes two arguments: the first is the number of replay hosts to spin up; the second, when present, indicates a dry run that does not start up the hosts.
```
ssh -i private.key -l ubuntu orchestor
cd replay-test
-scripts/run-replay-instance.sh 10 [DRY-RUN]
+scripts/replayhost/run-replay-instance.sh 10 [DRY-RUN]
```

**Note**: It is important to run this script, as it injects the IP address of the orchestrator node into the replay nodes. Without this script you would need to manually update all the replay nodes with the IP address of the orchestrator.

## Web Dashboard
-You can see the status of jobs, configuration, and summary of replay status by using the webservice on the orchestrator node. Navigate to `http://orchestor.example.com:4000/`.
+You can see the status of jobs, configuration, and a summary of replay status by using the webservice on the orchestrator node. Navigate to `http://orchestor.example.com/`.

Many HTTP calls support HTML, JSON, and Text responses. Look at [HTTP Service Calls](docs/http-service-calls.md) for other URL options and Accept encoding options.
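The content negotiation described above can be sketched as a small dispatch on the `Accept` header. This is an illustrative sketch only, not the actual `web_service.py` implementation:

```python
# Illustrative sketch of Accept-header dispatch (not the actual
# web_service.py code): map a request's Accept header to one of the
# response formats the orchestrator endpoints support.

def pick_format(accept_header):
    """Return 'html', 'json', or 'text' for an HTTP Accept header."""
    accept = (accept_header or "").lower()
    if "text/html" in accept:
        return "html"
    if "application/json" in accept:
        return "json"
    return "text"  # default: plain text

print(pick_format("text/html,application/xhtml+xml"))  # html
```

With a dispatch like this, `curl -H 'Accept: application/json' http://orchestor.example.com/summary` would receive JSON while a browser would receive HTML.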

## Termination of Replay Nodes
-Replay nodes are not automatically terminated. To save on hosting costs, it is advisable to terminate the nodes after the replay tests are completed. Termination can be accomplished using the AWS dashboard.
+Replay nodes are not automatically terminated. To save on hosting costs, it is advisable to terminate the nodes after the replay tests are completed. Termination can be accomplished using the AWS dashboard or by running the termination script.

```
ssh -i private.key -l ubuntu orchestor
cd replay-test
scripts/replayhost/terminate-replay-instance.sh ALL [DRY-RUN]
```

## Operating Details
See [Operating Details](docs/operating-details.md) for a list of scripts, logs, and data.

## Testing
For testing options see [Running Tests](docs/running-tests.md)

## Generating Manifests
-The python script `replay-test/scripts/generate_manifest_from_eosnation.py` will build a manifest off the list of eos nation snapshots. A manifest may be validated for valid JSON and a contiguous block range using the [validate_manifest.py](scripts/validate_manifest.py) script
+The python script `replay-test/scripts/manifest/generate_manifest_from_eosnation.py` will build a manifest from the list of EOS Nation snapshots. A manifest may be validated for valid JSON and a contiguous block range using the [validate_manifest.py](scripts/manifest/validate_manifest.py) script.

Redirecting stdout is recommended, to separate the manifest output from the debug messages printed on stderr:
`python3 generate_manifest_from_eosnation.py --source-net mainnet 1> ./manifest-config.json`
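A contiguity check along the lines of what `validate_manifest.py` performs might look like the following sketch. The slice field names (`start_block_id`, `end_block_id`) are assumptions for illustration, not the manifest's confirmed schema:

```python
# Hedged sketch of a manifest contiguity check: once sorted by start
# block, each slice should begin right after the previous slice ends.
# Field names are illustrative assumptions.

def is_contiguous(slices):
    """True when the slices cover one gap-free block range."""
    ordered = sorted(slices, key=lambda s: s["start_block_id"])
    return all(
        nxt["start_block_id"] == cur["end_block_id"] + 1
        for cur, nxt in zip(ordered, ordered[1:])
    )
```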

### Options
In this release `block-space-between-slices`, `max-block-height`, and `min-block-height` are experimental.

- `--source-net` Defaults to `mainnet`. Which chain to target. Options include mainnet, kylin, and jungle
- `--leap-version` Defaults to `5.0.0`. Specify the version of leap to use from the builds
34 changes: 34 additions & 0 deletions config/nginx-replay-test.conf
@@ -0,0 +1,34 @@
# Default server configuration
#
server {
listen 80 default_server;
listen [::]:80 default_server;

# SSL configuration
#
# listen 443 ssl default_server;
# listen [::]:443 ssl default_server;
#

root /var/www/html;
index progress.html;
server_name _;

# pass these URLs to app
location ~ ^/(status|config|job|summary|healthcheck|replayhost|metrics|jobtimeoutcheck) {
proxy_buffering off;
proxy_pass http://127.0.0.1:4000;
}

# everything else serve static content
location / {
try_files $uri $uri/ =404;
}

# deny access to .htaccess files, if Apache's document root
# concurs with nginx's one
#
#location ~ /\.ht {
# deny all;
#}
}
3 changes: 2 additions & 1 deletion config/readonly-config.ini
@@ -1,6 +1,6 @@
# NOTES on EVM - no private key, no eosio::txn_test_gen_plugin
# continued notes on EVM state_history_plugin requires --disable-replay-opts
-eos-vm-oc-enable=auto
+eos-vm-oc-enable=on
abi-serializer-max-time-ms = 15
chain-state-db-size-mb = 65536
# chain-threads = 2
@@ -32,6 +32,7 @@ enable-stale-production = false
resource-monitor-not-shutdown-on-threshold-exceeded=true
read-only-read-window-time-us = 150000
read-only-write-window-time-us = 50000
http-max-response-time-ms = 100000

# must have plugins
plugin = eosio::chain_api_plugin
7 changes: 6 additions & 1 deletion config/sync-config.ini
@@ -1,6 +1,6 @@
# NOTES on EVM - no private key, no eosio::txn_test_gen_plugin
# continued notes on EVM state_history_plugin requires --disable-replay-opts
-eos-vm-oc-enable=auto
+eos-vm-oc-enable=on
abi-serializer-max-time-ms = 15
chain-state-db-size-mb = 65536
# chain-threads = 2
@@ -29,6 +29,11 @@ resource-monitor-not-shutdown-on-threshold-exceeded=true
read-only-read-window-time-us = 150000
read-only-write-window-time-us = 50000

# blocks log management
blocks-log-stride = 2000000
max-retained-block-files = 512
blocks-retained-dir = retained

# must have plugins
plugin = eosio::chain_api_plugin
plugin = eosio::chain_plugin
11 changes: 6 additions & 5 deletions docs/AWS-Host-Setup.md
@@ -3,14 +3,15 @@
## Orchestrator Node
- We use Ubuntu 22.04 OS on a t2.micro instance.
- You need to setup a private key for your host to support SSH.
-- IAM access allows the node to spin up relay nodes on the command line via `aws ec2 run-instances`
-- Security group opens port 4000 to private IP from replay Nodes
-- Security group opens port 4000 and SSH to administrator IPs (Your IP)
- Elastic IP used to set a static public IP will be bound to DNS entry later
+- IAM access allows the node to spin up replay nodes on the command line via `aws ec2 run-instances`
+- Security group opens the webservice to the private IPs of the replay nodes
+- Security group opens the webservice and SSH to administrator IPs (your IP)
- The User Data setup script may be found under [`scripts/orchestrator-bootstrap.sh`](../scripts/orchestrator-bootstrap.sh)

## Replay Nodes
-- We use unbuntu 22.04 OS on a TBD instance.
-- Mount an additional 32Gb SSD EC2 Storage Instance (mounted as /data by `replay-node-bootstrap.sh`)
+- We use Ubuntu 22.04 OS on a m5a.8xlarge instance.
+- Mount an additional gen2 32Gb SSD EC2 Storage Instance (mounted as /data by `replay-node-bootstrap.sh`)
- You need to setup a private key for your host to support SSH.
- IAM access allows the node to access S3 bucket (example `aws s3 ls`)
- Security group opens port SSH to orchestrator node, and administrator IPs (Your IP)
24 changes: 21 additions & 3 deletions docs/high-level-design.md
@@ -2,6 +2,8 @@

This service is designed to run multiple hosts that download, install, and run nodeos. Each host then loads a snapshot and syncs blocks to a specific block number. Once that process is finished, the host reports back the integrity hash representing nodeos' final state.

See [Operating Details](./operating-details.md) for a list of scripts, logs, and data.

## Overview
Once replay hosts are spun up they contact the orchestration service to get the information needed to run their jobs. The replay hosts update the orchestration service with their progress and current status. The orchestration service is single threaded, and has checks to ensure there are no overwrites or race conditions. The replay nodes use increasing backoffs to avoid sending too many simultaneous requests.
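The increasing backoff mentioned above can be sketched as exponential backoff with jitter. The constants and jitter strategy here are illustrative, not the replay client's actual values:

```python
import random

# Sketch of increasing backoff between requests to the orchestration
# service: the delay ceiling grows geometrically up to a cap, and the
# actual delay is drawn uniformly below the ceiling ("full jitter") so
# many replay hosts do not retry in lockstep.

def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=6):
    """Return a list of `attempts` delay values in seconds."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * factor ** attempt)
        delays.append(random.uniform(0, ceiling))  # full jitter
    return delays
```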

@@ -32,6 +34,17 @@ UpdateRelStyle(replayD, orchestrator, $offsetY="-35", $offsetX="+30")
Rel(replayD, orchestrator, "gets jobs/sets status", "HTTP")
```

## Orchestrator Lifecycle

Currently, adding and tearing down replay hosts is a manual process that requires a human to log into the orchestrator host and run scripts.

Once the orchestrator is set up and loaded with a configuration file, it has a list of jobs to run.
- All jobs start in `WAITING_4_WORKER` status
- Replay hosts are needed to pick up and process the jobs
The orchestrator spins up replay hosts via a command line script. This script passes the private IP of the orchestrator node along to the replay nodes.

At the end of the run, a script is called to terminate the replay node instances.

## Sequence
The replay host picks up a job and updates the job's status as it works through the lifecycle. A full list and description of the [HTTP API is documented separately](./http-service-calls.md). The replay host will update the progress by updating the last block processed. The full list of statuses is found here: https://github.com/eosnetworkfoundation/replay-test/blob/main/orchestration-service/job_status.py#L8-L15

Expand Down Expand Up @@ -61,11 +74,16 @@ sequenceDiagram
1. performs file setup: create dirs, get snapshot to load
2. GET job details from orchestration service, incls. block range
3. local non-priv install of nodeos
-4. starts nodeos loads the snapshot
+4. starts nodeos, loads the snapshot, sends the integrity hash
- sends the integrity hash from the snapshot back to orchestration service
- the snapshot block height is the same as the start block height of this job
- this integrity hash is the expected integrity for another job
- the other job has the end block num matching the current job's start block num
5. replays transactions to the specified block height from blocks.log or networked peers and terminates
6. restarts nodeos in read-only mode to get the final integrity hash
7. POSTs completed status for the configured block range
8. retains blocks logs, copying them over to cloud storage
-- this is the actual integrity hash for this job
+- the actual integrity hash for this job's end block num
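The hash chaining described in the steps above can be sketched as a consistency check: the integrity hash taken after loading a job's snapshot (at its start block) should equal the expected hash recorded for the job that ends at that block. Field names here are illustrative assumptions, not the orchestration service's confirmed schema:

```python
# Hedged sketch of integrity-hash chaining across jobs: the snapshot
# hash at job N's start block must match the expected hash of the job
# whose end block equals that start block.

def chain_is_consistent(jobs):
    """True when every adjacent pair of jobs agrees on the shared hash."""
    expected_at_end = {j["end_block_num"]: j["expected_integrity_hash"] for j in jobs}
    for job in jobs:
        prev_hash = expected_at_end.get(job["start_block_num"])
        if prev_hash is not None and prev_hash != job["snapshot_integrity_hash"]:
            return False
    return True
```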

Communicates to orchestration service via HTTP

@@ -76,4 +94,4 @@ Dependency on aws client, python3, curl, and large volume under /data
Final report shows
- number of blocks processed and percentage of total blocks processed
- number of successfully completed, failed jobs, and remaining jobs
-- list of failed jobs with `Job Id`, `Configuration Slice`, and status
+- list of failed jobs with `Job Id`, `Configuration Slice`, and `Status`
18 changes: 18 additions & 0 deletions docs/http-service-calls.md
@@ -4,6 +4,7 @@
- job - gets/sets configuration data for the replay nodes
- status - gets a replay nodes progress and state
- config - get/sets the configuration data used to initialize the job
- summary - progress of current run and reports any failed jobs
- healthcheck - gets 200/OK always

## Job
@@ -49,6 +50,23 @@ For the GET request when there are no parameters return statuses for all jobs. R
### POST
When running replay tests we don't always know the expected integrity hash, for example when the state database is updated, which may come as part of a leap version update. For that reason we take the integrity hash, after loading a snapshot, as the known good integrity hash at that block height. The `/config` POST request uses the `end_block_num` in the body to look up the configuration slice. Following that, the POST updates the configuration in memory and flushes it back to disk. This persists the integrity hash as the known good, expected value at `end_block_num`.
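The lookup-and-update described above can be sketched as follows. The slice field names and the flush-to-disk format are assumptions for illustration, not the service's confirmed behavior:

```python
import json

# Hedged sketch of /config POST handling: find the slice whose
# end_block_num matches the request, record the integrity hash as the
# expected value, and optionally flush the updated slices back to disk.

def update_expected_hash(slices, end_block_num, integrity_hash, path=None):
    """Return True when a matching slice was found and updated."""
    for s in slices:
        if s["end_block_num"] == end_block_num:
            s["expected_integrity_hash"] = integrity_hash
            if path:  # persist the in-memory update
                with open(path, "w") as f:
                    json.dump(slices, f, indent=2)
            return True
    return False
```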

## Summary (Progress)

### GET
Returns the following
- number of blocks processed
- total number of blocks to process
- jobs completed
- jobs failed
- jobs remaining

In addition, lists the failed jobs with the status, links to job details, and config slice.

Content type support:
- If the `Accept` header is `text/html`, returns HTML
- If the `Accept` header is `application/json`, returns JSON
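The numbers the summary reports can be sketched as a reduction over the job list. The status and field names here are illustrative (the real status set lives in `job_status.py`):

```python
# Hedged sketch of the /summary computation: totals over a job list.
# Status strings and field names are assumptions for illustration.

def summarize(jobs, total_blocks):
    """Return the progress counters a summary endpoint might report."""
    processed = sum(j["last_block_processed"] - j["start_block_num"] for j in jobs)
    completed = sum(1 for j in jobs if j["status"] == "COMPLETE")
    failed = sum(1 for j in jobs if j["status"] == "ERROR")
    return {
        "blocks_processed": processed,
        "total_blocks": total_blocks,
        "percent_complete": round(100 * processed / total_blocks, 1),
        "jobs_completed": completed,
        "jobs_failed": failed,
        "jobs_remaining": len(jobs) - completed - failed,
    }
```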


## Healthcheck
Always returns the same value; used for healthchecks

46 changes: 46 additions & 0 deletions docs/operating-details.md
@@ -0,0 +1,46 @@
# Operating Details

Outlines the scripts that are called and where data, logs, and configuration files live.

## Orchestration

All scripts run under the `ubuntu` user. The `replay-test` github repository is cloned into this user's home directory.
### `Top Level Items`
- /home/ubuntu/orchestration-service/web_service.py : python http application
- /home/ubuntu/replay-test/scripts/replayhost/run-replay-instance.sh : script to spin up replay hosts
- /home/ubuntu/replay-test/scripts/replayhost/terminate-replay-instance.sh : script to terminate existing replay hosts
- /home/ubuntu/orchestration.log : log from orchestration service
- /home/ubuntu/aws-replay-instances.txt : instance id list of aws replay hosts, used by termination script
- /tmp/aws-run-instance-out.json : full json from `aws run-instance` command

### `Additional Items`
- /home/ubuntu/scripts/process_orchestration_log.py : parses log to produce stats on timing

## Replay hosts

All scripts run under the `enf-replay` user. The `replay-test` github repository is cloned into this user's home directory.

### `Top Level Items`
- /home/enf-replay/orchestration-ip.txt : ip address of the orchestration service
- /tmp/replay.lock : lock file with pid of job that created the lock
- /home/enf-replay/replay-test/replay-client/replay_wrapper_script.sh : script the cronjob runs
- /home/enf-replay/replay-test/replay-client/start-nodeos-run-replay.sh : the script running the job
- /home/enf-replay/replay-test/config/*.ini : nodeos configuration files
- /data/nodeos/snapshot : location of snapshot to load
- /data/nodeos/data : data directory for nodeos
- /data/nodeos/log : log directory for nodeos
  - end_integrity_hash.txt : final integrity hash
  - nodeos.log : log from the syncing run
  - nodeos-readonly.log : log from the read-only spinup of nodeos

### `Additional Items`
- /home/enf-replay/replay-test/replay-client/background_status_update.sh : background job that sends progress updates to the orchestration service
- /home/enf-replay/replay-test/replay-client/config_operations.py : python script to HTTP POST integrity hash updates
- /home/enf-replay/replay-test/replay-client/create-nodeos-dir-struct.sh : init dir structure
- /home/enf-replay/replay-test/replay-client/get_integrity_hash_from_log.sh : pull out the integrity hash from nodeos logs
- /home/enf-replay/replay-test/replay-client/head_block_num_from_log.sh : pull out the most recent block process from nodeos logs
- /home/enf-replay/replay-test/replay-client/install-nodoes.sh : pull down deb and install locally
- /home/enf-replay/replay-test/replay-client/job_operations.py : python script to HTTP POST job updates and status changes
- /home/enf-replay/replay-test/replay-client/manage_blocks_log.sh : script to retrieve blocks.log from cloud storage
- /home/enf-replay/replay-test/replay-client/parse_json.py : parses JSON to bridge access to JSON from shell scripts
- /home/enf-replay/replay-test/replay-client/replay-node-cleanup.sh : cleans out previous run, creates a blank slate
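As an illustration of the shell-to-JSON bridge role described for `parse_json.py` above (the actual script's interface may differ), a minimal version could be:

```python
# Hypothetical sketch of a parse_json.py-style bridge (the real script's
# CLI is likely different): return one field from a JSON file so shell
# scripts can capture it, e.g.
#   START=$(python3 parse_json.py job.json start_block_num)
import json

def extract(path, key):
    """Load a JSON file and return the value stored under `key`."""
    with open(path) as f:
        return json.load(f)[key]
```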
2 changes: 2 additions & 0 deletions docs/running-tests.md
@@ -21,6 +21,7 @@ sudo apt install -y unzip python3 python3-pip
pip install datetime argparse werkzeug
cd orchestration-service/test
./run-pytest.sh
echo $? # expect 0
```

### Details
@@ -42,6 +43,7 @@ sudo apt install -y unzip python3 python3-pip
pip install datetime argparse werkzeug
cd replay-client/tests
./run.sh
echo $? # expect 0
```

### Details
