EOS EVM public endpoint cloud infrastructure documentation.
Caution
This repo is public, do not document sensitive information here!
Important
As an open-source software organization funded by and with obligations to our community, we make as much information publicly available as possible. However, sensitive details are described using labels that are distinct and definite without being determinate. Documentation in the private eos-evm-internal repo maps the indeterminate labels to our implementation-specific details. All of these details would be different for anyone else deploying this software stack anyway.
The community maintains the following endpoints for the public to interact with the EOS EVM.
Endpoint | Mainnet | Testnet | Notes |
---|---|---|---|
API | api.evm.eosnetwork.com | api.testnet.evm.eosnetwork.com | RPC API for tools like Frame, MetaMask, and Rabby to interact with the EOS EVM without running a full node. |
Bridge | bridge.evm.eosnetwork.com | bridge.testnet.evm.eosnetwork.com | Trustless bridge to move EOS tokens between the native chain and the EVM. |
Explorer | explorer.evm.eosnetwork.com | explorer.testnet.evm.eosnetwork.com | Block explorer and transaction viewer, running a fork of Blockscout. |
Faucet | - | faucet.testnet.evm.eosnetwork.com | Obtain EOS tokens for testing. The faucet is run by EOS Nation. |
Ownership ultimately lies with the community, which chose to use on-chain consensus mechanisms to delegate a leadership role over EOS EVM core software development and public endpoint operations to the EOS Network Foundation. The ENF collaborates with community contributors such as EOS Labs, EOS Nation, and independent contributors to accomplish these goals.
Responsibility for EOS EVM public endpoint operations is shared between several teams.
- ENF Automation
- Amazon Web Services (AWS) accounts, identity, and access management (IAM).
- Cloud network infrastructure for the API, bridge, and explorer.
- Cost analysis for ENF infrastructure.
- Domain Name Service (DNS) for all endpoints.
- ENF Engineering
- Compute.
- Core software development, including frontend.
- Database management.
- Deployment and upgrades of EVM core components.
- ENF Operations
- Billing for the API, bridge, and explorer.
- EOS Nation
- Faucet API, backend, billing, cloud infrastructure, and frontend.
Note
2024-03-07
EOS Labs recently volunteered to run the public endpoints. That means they will become responsible for everything listed above, except for core software development and probably the faucet.
The EOS EVM infrastructure is hosted on Amazon Web Services (AWS) and deployed manually.
There are currently two environments, a staging environment using the testnet chain and a production environment using the mainnet chain. Each environment is deployed to a different AWS account.
Environment | Chain | AWS Account |
---|---|---|
Production | EOS EVM Mainnet | evm-mainnet |
Staging | EOS EVM Testnet | evm-testnet |
The cloud network infrastructure is intentionally kept identical between all environments to increase the likelihood that bugs are discovered before changes are deployed to production.
Each environment spans multiple AWS regions, which are helpful to think of as datacenters.
Name | Region |
---|---|
ap | Asia-Pacific |
us | United States |
All systems use multiple availability zones (AZs) within each region, where applicable.
Tip
Globally distributed datacenters minimize the latency to users and maximize fault tolerance. Catastrophic failure of multiple availability zones in a single region is possible, both by accident and on purpose.
AWS supports user-defined names and tags to help identify resources.
Resource names are intended to be unique and specific enough that they are unambiguous, without being so specific that they aren't safe to discuss in an open forum. The naming schema is...
// AWS account name schema
account = `${product || repo}-${environment}`
// AWS resource name schema
resource = `${account}-${datacenter}-${system}-${service}-${component}-${version}`
...where:
- component - (optional) friendly name used to differentiate components, such as one virtual machine (VM) running a database and another VM running a web server.
- datacenter - documented above.
- environment - explained above.
- service - shorthand for the AWS service containing this resource.
- system - the larger system or deployment this resource is a part of.
- version - (optional) semantic version of software deployed to this resource, used only when resources are deployed concurrently with different versions.
Here are some examples.
evm-mainnet-ap-api-vm-miner-v0.1.1
evm-testnet-us-explorer-lb
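As a rough illustration, the schema can be expressed as a pair of small helper functions. This is a minimal sketch rather than tooling used by the project. The function names are hypothetical, and the environment part of the account name is assumed to be the chain label (`mainnet` or `testnet`) because that is what the examples above use.

```python
from typing import Optional


def account_name(product_or_repo: str, environment: str) -> str:
    """Build an AWS account name, e.g. 'evm-mainnet'."""
    return f"{product_or_repo}-{environment}"


def resource_name(
    account: str,
    datacenter: str,
    system: str,
    service: str,
    component: Optional[str] = None,
    version: Optional[str] = None,
) -> str:
    """Join the schema fields with hyphens, skipping optional fields that are unset."""
    parts = [account, datacenter, system, service, component, version]
    return "-".join(part for part in parts if part)


# Rebuild the two example names shown above.
print(resource_name(account_name("evm", "mainnet"), "ap", "api", "vm", "miner", "v0.1.1"))
print(resource_name(account_name("evm", "testnet"), "us", "explorer", "lb"))
```

Leaving `component` and `version` unset simply shortens the name, which matches how the second example omits both.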
In addition to the default tags populated by AWS, resources are tagged with the following to provide traceability.
Tag | Type | Deployment | Description |
---|---|---|---|
billing-use | Enum | All | Used for cost analysis in the management account (e.g. evm-api). |
branch | String | Automated | The git branch containing the code for this resource, if any. |
build | URL | Automated | The URL of the CI/CD build that deployed this resource. |
commit | SHA-1 | Automated | The git commit containing the code for this resource, if any. |
email | | All | The email address of the individual who deployed this resource. |
env | Enum | All | The environment this resource belongs to (prod, staging, dev, etc.). |
manual | Boolean | All | Whether this resource was deployed manually or by an automated system. |
repo | URL | Any | The URL of the GitHub repository containing the code for this resource, if any. |
tag | String | Any | The git tag containing the code for this resource, if any. |
ticket | URL | Manual | The ticket authorizing this resource to be deployed. |
These tags can also be used as dimensions in the AWS cost analysis tool.
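A tag audit along these lines could be sketched as follows. This is only an illustration and rests on assumptions the table does not state: "All" is read as required on every resource, "Automated" and "Manual" as required only for that kind of deployment, and "Any" as optional. The function name and the `"true"`/`"false"` encoding of the `manual` tag are also hypothetical.

```python
# Tag keys from the table above, grouped by the "Deployment" column.
# Assumption: "All" = always required, "Automated"/"Manual" = required only
# for that kind of deployment, "Any" = optional (so not checked here).
REQUIRED_ALWAYS = {"billing-use", "email", "env", "manual"}
REQUIRED_AUTOMATED = {"branch", "build", "commit"}
REQUIRED_MANUAL = {"ticket"}


def missing_tags(tags: dict[str, str]) -> set[str]:
    """Return the tag keys this sketch expects but the resource does not have."""
    required = set(REQUIRED_ALWAYS)
    # AWS tag values are strings, so the boolean is assumed to be "true" or "false".
    if tags.get("manual", "").lower() == "true":
        required |= REQUIRED_MANUAL
    else:
        required |= REQUIRED_AUTOMATED
    return required - set(tags)
```

For example, a manually deployed resource tagged with only `billing-use`, `email`, `env`, and `manual` would be reported as missing `ticket`.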
Each environment contains the following systems.
System | Architecture | Notes |
---|---|---|
API | Web Application | |
Bridge | Web Application | |
Explorer | Web Application | |
Faucet | External System | Testnet only. |
Metrics | AWS CloudWatch | |
Notifications | Event Handler | |
The web applications are all deployed using the exact same components, so the web application architecture will be documented once and any system-specific deviations will be described along the way.
The web application documentation will start from the EOS EVM core software and work outwards to the Internet.
Component | Scope |
---|---|
Server | Availability Zone |
Security Group | Datacenter |
Virtual Private Cloud (VPC) | Datacenter |
Target Group | Datacenter |
Health Check | Datacenter |
Load Balancer | Datacenter |
TLS Security Policy | Datacenter |
X.509 Certificate | Datacenter |
Global Accelerator | Global |
Web Application Firewall | Datacenter |
DNS (Route 53) | Global |
Metrics (CloudWatch) | Datacenter |
The ENF Engineering team deploys the EOS EVM core software on a set of virtual machines (VMs) using Amazon EC2 instances. The Amazon Relational Database Service (RDS) is also used.
Note
2024-03-09 The author is not aware of any documentation describing the specific architecture or deployment process used for the public endpoint server infrastructure. However, the RPC API deployment does look somewhat like this.
Each VM exposes the application to the cloud network infrastructure through a network interface inside a virtual private cloud (VPC).
System | Port | Protocol | Usage |
---|---|---|---|
API | 80 | TCP | API Traffic |
API | 8000 | HTTP | Health Check |
Bridge | 80 | HTTP | Web Traffic, Health Check |
Explorer | 80 | HTTP | Web Traffic, Health Check |
These ports are enforced by security groups, a simple network and transport layer AWS firewall service that is external to the VMs. Security groups are mandatory and deny all traffic by default.
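The default-deny behavior can be pictured as an allow-list lookup. The sketch below is a simplification under stated assumptions: the rule sets are hypothetical and built only from the ports in the table above, and real security group rules also match on the traffic source, which is omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    port: int
    protocol: str  # transport protocol, e.g. "tcp"


# Hypothetical inbound allow-lists, using the ports from the table above.
INBOUND_RULES = {
    "api": {Rule(80, "tcp"), Rule(8000, "tcp")},
    "bridge": {Rule(80, "tcp")},
    "explorer": {Rule(80, "tcp")},
}


def is_allowed(system: str, port: int, protocol: str = "tcp") -> bool:
    """Default deny: traffic passes only if a rule explicitly allows it."""
    return Rule(port, protocol) in INBOUND_RULES.get(system, set())
```

Anything not listed, such as port 8000 on the bridge or any port on an unknown system, is rejected without needing an explicit deny rule.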
A target group defines a set of targets, all virtual machines in this case, to receive application traffic. This includes the port and protocol to be used for both application traffic and for health checks. Target groups will only route application traffic to targets that have satisfied the health checks.
Health checks are performed on a per-VM basis according to a specific set of user-defined rules. An HTTP or HTTPS request is sent to the VM using the specified port and path. The VM must respond in a specific amount of time with an accepted HTTP status code. Any payload included in the response is ignored.
System | Port | Path | Protocol | Status Code | Interval (seconds) | Timeout (seconds) | Success Threshold | Failure Threshold |
---|---|---|---|---|---|---|---|---|
API | 8000 | / | HTTP | 200-299 | 30 | 5 | 5 responses | 2 requests |
Bridge | 80 | / | HTTP | 200-299 | 30 | 5 | 5 responses | 2 requests |
Explorer | 80 | / | HTTP | 200-299 | 30 | 5 | 5 responses | 2 requests |
A virtual machine must meet the success threshold using consecutive responses to transition into the healthy state and begin receiving traffic. The failure threshold is also determined using consecutive timeouts or bad status codes.
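The consecutive-threshold behavior can be modeled as a small state machine. This is an illustrative sketch, not AWS code: the class name is hypothetical, the defaults come from the table above (5 successes, 2 failures), and a timeout is represented as `None`.

```python
from typing import Optional


class TargetHealth:
    """Track one VM's health using consecutive-result thresholds."""

    def __init__(self, success_threshold: int = 5, failure_threshold: int = 2):
        self.success_threshold = success_threshold
        self.failure_threshold = failure_threshold
        self.healthy = False  # a new target gets no traffic until proven healthy
        self.streak = 0  # consecutive results that contradict the current state

    def record(self, status_code: Optional[int]) -> bool:
        """Record one check result (None means timeout) and return the new state."""
        passed = status_code is not None and 200 <= status_code <= 299
        if passed == self.healthy:
            self.streak = 0  # result agrees with the current state, so reset
            return self.healthy
        self.streak += 1
        threshold = self.failure_threshold if self.healthy else self.success_threshold
        if self.streak >= threshold:
            self.healthy = passed
            self.streak = 0
        return self.healthy
```

Because the streak resets whenever a result agrees with the current state, a single good response between two failures keeps a healthy VM in service.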
Important
Global Accelerator also performs health checks, which are configured separately but are intentionally kept as similar as possible.
An application load balancer (ALB) operates at the application layer of the OSI model to evenly distribute client traffic between one or more healthy targets, VMs in this case, as determined by target groups.
Tip
These ALBs use a routing algorithm known as "least outstanding requests" to distribute traffic among healthy targets within a target group. This algorithm is designed to distribute incoming requests evenly across all healthy targets within the target group based on the number of outstanding requests each target is currently serving. This helps optimize resource utilization and prevents any single target from being overwhelmed with traffic.
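The selection step of "least outstanding requests" can be sketched in a few lines. This is a simplified model with hypothetical VM names and counts. Ties here go to the first entry, which is a choice made for this sketch and not a statement about how AWS breaks ties.

```python
def pick_target(outstanding: dict[str, int]) -> str:
    """Return the healthy target currently serving the fewest requests."""
    if not outstanding:
        raise ValueError("no healthy targets available")
    # min() returns the first minimum it finds, so ties go to the earliest entry.
    return min(outstanding, key=outstanding.get)


# Hypothetical VM names and in-flight request counts.
in_flight = {"vm-a": 3, "vm-b": 1, "vm-c": 2}
chosen = pick_target(in_flight)
in_flight[chosen] += 1  # the new request is now outstanding on that VM
```

Only healthy targets should appear in the mapping, since the target group excludes any VM that has not satisfied its health checks.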
Logically, ALBs map client requests from listeners to target groups according to user-defined rules.
System | Listener | Rule |
---|---|---|
API | 80 | HTTP 301 redirect to https://${host}:443/${path}?${query}. |
API | 443 | Forward to target group. |
Bridge | 80 | HTTP 301 redirect to https://${host}:443/${path}?${query}. |
Bridge | 443 | Forward to target group. |
Explorer | 80 | HTTP 301 redirect to https://${host}:443/${path}?${query}. |
Explorer | 443 | Forward to target group. |
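The redirect rule on the port 80 listeners amounts to rebuilding the request URL with a new scheme and port. The sketch below shows that transformation using only the Python standard library. The function name is hypothetical and this is not how the ALB itself is implemented.

```python
from urllib.parse import urlsplit, urlunsplit


def https_redirect(url: str) -> tuple[int, str]:
    """Return the status code and Location header a port 80 listener would send."""
    parts = urlsplit(url)
    # Keep host, path, and query; switch the scheme and pin the port to 443.
    location = urlunsplit(("https", f"{parts.hostname}:443", parts.path, parts.query, ""))
    return 301, location


print(https_redirect("http://api.evm.eosnetwork.com/foo?x=1"))
```

A 301 is a permanent redirect, so well-behaved clients can cache it and go straight to HTTPS on later requests.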
Tip
Application load balancers sit behind a security group, a simple network and transport layer firewall native to AWS that enforces these ports, just like VM network interfaces.
ALBs have some additional attributes.
System | Attribute | State | Description |
---|---|---|---|
API | HTTP/2 | On | Support HTTP/2, and use it by default. |
API | Drop invalid header fields | Off | Strip invalid HTTP headers from requests. |
API | Preserve HOST header | Off | Preserve client HOST HTTP header for server. |
API | X-Forwarded-For header | Append | Pack client information in the X-Forwarded-For HTTP header for the server. |
API | Client port preservation | On | Include client origin port in X-Forwarded-For header. |
API | TLS version and cipher headers | On | Add the x-amzn-tls-version and x-amzn-tls-cipher-suite headers to requests forwarded to the server. |
Bridge | HTTP/2 | On | Support HTTP/2, and use it by default. |
Bridge | Drop invalid header fields | Off | Strip invalid HTTP headers from requests. |
Bridge | Preserve HOST header | Off | Preserve client HOST HTTP header for server. |
Bridge | X-Forwarded-For header | Append | Pack client information in the X-Forwarded-For HTTP header for the server. |
Bridge | Client port preservation | On | Include client origin port in X-Forwarded-For header. |
Bridge | TLS version and cipher headers | On | Add the x-amzn-tls-version and x-amzn-tls-cipher-suite headers to requests forwarded to the server. |
Explorer | HTTP/2 | On | Support HTTP/2, and use it by default. |
Explorer | Drop invalid header fields | Off | Strip invalid HTTP headers from requests. |
Explorer | Preserve HOST header | Off | Preserve client HOST HTTP header for server. |
Explorer | X-Forwarded-For header | Append | Pack client information in the X-Forwarded-For HTTP header for the server. |
Explorer | Client port preservation | On | Include client origin port in X-Forwarded-For header. |
Explorer | TLS version and cipher headers | On | Add the x-amzn-tls-version and x-amzn-tls-cipher-suite headers to requests forwarded to the server. |
The EOS EVM public endpoints require clients to use HTTPS with TLS to connect. TLS termination is performed at the load balancers using an X.509 certificate from AWS Certificate Manager (ACM) according to rules defined by any one of several AWS-provided security policies.
Tip
The TLS specification requires servers to select the latest TLS version supported by both the client and the server.
All TLS v1.3 and ECDHE-* cipher suites provide forward secrecy.
System | Security Policy | TLS Version | Cipher Suite |
---|---|---|---|
API | ELBSecurityPolicy-TLS13-1-2-Res-2021-06 | TLS v1.2 | ECDHE-ECDSA-AES128-GCM-SHA256, ECDHE-ECDSA-AES256-GCM-SHA384, ECDHE-RSA-AES128-GCM-SHA256, ECDHE-RSA-AES256-GCM-SHA384 |
API | ELBSecurityPolicy-TLS13-1-2-Res-2021-06 | TLS v1.3 | TLS_AES_128_GCM_SHA256, TLS_AES_256_GCM_SHA384, TLS_CHACHA20_POLY1305_SHA256 |
Bridge | ELBSecurityPolicy-TLS13-1-2-Res-2021-06 | TLS v1.2 | ECDHE-ECDSA-AES128-GCM-SHA256, ECDHE-ECDSA-AES256-GCM-SHA384, ECDHE-RSA-AES128-GCM-SHA256, ECDHE-RSA-AES256-GCM-SHA384 |
Bridge | ELBSecurityPolicy-TLS13-1-2-Res-2021-06 | TLS v1.3 | TLS_AES_128_GCM_SHA256, TLS_AES_256_GCM_SHA384, TLS_CHACHA20_POLY1305_SHA256 |
Explorer | ELBSecurityPolicy-TLS13-1-2-Res-2021-06 | TLS v1.2 | ECDHE-ECDSA-AES128-GCM-SHA256, ECDHE-ECDSA-AES256-GCM-SHA384, ECDHE-RSA-AES128-GCM-SHA256, ECDHE-RSA-AES256-GCM-SHA384 |
Explorer | ELBSecurityPolicy-TLS13-1-2-Res-2021-06 | TLS v1.3 | TLS_AES_128_GCM_SHA256, TLS_AES_256_GCM_SHA384, TLS_CHACHA20_POLY1305_SHA256 |
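To make the effect of the security policy concrete, the sketch below models version and cipher selection as a simple lookup over the policy in the table above. It is a deliberately simplified picture of a TLS handshake. The function name is hypothetical, and the cipher preference order is assumed to follow the order in which the table lists the suites.

```python
from typing import Optional

# Versions and cipher suites offered by the policy in the table above.
POLICY = {
    "TLSv1.3": [
        "TLS_AES_128_GCM_SHA256",
        "TLS_AES_256_GCM_SHA384",
        "TLS_CHACHA20_POLY1305_SHA256",
    ],
    "TLSv1.2": [
        "ECDHE-ECDSA-AES128-GCM-SHA256",
        "ECDHE-ECDSA-AES256-GCM-SHA384",
        "ECDHE-RSA-AES128-GCM-SHA256",
        "ECDHE-RSA-AES256-GCM-SHA384",
    ],
}
VERSION_ORDER = ["TLSv1.3", "TLSv1.2"]  # newest first


def negotiate(client_versions: set[str], client_ciphers: set[str]) -> Optional[tuple[str, str]]:
    """Pick the newest shared version, then the first policy cipher the client offers."""
    for version in VERSION_ORDER:
        if version not in client_versions:
            continue
        for cipher in POLICY[version]:
            if cipher in client_ciphers:
                return version, cipher
    return None  # no overlap, so the handshake would fail
```

A client that offers only older protocol versions or cipher suites gets `None` here, which corresponds to the load balancer refusing the connection.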
The load balancers obtain X.509 certificates from AWS Certificate Manager (ACM). These certificates are issued by Amazon Trust Services, a public certificate authority (CA) operated by Amazon Web Services, and are used by clients to verify the identity of the server.
Tip
Certificate authorities require proof of ownership over domain names in order to issue a certificate. This is accomplished with ACM by adding a CNAME DNS record for each domain name listed on a certificate. ACM will automatically renew the certificate as long as the CNAME records are present.
Certificates are issued separately for each region in each environment.
Chain | Region | Certificate | Algorithm | Alternative Names |
---|---|---|---|---|
mainnet | ap | evm.eosnetwork.com | ECDSA NIST P-384 | evm.eosnetwork.com, api.evm.eosnetwork.com, bridge.evm.eosnetwork.com, explorer.evm.eosnetwork.com |
mainnet | us | evm.eosnetwork.com | ECDSA NIST P-384 | evm.eosnetwork.com, api.evm.eosnetwork.com, bridge.evm.eosnetwork.com, explorer.evm.eosnetwork.com |
testnet | ap | testnet.evm.eosnetwork.com | ECDSA NIST P-384 | testnet.evm.eosnetwork.com, api.testnet.evm.eosnetwork.com, bridge.testnet.evm.eosnetwork.com, explorer.testnet.evm.eosnetwork.com, faucet.testnet.evm.eosnetwork.com |
testnet | us | testnet.evm.eosnetwork.com | ECDSA NIST P-384 | testnet.evm.eosnetwork.com, api.testnet.evm.eosnetwork.com, bridge.testnet.evm.eosnetwork.com, explorer.testnet.evm.eosnetwork.com, faucet.testnet.evm.eosnetwork.com |
All domain names used across all systems are listed on each certificate so that if engineering shares resources between systems, such as calling the API or embedding the bridge on the explorer site, it will not violate browser content security policy or same-origin policy.
AWS Global Accelerator improves the performance and availability of the EOS EVM public endpoints by ingesting client traffic at the nearest edge location, then routing it to the closest healthy endpoint over private fiber as priority traffic rather than across the public Internet.
Global Accelerator determines where to send client requests according to the following rules.
- Proximity-Based Routing - route traffic to the datacenter nearest to the client in terms of network latency.
- Health-Based Routing - route traffic to the nearest healthy endpoint, providing automatic global failover in the event of partial service outage.
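Taken together, the two rules reduce to "pick the lowest-latency datacenter that is healthy." The sketch below illustrates that decision with hypothetical latency figures. It is a model of the behavior described above, not of Global Accelerator's internals.

```python
from typing import Optional


def route(latency_ms: dict[str, float], healthy: set[str]) -> Optional[str]:
    """Pick the lowest-latency datacenter among those that are healthy."""
    candidates = {dc: ms for dc, ms in latency_ms.items() if dc in healthy}
    if not candidates:
        return None  # total outage: nowhere to send the request
    return min(candidates, key=candidates.get)


# Hypothetical latencies measured from one client.
latencies = {"ap": 40.0, "us": 180.0}
print(route(latencies, {"ap", "us"}))  # both healthy: nearest datacenter wins
print(route(latencies, {"us"}))        # ap unhealthy: fail over to us
```

The second call shows the failover case: the client pays a latency penalty but still gets served.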
Note
Global Accelerator is not a content delivery network (CDN).
Work is planned to migrate the bridge and the explorer to a CDN.
Amazon claims:
For TCP traffic, as measured by third-party real user measurement tools at the 90th percentile (p90), Global Accelerator decreases first byte latency by up to 49%, jitter by up to 58%, and improves throughput by up to 60%. [...] "By enabling AWS Global Accelerator, one [...] customer saw a 51.2% reduction in mean end-to-end app load times." [Another customer] was able to "decrease response time from more than 200 milliseconds to less than 4 milliseconds, a 98% improvement."
Tip
AWS offers a nifty website that enables the user to race requests of various sizes over Global Accelerator against the public Internet to various regions around the world.
Global Accelerator offers some configuration options.
System | Listener Port | Listener Protocol | Target Load Balancer | Target Weight | Health Check Port | Health Check Protocol | Health Check Interval (seconds) | Health Check Threshold |
---|---|---|---|---|---|---|---|---|
API | 80, 443 | TCP | *-ap-api-lb | 100 | 8000 | TCP | 30 | 3 requests |
API | 80, 443 | TCP | *-us-api-lb | 100 | 8000 | TCP | 30 | 3 requests |
Bridge | 80, 443 | TCP | *-ap-bridge-lb | 100 | 80 | TCP | 30 | 3 requests |
Bridge | 80, 443 | TCP | *-us-bridge-lb | 100 | 80 | TCP | 30 | 3 requests |
Explorer | 80, 443 | TCP | *-ap-explorer-lb | 100 | 80 | TCP | 30 | 3 requests |
Explorer | 80, 443 | TCP | *-us-explorer-lb | 100 | 80 | TCP | 30 | 3 requests |
Notice that, unlike load balancer and target group health checks, the success and failure thresholds are configured together.
A web application firewall (WAF) is a managed service that protects web applications by using heuristics to block malicious traffic which could affect availability, compromise security, or consume excessive resources. This can include mitigation for distributed denial-of-service (DDoS) attacks, cross-site scripting (XSS), and SQL injection attacks. The WAF can be enforced at the edge or at the load balancer.
Tip
In the AWS WAF web UI, the WAFs are called web access control lists (web ACLs).
The domain name service (DNS) is a distributed system that translates human-readable domain names into IP addresses. The EOS EVM public endpoints use Amazon Route 53 to manage DNS records.
Control over evm.eosnetwork.com and all subdomains is delegated to the evm-mainnet AWS account by an external system. This account contains all DNS records for the mainnet endpoints along with records for certificate validation.
The evm-mainnet account delegates control over testnet.evm.eosnetwork.com and all subdomains to the evm-testnet AWS account, which contains all DNS records for the testnet endpoints and certificate validation.
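The delegation chain means that the most specific zone containing a hostname decides which account holds its records. A minimal sketch of that longest-suffix lookup follows. The function name is hypothetical and the zone map contains only the two zones described above.

```python
from typing import Optional

# Delegated zones and the AWS account that owns each, per the text above.
ZONES = {
    "evm.eosnetwork.com": "evm-mainnet",
    "testnet.evm.eosnetwork.com": "evm-testnet",
}


def owning_account(hostname: str) -> Optional[str]:
    """Return the account whose zone is the longest suffix match for hostname."""
    labels = hostname.rstrip(".").lower().split(".")
    # Try the full name first, then drop leading labels one at a time.
    for i in range(len(labels)):
        zone = ".".join(labels[i:])
        if zone in ZONES:
            return ZONES[zone]
    return None  # outside the delegated zones


print(owning_account("api.evm.eosnetwork.com"))
print(owning_account("faucet.testnet.evm.eosnetwork.com"))
```

Names outside evm.eosnetwork.com return `None`, reflecting that they are managed by the external system that performs the first delegation.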
The EOS EVM testnet faucet is operated by EOS Nation and is hosted on their cloud infrastructure. As such, the only responsibility of the evm-testnet AWS account is to provide DNS records for the faucet.
The EOS EVM public endpoints use Amazon CloudWatch for metrics, dashboards, and alarms. CloudWatch is an AWS-managed service with built-in integrations to all of the other managed services used by the public endpoints, so a large number of metrics are vacuumed up by default. This includes metrics on health checks, server resource utilization, endpoint traffic analysis, the nature of malicious traffic blocked by the web application firewall, and more. System and application logs can be ingested and analyzed for an additional fee.
Tip
Metrics are not currently exported to any platform-agnostic systems such as Prometheus and Grafana, but this was originally on the roadmap and will be done as the need arises.
Cost analysis is performed using the AWS Cost Explorer tool, which can be filtered by the tags described above.
CloudWatch alarms are used to notify stakeholders when specific metrics cross specific thresholds.
Chain | System | DC | Alarm | Fault Condition |
---|---|---|---|---|
mainnet | API | ap | evm-mainnet-ap-api-tg_sick-hosts | UnHealthyHostCount ≥ 1 for 1 datapoint(s) within 1 minute. |
mainnet | API | us | evm-mainnet-us-api-tg_sick-hosts | UnHealthyHostCount ≥ 1 for 1 datapoint(s) within 1 minute. |
mainnet | Bridge | us | evm-mainnet-us-bridge-tg_sick-hosts | UnHealthyHostCount ≥ 1 for 1 datapoint(s) within 1 minute. |
mainnet | Explorer | us | evm-mainnet-us-explorer-tg_sick-hosts | UnHealthyHostCount ≥ 1 for 1 datapoint(s) within 1 minute. |
testnet | API | ap | evm-testnet-ap-api-tg_sick-hosts | UnHealthyHostCount ≥ 1 for 1 datapoint(s) within 1 minute. |
testnet | API | us | evm-testnet-us-api-tg_sick-hosts | UnHealthyHostCount ≥ 1 for 1 datapoint(s) within 1 minute. |
testnet | Bridge | us | evm-testnet-us-bridge-tg_sick-hosts | UnHealthyHostCount ≥ 1 for 1 datapoint(s) within 1 minute. |
testnet | Explorer | us | evm-testnet-us-explorer-tg_sick-hosts | UnHealthyHostCount ≥ 1 for 1 datapoint(s) within 1 minute. |
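The fault condition in the table can be read as "at least 1 of the most recent 1 datapoints is at or above 1." A generalized sketch of that evaluation is below. The function and parameter names are hypothetical, and the treatment of missing data, which real CloudWatch alarms let you configure, is ignored here.

```python
def in_alarm(
    datapoints: list[float],
    threshold: float = 1,
    datapoints_to_alarm: int = 1,
    evaluation_periods: int = 1,
) -> bool:
    """Return True if enough of the most recent datapoints breach the threshold."""
    recent = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in recent if value >= threshold)
    return breaching >= datapoints_to_alarm


# One datapoint per minute of UnHealthyHostCount, oldest first (made-up values).
print(in_alarm([0, 0, 1]))  # latest minute has a sick host
print(in_alarm([1, 0]))     # the sick host has already recovered
```

With the defaults, a single unhealthy host in the latest one-minute datapoint is enough to trip the alarm, which favors fast notification over noise suppression.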
All alarms defined in CloudWatch are automatically "seen" and handled by the notification system.
The EOS EVM public endpoint cloud infrastructure can automatically notify stakeholders about system health or security events using any combination of email, instant messaging (IM), and SMS.
CloudWatch automatically ingests metrics from all AWS managed services used by the public endpoints. These metrics can be used to define alarms. All alarms defined in CloudWatch are automatically "seen" and handled by the notification system.
This system ingests CloudWatch alarm state-change events using Amazon EventBridge, which puts them in a Simple Notification Service (SNS) "topic," similar to a queue. Each event contains a JSON payload with information about the alarm and the context of the state change. The aws-cloudwatch-alarm-handler is an AWS Lambda function subscribed to this topic. It takes the machine-readable event information, generates a human-readable message about the alarm, then puts this message in a second SNS topic called notify-evm. This topic uses AWS-provided services to send the human-readable message out via email and SMS. Finally, the telegram-bot Lambda function is subscribed to the notify-evm topic and sends the message to stakeholders via instant messaging (IM).
---
title: Notification System
---
flowchart TB
subgraph cw["`**CloudWatch**`"]
direction LR
metric["📊 Metric"]
alarm["🚨 Alarm"]
metric -.- alarm
end
alarm ---> |📈<br/>alarm state-change event| eventBus
subgraph eb["`**EventBridge**`"]
direction LR
eventBus["🚏 Event Bus<br/><br/><code>default</code>"]
rule["📜 Rule"]
eventBus -.- rule
end
eventBus ---> |📨| topic1
rule -.-x |source| cw
rule -.-x |target| topic1
subgraph sns1["`**SNS**`"]
direction TB
topic1["🟡 Topic<br/><br/>cloudwatch-alarm-state-change-event"]
subscription1["🗞️ Subscription"]
topic1 -.- subscription1
end
topic1 ---> |💽<br/>machine-readable event| function1
subscription1 -.-> function1
subgraph lambda1["`**Lambda**`"]
direction TB
function1["📖 Function<br/><br/><code>aws-cloudwatch-alarm-handler</code>"]
config1["Config"]
function1 -.- config1
end
function1 ---> |📄<br/>formatted text| topic2
subgraph sns2["`**SNS**`"]
direction TB
topic2["🟪 Topic<br/><br/>notify-evm"]
subscription2["🗞️ Subscription"]
topic2 -.- subscription2
end
topic2 ---> |📄| function2
topic2 ---> |📧| email
topic2 ---> |📶| sms
subscription2 -.-> function2
subgraph lambda2["`**Lambda**`"]
direction TB
function2["🤖 Function<br/><br/><code>telegram-bot</code>"]
config2["Config"]
function2 -.- config2
end
function2 ---> |💬| im
subgraph stakeholders["`**Stakeholders**`"]
direction TB
email["📥 Email"]
sms["📲 SMS"]
im["💻 IM"]
end
email -.-> |👤| subscription2
sms -.-> |👤| subscription2
im -.-> |🗝️👤| config2
alarm ~~~ rule
For more information about the notification system components, including examples of the JSON payload and human-readable messages, check out the aws-cloudwatch-alarm-handler and telegram-bot GitHub repositories.
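As a rough picture of the formatting step, the sketch below unwraps the SNS envelope a Lambda function receives and turns a CloudWatch alarm state-change event into one line of text. It is not the code from the aws-cloudwatch-alarm-handler repository. The function names, the message wording, and the trimmed sample event (including its region) are assumptions made for illustration.

```python
import json


def format_alarm(event: dict) -> str:
    """Turn a CloudWatch alarm state-change event into a one-line message."""
    detail = event["detail"]
    old = detail["previousState"]["value"]
    new = detail["state"]["value"]
    return f"{detail['alarmName']} in {event['region']}: {old} -> {new} ({detail['state']['reason']})"


def handle_sns(sns_event: dict) -> list[str]:
    """Unwrap the SNS envelope that Lambda receives and format each message."""
    return [format_alarm(json.loads(record["Sns"]["Message"])) for record in sns_event["Records"]]


# Trimmed, hypothetical event; real events carry many more fields.
sample = {
    "region": "us-east-1",
    "detail": {
        "alarmName": "evm-testnet-us-api-tg_sick-hosts",
        "state": {"value": "ALARM", "reason": "Threshold Crossed"},
        "previousState": {"value": "OK"},
    },
}
print(handle_sns({"Records": [{"Sns": {"Message": json.dumps(sample)}}]}))
```

SNS delivers the message body as a JSON string, which is why the handler has to call `json.loads` before it can read the alarm fields.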
Infrastructure changes are always deployed, one at a time, as follows.
- A maintenance window is scheduled with stakeholders, during which no other changes are taking place.
- This guarantees all stakeholders are informed.
- This reduces the number of independent variables, minimizing the time to resolution should service degradation be observed.
- Testnet endpoint functionality is verified using a virtual private network (VPN) to perform smoke tests against all affected endpoints, each from a number of different cities.
- The cities selected must exercise all datacenters.
- The set of cities should be large, to exercise content delivery networks (CDNs) or other edge compute.
- The cities used and results observed must be written down so the tests can be reproduced.
- If any tests fail then the deployment must be deferred until the system is in a known-good state.
- Changes are deployed to the testnet staging environment.
- Testnet endpoint functionality is validated using smoke tests from the same cities as before.
- A waiting period is observed.
- This gives the community time to identify and report bugs.
- This should be two business days to one week, and must be no less than twenty-four (24) hours.
- Mainnet endpoint functionality is verified using smoke tests from a set of cities meeting the criteria above.
- Changes are deployed to the mainnet production environment.
- Mainnet endpoint functionality is validated using smoke tests from the same cities as before.
If service degradation is observed at any point in this process then all changes must be reverted, and the process must start over.
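Two of the rules in this process are mechanical enough to check in code: the selected cities must exercise every datacenter, and at least twenty-four hours must pass between the testnet and mainnet deployments. The sketch below illustrates both checks. The function names and the city-to-datacenter mapping are hypothetical, since which datacenter actually serves a city has to be observed during the smoke tests.

```python
from datetime import datetime, timedelta

MIN_WAIT = timedelta(hours=24)  # hard minimum from the process above

# Hypothetical mapping of VPN exit cities to the datacenter expected to serve them.
CITY_TO_DATACENTER = {"Singapore": "ap", "Tokyo": "ap", "Chicago": "us", "Dallas": "us"}


def uncovered_datacenters(cities: list[str], datacenters=("ap", "us")) -> set[str]:
    """Return the datacenters that no selected city would exercise."""
    exercised = {CITY_TO_DATACENTER[c] for c in cities if c in CITY_TO_DATACENTER}
    return set(datacenters) - exercised


def may_deploy_to_mainnet(testnet_deployed_at: datetime, now: datetime) -> bool:
    """True once the minimum waiting period after the testnet deployment has passed."""
    return now - testnet_deployed_at >= MIN_WAIT
```

An empty result from `uncovered_datacenters` means the city list satisfies the coverage rule. The waiting-period check only enforces the floor, not the recommended two business days to one week.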
More resources.
- ../README.md ⤴
- aws-cloudwatch-alarm-handler lambda
- eos-evm-internal - internal-facing documentation of a sensitive nature.
- Runbooks
- telegram-bot lambda
Legal Notice
This repo contains assets created in collaboration with a large language model, machine learning algorithm, or weak artificial intelligence (AI). This notice is required in some countries.