Cloud Infrastructure

EOS EVM public endpoint cloud infrastructure documentation.

Caution

This repo is public, do not document sensitive information here!

Important

As an open-source software organization funded by and with obligations to our community, we make as much information publicly available as possible. However, sensitive details are described using labels that are distinct and definite without being determinate. Documentation in the private eos-evm-internal repo maps the indeterminate labels to our implementation-specific details. All of these details would be different for anyone else deploying this software stack anyway.

Contents

  1. Endpoints
  2. Ownership
  3. Context
    1. Environments
    2. Datacenters
    3. Resources
      1. Names
      2. Tags
  4. System Architecture
    1. Web Application
      1. Servers
      2. Ports
      3. Target Groups
      4. Health Checks
      5. Load Balancers
      6. TLS
      7. Certificates
      8. Global Accelerator
      9. Web Application Firewall
    2. DNS
    3. Faucet
    4. Metrics
      1. Alarms
    5. Notifications
  5. Deployment Strategy
  6. See Also

Endpoints

The community maintains the following endpoints for the public to interact with the EOS EVM.

| Endpoint | Mainnet | Testnet | Notes |
| --- | --- | --- | --- |
| API | api.evm.eosnetwork.com | api.testnet.evm.eosnetwork.com | RPC API for tools like Frame, MetaMask, and Rabby to interact with the EOS EVM without running a full node. |
| Bridge | bridge.evm.eosnetwork.com | bridge.testnet.evm.eosnetwork.com | Trustless bridge to move EOS tokens between the native chain and the EVM. |
| Explorer | explorer.evm.eosnetwork.com | explorer.testnet.evm.eosnetwork.com | Block explorer and transaction viewer, running a fork of Blockscout. |
| Faucet | - | faucet.testnet.evm.eosnetwork.com | Obtain EOS tokens for testing. The faucet is run by EOS Nation. |

Ownership

Ownership ultimately lies with the community, which chose to use on-chain consensus mechanisms to delegate a leadership role over EOS EVM core software development and public endpoint operations to the EOS Network Foundation. The ENF collaborates with community contributors such as EOS Labs, EOS Nation, and independent contributors to accomplish these goals.

Responsibility for EOS EVM public endpoint operations is shared between several teams.

  • ENF Automation
    • Amazon Web Services (AWS) accounts, identity, and access management (IAM).
    • Cloud network infrastructure for the API, bridge, and explorer.
    • Cost analysis for ENF infrastructure.
    • Domain Name System (DNS) for all endpoints.
  • ENF Engineering
    • Compute.
    • Core software development, including frontend.
    • Database management.
    • Deployment and upgrades of EVM core components.
  • ENF Operations
    • Billing for the API, bridge, and explorer.
  • EOS Nation
    • Faucet API, backend, billing, cloud infrastructure, and frontend.

Note

2024-03-07
EOS Labs recently volunteered to run the public endpoints. That means they will become responsible for all list elements above, except for core software development and probably the faucet.

Context

The EOS EVM infrastructure is hosted on Amazon Web Services (AWS) and deployed manually.

Environments

There are currently two environments, a staging environment using the testnet chain and a production environment using the mainnet chain. Each environment is deployed to a different AWS account.

| Environment | Chain | AWS Account |
| --- | --- | --- |
| Production | EOS EVM Mainnet | evm-mainnet |
| Staging | EOS EVM Testnet | evm-testnet |

The cloud network infrastructure is intentionally kept identical between all environments to increase the likelihood that bugs are discovered before changes are deployed to production.

Datacenters

Each environment spans multiple AWS regions, which are helpful to think of as datacenters.

| Name | Region |
| --- | --- |
| ap | Asia-Pacific |
| us | United States |

All systems use multiple availability zones (AZs) within each region, where applicable.

Tip

Globally distributed datacenters minimize the latency to users and maximize fault tolerance. Catastrophic failure of multiple availability zones in a single region is possible, whether by accident or on purpose.

Resources

AWS supports user-defined names and tags to help identify resources.

Names

Resource names are intended to be unique and specific enough that they are unambiguous, without being so specific that they aren't safe to discuss in an open forum. The naming schema is...

// AWS account name schema
account = `${product || repo}-${environment}`
// AWS resource name schema
resource = `${account}-${datacenter}-${system}-${service}-${component}-${version}`

...where:

  • component - (optional) friendly name used to differentiate components, such as one virtual machine (VM) running a database and another VM running a web server.
  • datacenter - documented above.
  • environment - explained above.
  • service - shorthand for the AWS service containing this resource.
  • system - the larger system or deployment this resource is a part of.
  • version - (optional) semantic version of software deployed to this resource, used only when resources are deployed concurrently with different versions.

Here are some examples.

evm-mainnet-ap-api-vm-miner-v0.1.1
evm-testnet-us-explorer-lb
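
The schema above can be sketched as a small helper. This is an illustrative sketch only, not tooling from this repo: the function names `account_name` and `resource_name` are hypothetical, and it assumes optional fields are simply omitted along with their hyphen, as the two examples suggest.

```python
from typing import Optional


def account_name(product: str, environment: str) -> str:
    """Build an AWS account name, e.g. 'evm-mainnet'."""
    return f"{product}-{environment}"


def resource_name(
    account: str,
    datacenter: str,
    system: str,
    service: str,
    component: Optional[str] = None,
    version: Optional[str] = None,
) -> str:
    """Join the schema fields with hyphens, skipping optional fields that are unset."""
    parts = [account, datacenter, system, service, component, version]
    return "-".join(part for part in parts if part)


# Reproduces the two example names listed above.
print(resource_name(account_name("evm", "mainnet"), "ap", "api", "vm", "miner", "v0.1.1"))
print(resource_name(account_name("evm", "testnet"), "us", "explorer", "lb"))
```

Keeping the optional fields at the end of the name means a shorter name is always a prefix-compatible variant of a longer one, which makes resources easy to group when sorted.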

Tags

In addition to the default tags populated by AWS, resources are tagged with the following to provide traceability.

| Tag | Type | Deployment | Description |
| --- | --- | --- | --- |
| billing-use | Enum | All | Used for cost analysis in the management account (e.g. evm-api). |
| branch | String | Automated | The git branch containing the code for this resource, if any. |
| build | URL | Automated | The URL of the CI/CD build that deployed this resource. |
| commit | SHA-1 | Automated | The git commit containing the code for this resource, if any. |
| email | Email | All | The email address of the individual who deployed this resource. |
| env | Enum | All | The environment this resource belongs to (prod, staging, dev, etc.). |
| manual | Boolean | All | Whether this resource was deployed manually or by an automated system. |
| repo | URL | Any | The URL of the GitHub repository containing the code for this resource, if any. |
| tag | String | Any | The git tag containing the code for this resource, if any. |
| ticket | URL | Manual | The ticket authorizing this resource to be deployed. |

These tags can also be used as dimensions in the AWS cost analysis tool.
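
As a rough illustration of how the Deployment column could be checked, the sketch below reports which tags a resource is missing. It rests on several assumptions that are not stated in this document: that "All" means required for every deployment, that "Any" means optional, that the `manual` tag holds the string "true" or "false", and that the "if any" tags are still expected on automated deployments. The function name `missing_tags` is hypothetical.

```python
# Tags expected on every resource (the "All" rows of the table above).
COMMON_TAGS = {"billing-use", "email", "env", "manual"}
# Extra tags expected only for automated or only for manual deployments.
AUTOMATED_TAGS = {"branch", "build", "commit"}
MANUAL_TAGS = {"ticket"}


def missing_tags(tags: dict[str, str]) -> set[str]:
    """Return the tag keys this sketch expects but the resource lacks."""
    required = set(COMMON_TAGS)
    # Assumption: the `manual` tag is the string "true" or "false".
    if tags.get("manual", "").lower() == "true":
        required |= MANUAL_TAGS
    else:
        required |= AUTOMATED_TAGS
    return required - set(tags)


example = {
    "billing-use": "evm-api",
    "email": "someone@example.com",
    "env": "staging",
    "manual": "true",
}
print(missing_tags(example))  # the manual deployment above lacks a ticket tag
```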

System Architecture

Each environment contains the following systems.

| System | Architecture | Notes |
| --- | --- | --- |
| API | Web Application | |
| Bridge | Web Application | |
| Explorer | Web Application | |
| Faucet | External System | Testnet only. |
| Metrics | AWS CloudWatch | |
| Notifications | Event Handler | |

The web applications are all deployed using the exact same components, so the web application architecture will be documented once and any system-specific deviations will be described along the way.

Web Application

The web application documentation will start from the EOS EVM core software and work outwards to the Internet.

| Component | Scope |
| --- | --- |
| Server | Availability Zone |
| Security Group | Datacenter |
| Virtual Private Cloud (VPC) | Datacenter |
| Target Group | Datacenter |
| Health Check | Datacenter |
| Load Balancer | Datacenter |
| TLS Security Policy | Datacenter |
| X.509 Certificate | Datacenter |
| Global Accelerator | Global |
| Web Application Firewall | Datacenter |
| DNS (Route 53) | Global |
| Metrics (CloudWatch) | Datacenter |

Servers

The ENF Engineering team deploys the EOS EVM core software on a set of virtual machines (VMs) using Amazon EC2 instances. The Amazon Relational Database Service (RDS) is also used.

Note

2024-03-09 The author is not aware of any documentation describing the specific architecture or deployment process used for the public endpoint server infrastructure. However, the RPC API deployment does look somewhat like this.

Ports

Each VM exposes the application to the cloud network infrastructure through a network interface inside a virtual private cloud (VPC).

| System | Port | Protocol | Usage |
| --- | --- | --- | --- |
| API | 80<br/>8000 | TCP<br/>HTTP | API Traffic<br/>Health Check |
| Bridge | 80 | HTTP | Web Traffic<br/>Health Check |
| Explorer | 80 | HTTP | Web Traffic<br/>Health Check |

These ports are enforced by security groups, a simple network and transport layer firewall service native to AWS. Security groups sit outside the VMs, are required, and deny all inbound traffic by default.

Target Groups

A target group defines a set of targets, all virtual machines in this case, to receive application traffic. This includes the port and protocol to be used for both application traffic and for health checks. Target groups will only route application traffic to targets that have satisfied the health checks.

Health Checks

Health checks are performed on a per-VM basis according to a specific set of user-defined rules. An HTTP or HTTPS request is sent to the VM using the specified port and path. The VM must respond within the configured timeout with an accepted HTTP status code. Any payload included in the response is ignored.

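| System | Port | Path | Protocol | Status Code | Interval (seconds) | Timeout (seconds) | Success Threshold | Failure Threshold |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| API | 8000 | / | HTTP | 200-299 | 30 | 5 | 5 responses | 2 requests |
| Bridge | 80 | / | HTTP | 200-299 | 30 | 5 | 5 responses | 2 requests |
| Explorer | 80 | / | HTTP | 200-299 | 30 | 5 | 5 responses | 2 requests |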

A virtual machine must meet the success threshold using consecutive responses to transition into the healthy state and begin receiving traffic. The failure threshold is also determined using consecutive timeouts or bad status codes.
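
The consecutive-threshold behavior described above can be modeled as a small state machine. This is a simplified sketch for intuition, not how AWS implements health checks: the `HealthTracker` class is hypothetical, and it assumes a new target starts in the unhealthy state.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks one target's health using consecutive-result thresholds."""

    success_threshold: int = 5  # consecutive good responses needed to become healthy
    failure_threshold: int = 2  # consecutive failures needed to become unhealthy
    healthy: bool = False       # assumption: new targets start unhealthy
    streak: int = 0             # consecutive results that contradict the current state

    def record(self, ok: bool) -> bool:
        """Record one health check result and return the resulting state."""
        if ok == self.healthy:
            # The result agrees with the current state, so any opposing streak resets.
            self.streak = 0
            return self.healthy
        self.streak += 1
        needed = self.success_threshold if ok else self.failure_threshold
        if self.streak >= needed:
            self.healthy = ok
            self.streak = 0
        return self.healthy


tracker = HealthTracker()
print([tracker.record(True) for _ in range(5)])   # healthy only after the fifth success
print([tracker.record(False) for _ in range(2)])  # unhealthy again after the second failure
```

With the 30 second interval in the table, this model takes about two and a half minutes of clean responses to bring a VM into service, and about one minute of failures to take it out.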

Important

Global Accelerator also performs health checks, which are configured separately but are intentionally kept as similar as possible.

Load Balancers

An application load balancer (ALB) operates at the application layer of the OSI model to evenly distribute client traffic between one or more healthy targets, VMs in this case, as determined by target groups.

Tip

These ALBs use a routing algorithm known as "least outstanding requests" to distribute traffic among healthy targets within a target group. Each incoming request is routed to the healthy target that is currently serving the fewest outstanding requests. This helps optimize resource utilization and prevents any single target from being overwhelmed with traffic.
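
The core idea of "least outstanding requests" fits in a few lines. This is a toy model under stated assumptions, not the ALB implementation: the function name `pick_target` and the VM names are hypothetical, and ties are broken by dictionary order here, which may differ from what AWS does.

```python
def pick_target(outstanding: dict[str, int]) -> str:
    """Return the healthy target currently serving the fewest requests.

    `outstanding` maps each healthy target to its number of in-flight requests.
    """
    if not outstanding:
        raise ValueError("no healthy targets available")
    # min() returns the first minimum it sees, so ties go to the earliest entry.
    return min(outstanding, key=outstanding.get)


in_flight = {"vm-a": 3, "vm-b": 1, "vm-c": 2}
target = pick_target(in_flight)
in_flight[target] += 1  # the chosen target now has one more request in flight
print(target)
```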

evm-testnet-api-us-lb

Logically, ALBs map client requests from listeners to target groups according to user-defined rules.

| System | Listener | Rule |
| --- | --- | --- |
| API | 80 | HTTP 301 redirect to https://${host}:443/${path}?${query}. |
| | 443 | Forward to target group. |
| Bridge | 80 | HTTP 301 redirect to https://${host}:443/${path}?${query}. |
| | 443 | Forward to target group. |
| Explorer | 80 | HTTP 301 redirect to https://${host}:443/${path}?${query}. |
| | 443 | Forward to target group. |
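
To make the port 80 rule concrete, the sketch below computes the redirect a client would receive for a given request URL. It is only an approximation of the rule template: the function name `https_redirect` is hypothetical, and omitting the `?` when there is no query string is an assumption of this sketch.

```python
from urllib.parse import urlsplit


def https_redirect(url: str) -> tuple[int, str]:
    """Return the status code and Location header for a request to the port 80 listener."""
    parts = urlsplit(url)
    # Keep the host, path, and query; swap the scheme and port for HTTPS on 443.
    location = f"https://{parts.hostname}:443{parts.path or '/'}"
    if parts.query:
        location += f"?{parts.query}"
    return 301, location


print(https_redirect("http://explorer.evm.eosnetwork.com/tx/0xabc?tab=logs"))
# (301, 'https://explorer.evm.eosnetwork.com:443/tx/0xabc?tab=logs')
```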

Tip

Application load balancers sit behind a security group, a simple network and transport layer firewall native to AWS that enforces these ports, just like VM network interfaces.

ALBs have some additional attributes.

| System | Attribute | State | Description |
| --- | --- | --- | --- |
| API | HTTP/2 | On | Support HTTP/2, and use it by default. |
| | Drop invalid header fields | Off | Strip invalid HTTP headers from requests. |
| | Preserve HOST header | Off | Preserve client HOST HTTP header for server. |
| | X-Forwarded-For header | Append | Pack client information in the X-Forwarded-For HTTP header for the server. |
| | Client port preservation | On | Include client origin port in X-Forwarded-For header. |
| | TLS version and cipher headers | On | Add the x-amzn-tls-version and x-amzn-tls-cipher-suite headers to requests forwarded to the server. |
| Bridge | HTTP/2 | On | Support HTTP/2, and use it by default. |
| | Drop invalid header fields | Off | Strip invalid HTTP headers from requests. |
| | Preserve HOST header | Off | Preserve client HOST HTTP header for server. |
| | X-Forwarded-For header | Append | Pack client information in the X-Forwarded-For HTTP header for the server. |
| | Client port preservation | On | Include client origin port in X-Forwarded-For header. |
| | TLS version and cipher headers | On | Add the x-amzn-tls-version and x-amzn-tls-cipher-suite headers to requests forwarded to the server. |
| Explorer | HTTP/2 | On | Support HTTP/2, and use it by default. |
| | Drop invalid header fields | Off | Strip invalid HTTP headers from requests. |
| | Preserve HOST header | Off | Preserve client HOST HTTP header for server. |
| | X-Forwarded-For header | Append | Pack client information in the X-Forwarded-For HTTP header for the server. |
| | Client port preservation | On | Include client origin port in X-Forwarded-For header. |
| | TLS version and cipher headers | On | Add the x-amzn-tls-version and x-amzn-tls-cipher-suite headers to requests forwarded to the server. |

TLS

The EOS EVM public endpoints require clients to use HTTPS with TLS to connect. TLS termination is performed at the load balancers using an X.509 certificate from AWS Certificate Manager (ACM) according to rules defined by any one of several AWS-provided security policies.

Tip

During the TLS handshake, the server selects the highest TLS version that both it and the client support.
All TLS v1.3 and ECDHE-* cipher suites provide forward secrecy.

| System | Security Policy | TLS Version | Cipher Suite |
| --- | --- | --- | --- |
| API | ELBSecurityPolicy-TLS13-1-2-Res-2021-06 | TLS v1.2 | ECDHE-ECDSA-AES128-GCM-SHA256<br/>ECDHE-ECDSA-AES256-GCM-SHA384<br/>ECDHE-RSA-AES128-GCM-SHA256<br/>ECDHE-RSA-AES256-GCM-SHA384 |
| | | TLS v1.3 | TLS_AES_128_GCM_SHA256<br/>TLS_AES_256_GCM_SHA384<br/>TLS_CHACHA20_POLY1305_SHA256 |
| Bridge | ELBSecurityPolicy-TLS13-1-2-Res-2021-06 | TLS v1.2 | ECDHE-ECDSA-AES128-GCM-SHA256<br/>ECDHE-ECDSA-AES256-GCM-SHA384<br/>ECDHE-RSA-AES128-GCM-SHA256<br/>ECDHE-RSA-AES256-GCM-SHA384 |
| | | TLS v1.3 | TLS_AES_128_GCM_SHA256<br/>TLS_AES_256_GCM_SHA384<br/>TLS_CHACHA20_POLY1305_SHA256 |
| Explorer | ELBSecurityPolicy-TLS13-1-2-Res-2021-06 | TLS v1.2 | ECDHE-ECDSA-AES128-GCM-SHA256<br/>ECDHE-ECDSA-AES256-GCM-SHA384<br/>ECDHE-RSA-AES128-GCM-SHA256<br/>ECDHE-RSA-AES256-GCM-SHA384 |
| | | TLS v1.3 | TLS_AES_128_GCM_SHA256<br/>TLS_AES_256_GCM_SHA384<br/>TLS_CHACHA20_POLY1305_SHA256 |

Certificates

The load balancers obtain X.509 certificates from AWS Certificate Manager (ACM). These certificates are issued by Amazon Trust Services, a public certificate authority (CA) operated by Amazon Web Services, and are used by clients to verify the identity of the server.

Tip

Certificate authorities require proof of ownership over domain names in order to issue a certificate. This is accomplished with ACM by adding a CNAME DNS record for each domain name listed on a certificate. ACM will automatically renew the certificate as long as the CNAME records are present.

Certificates are issued separately for each region in each environment.

| Chain | Region | Certificate | Algorithm | Alternative Names |
| --- | --- | --- | --- | --- |
| mainnet | ap | evm.eosnetwork.com | ECDSA<br/>NIST P-384 | evm.eosnetwork.com<br/>api.evm.eosnetwork.com<br/>bridge.evm.eosnetwork.com<br/>explorer.evm.eosnetwork.com |
| | us | evm.eosnetwork.com | ECDSA<br/>NIST P-384 | evm.eosnetwork.com<br/>api.evm.eosnetwork.com<br/>bridge.evm.eosnetwork.com<br/>explorer.evm.eosnetwork.com |
| testnet | ap | testnet.evm.eosnetwork.com | ECDSA<br/>NIST P-384 | testnet.evm.eosnetwork.com<br/>api.testnet.evm.eosnetwork.com<br/>bridge.testnet.evm.eosnetwork.com<br/>explorer.testnet.evm.eosnetwork.com<br/>faucet.testnet.evm.eosnetwork.com |
| | us | testnet.evm.eosnetwork.com | ECDSA<br/>NIST P-384 | testnet.evm.eosnetwork.com<br/>api.testnet.evm.eosnetwork.com<br/>bridge.testnet.evm.eosnetwork.com<br/>explorer.testnet.evm.eosnetwork.com<br/>faucet.testnet.evm.eosnetwork.com |

All domain names used across all systems are listed on each certificate so that if engineering shares resources between systems, such as calling the API or embedding the bridge on the explorer site, it will not violate browser content security policy or same-origin policy.

Global Accelerator

AWS Global Accelerator is used to improve the performance and availability of the EOS EVM public endpoints. It ingests client traffic at the nearest edge location, then routes it to the closest healthy endpoint over private fiber as priority traffic instead of traversing the public Internet.

Global Accelerator determines where to send client requests according to the following rules.

  1. Proximity-Based Routing - route traffic to the datacenter nearest to the client in terms of network latency.
  2. Health-Based Routing - route traffic to the nearest healthy endpoint, providing automatic global failover in the event of partial service outage.
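
Taken together, the two rules amount to "nearest healthy datacenter wins." The sketch below is a toy model for intuition only: the function name `route` and the latency figures are hypothetical, and the real service measures proximity and health internally.

```python
def route(latency_ms: dict[str, float], healthy: set[str]) -> str:
    """Pick the lowest-latency datacenter that is currently healthy."""
    # Health-based routing: drop datacenters that are failing health checks.
    candidates = {dc: ms for dc, ms in latency_ms.items() if dc in healthy}
    if not candidates:
        raise RuntimeError("no healthy datacenter available")
    # Proximity-based routing: choose the nearest remaining datacenter.
    return min(candidates, key=candidates.get)


client_latency = {"ap": 40.0, "us": 180.0}  # hypothetical client in Asia
print(route(client_latency, healthy={"ap", "us"}))  # ap
print(route(client_latency, healthy={"us"}))        # us, after failover
```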

Note

Global Accelerator is not a content delivery network (CDN).
Work is planned to migrate the bridge and the explorer to a CDN.

Amazon claims:

For TCP traffic, as measured by third-party real user measurement tools at the 90th percentile (p90), Global Accelerator decreases first byte latency by up to 49%, jitter by up to 58%, and improves throughput by up to 60%. [...] "By enabling AWS Global Accelerator, one [...] customer saw a 51.2% reduction in mean end-to-end app load times." [Another customer] was able to "decrease response time from more than 200 milliseconds to less than 4 milliseconds, a 98% improvement."

Tip

AWS offers a nifty website that enables the user to race requests of various sizes over Global Accelerator against the public Internet to various regions around the world.

Global Accelerator offers some configuration options.

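| System | Listener Port | Listener Protocol | Target Load Balancer | Target Weight | Health Check Port | Health Check Protocol | Health Check Interval (seconds) | Health Check Threshold |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| API | 80<br/>443 | TCP | *-ap-api-lb | 100 | 8000 | TCP | 30 | 3 requests |
| | | | *-us-api-lb | 100 | | | | |
| Bridge | 80<br/>443 | TCP | *-ap-bridge-lb | 100 | 80 | TCP | 30 | 3 requests |
| | | | *-us-bridge-lb | 100 | | | | |
| Explorer | 80<br/>443 | TCP | *-ap-explorer-lb | 100 | 80 | TCP | 30 | 3 requests |
| | | | *-us-explorer-lb | 100 | | | | |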

Notice that, unlike load balancer and target group health checks, the success and failure thresholds are configured together.

Web Application Firewall

A web application firewall (WAF) is a managed service that protects web applications by using heuristics to block malicious traffic which could affect availability, compromise security, or consume excessive resources. This can include mitigation for distributed denial-of-service (DDoS) attacks, cross-site scripting (XSS), and SQL injection attacks. The WAF can be enforced at the edge or at the load balancer.

Tip

In the AWS WAF web UI, the WAFs are called web access control lists (web ACLs).

DNS

The Domain Name System (DNS) is a distributed system that translates human-readable domain names into IP addresses. The EOS EVM public endpoints use Amazon Route 53 to manage DNS records.

Control over evm.eosnetwork.com and all subdomains is delegated to the evm-mainnet AWS account by an external system. This account contains all DNS records for the mainnet endpoints along with records for certificate validation.

The evm-mainnet account delegates control over testnet.evm.eosnetwork.com and all subdomains to the evm-testnet AWS account, which contains all DNS records for the testnet endpoints and certificate validation.
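
The delegation chain above means the most specific zone decides which account holds a record. A minimal sketch of that lookup follows; the `ZONES` mapping and the function name `owning_account` are illustrative only and do not describe any real tooling.

```python
# Zone apex mapped to the AWS account that hosts it, per the text above.
ZONES = {
    "evm.eosnetwork.com": "evm-mainnet",
    "testnet.evm.eosnetwork.com": "evm-testnet",
}


def owning_account(name: str) -> str:
    """Return the AWS account whose zone is authoritative for a domain name."""
    name = name.rstrip(".").lower()
    matches = [zone for zone in ZONES if name == zone or name.endswith("." + zone)]
    if not matches:
        raise LookupError(f"{name} is outside the delegated zones")
    # The most specific (longest) zone wins, mirroring DNS delegation.
    return ZONES[max(matches, key=len)]


print(owning_account("api.evm.eosnetwork.com"))             # evm-mainnet
print(owning_account("faucet.testnet.evm.eosnetwork.com"))  # evm-testnet
```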

Faucet

The EOS EVM testnet faucet is operated by EOS Nation and is hosted on their cloud infrastructure. As such, the only responsibility of the evm-testnet AWS account is to provide DNS records for the faucet.

Metrics

The EOS EVM public endpoints use Amazon CloudWatch for metrics, dashboards, and alarms. CloudWatch is an AWS-managed service with built-in integrations to all of the other managed services used by the public endpoints, so a large number of metrics are vacuumed up by default. This includes metrics on health checks, server resource utilization, endpoint traffic analysis, the nature of malicious traffic blocked by the web application firewall, and more. System and application logs can be ingested and analyzed for an additional fee.

Tip

Metrics are not currently exported to any platform-agnostic systems such as Prometheus and Grafana, but this was originally on the roadmap and will be done as the need arises.

Cost analysis is performed using the AWS Cost Explorer tool, which can be filtered by the tags described above.

Alarms

CloudWatch alarms are used to notify stakeholders when specific metrics cross specific thresholds.

| Chain | System | DC | Alarm | Fault Condition |
| --- | --- | --- | --- | --- |
| mainnet | API | ap | evm-mainnet-ap-api-tg_sick-hosts | UnHealthyHostCount ≥ 1 for 1 datapoint(s) within 1 minute. |
| | | us | evm-mainnet-us-api-tg_sick-hosts | UnHealthyHostCount ≥ 1 for 1 datapoint(s) within 1 minute. |
| | Bridge | us | evm-mainnet-us-bridge-tg_sick-hosts | UnHealthyHostCount ≥ 1 for 1 datapoint(s) within 1 minute. |
| | Explorer | us | evm-mainnet-us-explorer-tg_sick-hosts | UnHealthyHostCount ≥ 1 for 1 datapoint(s) within 1 minute. |
| testnet | API | ap | evm-testnet-ap-api-tg_sick-hosts | UnHealthyHostCount ≥ 1 for 1 datapoint(s) within 1 minute. |
| | | us | evm-testnet-us-api-tg_sick-hosts | UnHealthyHostCount ≥ 1 for 1 datapoint(s) within 1 minute. |
| | Bridge | us | evm-testnet-us-bridge-tg_sick-hosts | UnHealthyHostCount ≥ 1 for 1 datapoint(s) within 1 minute. |
| | Explorer | us | evm-testnet-us-explorer-tg_sick-hosts | UnHealthyHostCount ≥ 1 for 1 datapoint(s) within 1 minute. |

All alarms defined in CloudWatch are automatically "seen" and handled by the notification system.
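
The fault conditions in the table above can be read as "at least N breaching datapoints in the evaluation window." The sketch below models that reading; it is a simplification of CloudWatch's evaluation, the function name `alarm_state` is hypothetical, and treating an empty window as INSUFFICIENT_DATA is an assumption about how missing data is configured.

```python
def alarm_state(
    datapoints: list[float],
    threshold: float = 1,
    datapoints_to_alarm: int = 1,
) -> str:
    """Evaluate one window of UnHealthyHostCount datapoints against the threshold."""
    if not datapoints:
        # Assumption: missing data is reported rather than treated as OK.
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in datapoints if value >= threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


print(alarm_state([0]))  # OK: no unhealthy hosts in the last minute
print(alarm_state([1]))  # ALARM: one unhealthy host is enough
```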

Notifications

The EOS EVM public endpoint cloud infrastructure can automatically notify stakeholders about system health or security events using any combination of email, instant messaging (IM), and SMS.

CloudWatch automatically ingests metrics from all AWS managed services used by the public endpoints. These metrics can be used to define alarms. All alarms defined in CloudWatch are automatically "seen" and handled by the notification system.

This system uses Amazon EventBridge to ingest CloudWatch alarm state-change events and publish them to a Simple Notification Service (SNS) "topic," which behaves like a queue. Each event contains a JSON payload with information about the alarm and the context of the state change. The aws-cloudwatch-alarm-handler is an AWS Lambda function subscribed to this topic that takes the machine-readable event information, generates a human-readable message about the alarm, then publishes this message to a second SNS topic called notify-evm. This topic uses AWS-provided services to send the human-readable message out via email and SMS. Finally, the telegram-bot Lambda function is subscribed to the notify-evm topic and sends the message to stakeholders via instant messaging (IM).

---
title: Notification System
---
flowchart TB
    subgraph cw["`**CloudWatch**`"]
        direction LR
        metric["📊 Metric"]
        alarm["🚨 Alarm"]
        metric -.- alarm
    end

    alarm ---> |📈<br/>alarm state-change event| eventBus

    subgraph eb["`**EventBridge**`"]
        direction LR
        eventBus["🚏 Event Bus<br/><br/><code>default</code>"]
        rule["📜 Rule"]
        eventBus -.- rule
    end

    eventBus ---> |📨| topic1
    rule -.-x |source| cw
    rule -.-x |target| topic1

    subgraph sns1["`**SNS**`"]
        direction TB
        topic1["🟡 Topic<br/><br/>cloudwatch-alarm-state-change-event"]
        subscription1["🗞️ Subscription"]
        topic1 -.- subscription1
    end

    topic1 ---> |💽<br/>machine-readable event| function1
    subscription1 -.-> function1

    subgraph lambda1["`**Lambda**`"]
        direction TB
        function1["📖 Function<br/><br/><code>aws-cloudwatch-alarm-handler</code>"]
        config1["Config"]
        function1 -.- config1
    end

    function1 ---> |📄<br/>formatted text| topic2

    subgraph sns2["`**SNS**`"]
        direction TB
        topic2["🟪 Topic<br/><br/>notify-evm"]
        subscription2["🗞️ Subscription"]
        topic2 -.- subscription2
    end

    topic2 ---> |📄| function2
    topic2 ---> |📧| email
    topic2 ---> |📶| sms
    subscription2 -.-> function2

    subgraph lambda2["`**Lambda**`"]
        direction TB
        function2["🤖 Function<br/><br/><code>telegram-bot</code>"]
        config2["Config"]
        function2 -.- config2
    end

    function2 ---> |💬| im

    subgraph stakeholders["`**Stakeholders**`"]
        direction TB
        email["📥 Email"]
        sms["📲 SMS"]
        im["💻 IM"]
    end

    email -.-> |👤| subscription2
    sms -.-> |👤| subscription2
    im -.-> |🗝️👤| config2

    alarm ~~~ rule

For more information about the notification system components, including examples of the JSON payload and human-readable messages, check out the aws-cloudwatch-alarm-handler and telegram-bot GitHub repositories.
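
For a feel of the machine-readable to human-readable step, here is a minimal sketch of a formatter. It is not the code from aws-cloudwatch-alarm-handler: the function name `format_alarm` and the message layout are hypothetical, and the field names assume the standard shape of an EventBridge "CloudWatch Alarm State Change" event delivered to Lambda through SNS.

```python
import json


def format_alarm(sns_event: dict) -> str:
    """Turn an SNS-wrapped alarm state-change event into a one-line message."""
    # SNS hands the original EventBridge event to Lambda as a JSON string.
    event = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    detail = event["detail"]
    return (
        f"[{detail['state']['value']}] {detail['alarmName']} "
        f"(was {detail['previousState']['value']}) in {event['region']}: "
        f"{detail['state']['reason']}"
    )


# Hypothetical payload with placeholder values, trimmed to the fields used above.
sample = {
    "Records": [{"Sns": {"Message": json.dumps({
        "region": "example-region",
        "detail": {
            "alarmName": "evm-testnet-us-api-tg_sick-hosts",
            "state": {"value": "ALARM", "reason": "Threshold Crossed"},
            "previousState": {"value": "OK"},
        },
    })}}]
}
print(format_alarm(sample))
```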

Deployment Strategy

Infrastructure changes are always deployed, one at a time, as follows.

  1. A maintenance window is scheduled with stakeholders, during which no other changes are taking place.
    • This guarantees all stakeholders are informed.
    • This reduces the number of independent variables, minimizing the time to resolution should service degradation be observed.
  2. Testnet endpoint functionality is verified using a virtual private network (VPN) to perform smoke tests against all affected endpoints, each from a number of different cities.
    • The cities selected must exercise all datacenters.
    • The set of cities should be large, to exercise content delivery networks (CDNs) or other edge compute.
    • The cities used and results observed must be written down so the tests can be reproduced.
    • If any tests fail then the deployment must be deferred until the system is in a known-good state.
  3. Changes are deployed to the testnet staging environment.
  4. Testnet endpoint functionality is validated using smoke tests from the same cities as before.
  5. A waiting period is observed.
    • This gives the community time to identify and report bugs.
    • This should be two business days to one week, and must be no less than twenty-four (24) hours.
  6. Mainnet endpoint functionality is verified using smoke tests from a set of cities meeting the criteria above.
  7. Changes are deployed to the mainnet production environment.
  8. Mainnet endpoint functionality is validated using smoke tests from the same cities as before.

If service degradation is observed at any point in this process then all changes must be reverted, and the process must start over.
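
Two of the gates in this process are easy to check mechanically: that the smoke-test cities exercise every datacenter, and that the minimum waiting period has elapsed. The sketch below shows one possible way to express them; the function names and city names are hypothetical, and no such tooling is described in this document.

```python
from datetime import datetime, timedelta

MIN_WAIT = timedelta(hours=24)  # hard minimum from step 5


def cities_cover_datacenters(cities: dict[str, str], datacenters: set[str]) -> bool:
    """Check that the smoke-test cities exercise every datacenter.

    `cities` maps a city name to the datacenter its traffic is expected to reach.
    """
    return datacenters <= set(cities.values())


def waiting_period_elapsed(testnet_deployed: datetime, now: datetime) -> bool:
    """Check that at least the minimum waiting period has passed since step 3."""
    return now - testnet_deployed >= MIN_WAIT


plan = {"Tokyo": "ap", "Sydney": "ap", "Chicago": "us"}  # hypothetical city list
print(cities_cover_datacenters(plan, {"ap", "us"}))
print(waiting_period_elapsed(datetime(2024, 3, 7, 9, 0), datetime(2024, 3, 8, 8, 0)))
```

In the usage above, the city list covers both datacenters, but only 23 hours have passed since the testnet deployment, so the mainnet deployment would have to wait.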

See Also

More resources.


Legal Notice
This repo contains assets created in collaboration with a large language model, machine learning algorithm, or weak artificial intelligence (AI). This notice is required in some countries.