Add exostellar infrastructure optimizer playbook
Configure XIO

Resolves #226
cartalla committed Oct 16, 2024
1 parent 5da1512 commit d21bf32
Showing 31 changed files with 1,810 additions and 8 deletions.
9 changes: 7 additions & 2 deletions docs/deployment-prerequisites.md
@@ -6,7 +6,7 @@ This page shows common prerequisites that need to be done before deployment.

The deployment process was developed and tested using Amazon Linux 2.
It has also been tested on RHEL 8 and RHEL 9.
An easy way to create a deployment instance is to use an AWS Cloud 9 desktop.
An easy way to create a deployment instance is to use AWS CloudShell.
This will give you a code editor IDE and shell environment that you can use to deploy the cluster.

If the required packages aren't installed then you will need sudo or root access on the instance.
@@ -20,7 +20,7 @@ You will need AWS credentials that provide admin access to deploy the cluster.
Clone or download the aws-eda-slurm-cluster repository to your system.

```
git clone git@github.com:aws-samples/aws-eda-slurm-cluster.git
git clone https://github.com/aws-samples/aws-eda-slurm-cluster.git
```

## Create SNS Topic for Error Notifications (Optional but recommended)
@@ -229,6 +229,11 @@ when you create the file system.
If the file system already exists, then attach the appropriate security group to the network interfaces of
the file systems.

## Create Exostellar Management Server

If you're going to use Exostellar Infrastructure Optimizer (XIO) then you will need to deploy the Exostellar management server.
See the [XIO page](../exostellar-infrastructure-optimizer) for details.

## Create Configuration File

Before you deploy a cluster you need to create a configuration file.
192 changes: 192 additions & 0 deletions docs/exostellar-infrastructure-optimizer.md
@@ -0,0 +1,192 @@
# Exostellar Infrastructure Optimizer

[Exostellar Infrastructure Optimizer](https://exostellar.io/infrastructureoptimizer-technical-information/) (XIO) runs applications in virtual machines (VMs) on EC2 instances and dynamically migrates the VMs between instances based on availability and cost.
Long-running, stateful jobs are not normally run on spot instances because of the risk of lost work after a spot termination.
XIO reduces this risk by predicting spot terminations and migrating the VM to another instance with higher availability.
This could be a different spot instance type or an on-demand instance.
When spot capacity becomes available again, the VM can be migrated back to a spot instance.
This allows you to save up to 90% over on-demand pricing by running on spot when capacity is available.
You increase the potential for savings by configuring as many spot capacity pools as possible.
This doesn't completely eliminate the risk of the job failing.
The job will still fail and need to be restarted from the beginning if a spot termination isn't predicted far enough in advance for the job to be migrated, or if a new instance cannot be launched for the migration.

XIO runs on an Exostellar Management Server (EMS).
The EMS runs a web application and launches and manages the instances that run jobs.
In response to job requests it launches controller nodes that manage pools of worker nodes.
The controller launches workers and then starts one or more VMs on the workers.
The controller also determines when VMs need to be migrated, allocates new workers, and manages the VM migrations.

You create an XIO Application Environment for each Slurm cluster.
The Application Environment contains the URL for the Slurm head node,
configures pools of VMs,
and configures the path to the Slurm binaries and configuration.
The VM pools define the attributes of the instances including the number of CPUs, VM Image, min and max memory, and an associated XIO Profile.

You must also create the XIO Profiles that are used by the VM Pools.
Each profile configures XIO Controllers and XIO Workers.
The Workers run the XIO VMs.
The Controller manages the workers and the VMs that run on them.
The Worker configuration includes the instance types to use for
on-demand and spot instances.
It also includes the security groups and tags for the worker instances.

You must also create XIO Images that are used to create the VMs.
The Images are created from AWS AMIs and are specified in the VM Pools.

**NOTE:** One current restriction of XIO VMs is that they cannot be created from ParallelCluster AMIs.
This is because the kernel modules that ParallelCluster installs aren't supported by the XIO hypervisor.

## XIO Configuration

This section describes the process of configuring XIO to work with ParallelCluster.

Refer to [Exostellar's documentation](https://docs.exostellar.io/latest/Latest/HPC-User/getting-started-installation) to make sure you have the latest instructions.

### Deploy ParallelCluster without configuring XIO

First deploy your cluster without configuring XIO.
The cluster deployment installs Ansible playbooks that will later be used to create the XIO ParallelCluster AMI.
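
A minimal sketch of that initial deployment, assuming you use the repository's `install.sh` wrapper and a configuration file that does not yet contain an `Xio` section (the file path is an example):

```
# Deploy the cluster before adding any XIO configuration.
# The config file path is an example; adjust it to your environment.
cd aws-eda-slurm-cluster
./install.sh --config-file ~/slurm-cluster-config.yml --cdk-cmd create
```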

### Install the Exostellar Management Server

The first step is to [install the Exostellar management server](https://docs.exostellar.io/latest/Latest/HPC-User/installing-management-server).
Exostellar will provide a link to a CloudFormation template that
will deploy the server in your account.

### Create XIO Configuration

The next step is to plan and configure your XIO deployment.
The key decisions that you must make are the instance types that you will use
and the AMI that you will use for the XIO VM Images.

XIO currently only supports x86_64 instance types and pools cannot mix AMD and Intel instance types.
The following XIO configuration for aws-eda-slurm-cluster shows two pools, one with Intel instances and one with AMD instances.
Note that we first define the XIO Profiles, each using instance types with the same manufacturer, number of cores, and amount of memory.
Then we configure pools for the Application Environment that use those profiles.
The number after each instance type is a priority that biases XIO toward higher-priority instance types when they are available.
We've chosen to prioritize the latest generation instance types so that our jobs run faster, and we configure
older generation instance types at a lower priority to increase the number of capacity pools so that
we have a better chance of running on spot and of having instances to run our jobs.
Refer to [Best practices for Amazon EC2 Spot](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-best-practices.html) when planning your cluster deployment and creating your configuration.

It is highly recommended to use [EC2 Spot placement scores](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/work-with-spot-placement-score.html) when selecting the region and availability zone for your cluster.
This will give you an indication of the likelihood of getting desired spot capacity.
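
For example, one way to check the scores for some of the Intel instance types used in the configuration below (the target capacity is illustrative):

```
# Query Spot placement scores for a few of the Intel instance types in the profile below.
aws ec2 get-spot-placement-scores \
    --instance-types r6i.12xlarge r6id.12xlarge r7i.12xlarge \
    --target-capacity 10 \
    --single-availability-zone \
    --region-names us-east-2
```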

```
slurm:
  Xio:
    ManagementServerStackName: exostellar-management-server
    PartitionName: xio
    AvailabilityZone: us-east-2b
    Profiles:
      - ProfileName: intel24core350g
        NodeGroupName: intel24core350g
        MaxControllers: 10
        InstanceTypes:
          - r5.12xlarge:1
          - r5d.12xlarge:2
          - r6i.12xlarge:3
          - r6id.12xlarge:4
          - r7i.12xlarge:5
          - r7iz.12xlarge:6
        SpotFleetTypes:
          - r5.12xlarge:1
          - r5d.12xlarge:2
          - r6i.12xlarge:3
          - r6id.12xlarge:4
          - r7i.12xlarge:5
          - r7iz.12xlarge:6
        EnableHyperthreading: false
      - ProfileName: amd24core350g
        NodeGroupName: amd24core350g
        MaxControllers: 10
        InstanceTypes:
          - r5a.12xlarge:1
          - r5ad.12xlarge:2
          - r6a.12xlarge:3
          - r7a.12xlarge:5
        SpotFleetTypes:
          - r5a.12xlarge:1
          - r5ad.12xlarge:2
          - r6a.12xlarge:3
          - r7a.12xlarge:5
        EnableHyperthreading: false
    Pools:
      - PoolName: intel-24core-350G
        ProfileName: intel24core350g
        ImageName: res-demo-pc-3-10-1-rhel8-x86
        PoolSize: 10
        CPUs: 24
        MinMemory: 350000
        MaxMemory: 350000
      - PoolName: amd-24core-350G
        ProfileName: amd24core350g
        ImageName: res-demo-pc-3-10-1-rhel8-x86
        PoolSize: 10
        CPUs: 24
        MinMemory: 350000
        MaxMemory: 350000
```

### Create XIO Profiles

In the EMS GUI, copy the existing az1 profile to create each of the profiles that you configured.
Only the name matters;
the deployment will update the profile contents automatically from your configuration.


### Create the Application Environment

In the EMS GUI, copy the **slurm** Application Environment to a new environment with the same
name as your ParallelCluster cluster.
The deployment will update the application environment from your configuration.

### Create an XIO ParallelCluster AMI

Launch an instance using the base AMI for your OS.
For example, launch an instance with a base RHEL 8 or Rocky 8 AMI.

Mount the ParallelCluster NFS file system at /opt/slurm.
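
A sketch of the mount, assuming the head node exports /opt/slurm over NFS and is reachable from the instance (the host name is a placeholder):

```
# Mount the ParallelCluster NFS export from the head node (host name is a placeholder).
sudo mkdir -p /opt/slurm
sudo mount -t nfs <head-node-ip-or-dns>:/opt/slurm /opt/slurm
```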

Run the ansible playbook to configure the instance for XIO.

```
/opt/slurm/config/bin/xio-compute-node-ami-configure.sh
```

Do any additional configuration that you require, such as configuring file system mounts and installing
packages.

Create an AMI from the instance and wait for it to become available.
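
For example, from a shell with AWS credentials, assuming the instance ID of the configured instance (the instance ID and AMI name are placeholders):

```
# Create an AMI from the configured instance and wait for it to become available.
AMI_ID=$(aws ec2 create-image \
    --instance-id i-0123456789abcdef0 \
    --name xio-parallelcluster-rhel8-x86_64 \
    --query ImageId --output text)
aws ec2 wait image-available --image-ids "$AMI_ID"
```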

### Update the cluster with the XIO configuration

Update the cluster with the XIO configuration.

This will update the profiles and environment on the EMS server and configure the cluster for XIO.
The only remaining step before you can submit jobs is to create the XIO VM image.
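
A sketch of the update, assuming the same `install.sh` wrapper and configuration file used for the initial deployment, now with the `Xio` section added:

```
# Re-run the installer after adding the Xio section to the configuration file (path is an example).
./install.sh --config-file ~/slurm-cluster-config.yml --cdk-cmd update
```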

### Create an XIO Image from the XIO ParallelCluster AMI

Connect to the head node and create the XIO Image from the AMI you created.
The IMAGE-NAME should be the same as the ImageName that you configured in the Pools.

```
/opt/slurm/etc/exostellar/parse_helper.sh -a <AMI-ID1> -i <IMAGE-NAME>
```
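
For example, using the image name configured in the Pools above and a placeholder AMI ID:

```
# The AMI ID is a placeholder; the image name matches ImageName in the Pools configuration.
/opt/slurm/etc/exostellar/parse_helper.sh -a ami-0123456789abcdef0 -i res-demo-pc-3-10-1-rhel8-x86
```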

### Test launching an XIO VM

Connect to the head node and test launching a VM.
The pool, profile, and image name should come from your configuration.
The host name doesn't matter.

```
/opt/slurm/etc/exostellar/test_createVm.sh --pool <pool> --profile <profile> -i <image name> -h <host>
```
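
For example, using the pool, profile, and image name from the configuration above (the host name is arbitrary):

```
/opt/slurm/etc/exostellar/test_createVm.sh --pool intel-24core-350G --profile intel24core350g -i res-demo-pc-3-10-1-rhel8-x86 -h xio-test-host
```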

### Run a test job using Slurm

```
srun --pty -p xio-
```
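
A minimal sketch, assuming the partition is named xio as in the configuration above and that a pool provides 24 CPUs:

```
# Request an interactive shell on an XIO VM (the partition name comes from your configuration).
srun --pty -p xio -c 24 /bin/bash

# Or submit a short batch job to verify that a VM launches and joins the cluster.
sbatch -p xio -c 24 --wrap 'hostname; sleep 60'
```
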
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -9,6 +9,7 @@ nav:
- 'config.md'
- 'res_integration.md'
- 'soca_integration.md'
- 'exostellar-infrastructure-optimizer.md'
- 'custom-amis.md'
- 'run_jobs.md'
- 'job_preemption.md'