-
Notifications
You must be signed in to change notification settings - Fork 7
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add exostellar infrastructure optimizer playbook
Configure XIO Resolves #226
- Loading branch information
Showing
31 changed files
with
1,810 additions
and
8 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,192 @@ | ||
# Exostellar Infrastructure Optimizer | ||
|
||
[Exostellar Infrastructure Optimizer](https://exostellar.io/infrastructureoptimizer-technical-information/) (XIO) runs applications in virtual machines (VMs) on EC2 instances and dynamically migrates the VMs between instances based on availability and cost. | ||
Long-running, stateful jobs are not normally run on spot instances because of the risk of lost work after a spot termination. | ||
XIO reduces this risk by predicting spot terminations and migrating the VM to another instance with higher availability. | ||
This could be a different spot instance type or an on-demand instance. | ||
When spot capacity becomes available again, the VM can be migrated back to a spot instance. | ||
This allows you to save up to 90% over on-demand pricing by running on spot when capacity is available. | ||
You increase the potential for savings by configuring as many spot capacity pools as possible. | ||
This doesn't completely eliminate the risk of the job failing. | ||
The job will still fail and need to be restarted from the beginning if a spot termination isn't predicted far enough in advance for the job to be migrated or if a new instance cannot be launched to migrate the job to. | ||
|
||
XIO runs on an Exostellar Management Server (EMS). | ||
The EMS runs a web application and launches and manages the instances that run jobs. | ||
In response to job requests it launches controller nodes that manage pools of worker nodes. | ||
The controller launches workers and then starts one or more VMs on the workers. | ||
The controller also determines when VMs need to be migrated, allocates new workers, and manages the VM migrations. | ||
|
||
You create an XIO Application Environment for each Slurm cluster. | ||
The Application Environment contains the URL for the Slurm head node, | ||
configures pools of VMs, | ||
and configures the path to the Slurm binaries and configuration. | ||
The VM pools define the attributes of the instances including the number of CPUs, VM Image, min and max memory, and an associated XIO Profile. | ||
|
||
You must also create the XIO Profiles that are used by the VM Pools. | ||
Each profile configures XIO Controllers and XIO Workers. | ||
The Workers run the XIO VMs. | ||
The Controller manages the workers and the VMs that run on them. | ||
The Worker configuration includes the instance types to use for | ||
on-demand and spot instances. | ||
It also includes the security groups and tags for the worker instances. | ||
|
||
You must also create XIO Images that are used to create the VMs. | ||
The Images are created from AWS AMIs and are specified in the VM Pools. | ||
|
||
**NOTE:** One current restriction of XIO VMs is that they cannot be created from ParallelCluster AMIs. | ||
This is because the kernel modules that ParallelCluster installs aren't supported by the XIO hypervisor. | ||
|
||
## XIO Configuration | ||
|
||
This section will escribe the process of configuring XIO to work with ParallelCluster. | ||
|
||
Refer to [Exostellar's documentation](https://docs.exostellar.io/latest/Latest/HPC-User/getting-started-installation) to make sure you have the latest instructions. | ||
|
||
### Deploy ParallelCluster without configuring XIO | ||
|
||
First deploy your cluster without configuring XIO. | ||
The cluster deploys ansible playbooks that will be used to create the XIO ParallelCluster AMI. | ||
|
||
### Install the Exostellar Management Server | ||
|
||
The first step is to [install the Exostellar management server](https://docs.exostellar.io/latest/Latest/HPC-User/installing-management-server). | ||
Exostellar will provide a link to a CloudFormation template that | ||
will deploy the server in your account. | ||
|
||
### Create XIO Configuration | ||
|
||
The next step is to plan and configure your XIO deployment. | ||
The key decisions that you must make are the instance types that you will use | ||
and the AMI that you will use for the XIO VM Images. | ||
|
||
XIO currently only supports x86_64 instance types and pools cannot mix AMD and Intel instance types. | ||
The following XIO configuration for aws-eda-slurm-cluster shows 2 pools that contain Intel and AMD instances. | ||
Note that we first define the XIO Profiles with instance types with the same manufacturer, number of cores, and amount of memory. | ||
Then we configure pools for the Application Environment that use the profiles. | ||
The numbers after the instance type are a priority to bias XIO to use higher priority instance types if they are available. | ||
We've chose to prioritize the latest generation instance types so our jobs run faster and configure | ||
older generation instance types at a lower priority to increase the number of capacity pools so that | ||
we have a better chance of running on spot and having instances to run our jobs. | ||
Refer to [Best practices for Amazon EC2 Spot](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-best-practices.html) when planning your cluster deployment and creating your configuration. | ||
|
||
It is highly recommended to use [EC2 Spot placement scores](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/work-with-spot-placement-score.html) when selecting the region and availability zone for your cluster. | ||
This will give you an indication of the likelihood of getting desired spot capacity. | ||
|
||
``` | ||
slurm: | ||
Xio: | ||
ManagementServerStackName: exostellar-management-server | ||
PartitionName: xio | ||
AvailabilityZone: us-east-2b | ||
Profiles: | ||
- ProfileName: intel24core350g | ||
NodeGroupName: intel24core350g | ||
MaxControllers: 10 | ||
InstanceTypes: | ||
- r5.12xlarge:1 | ||
- r5d.12xlarge:2 | ||
- r6i.12xlarge:3 | ||
- r6id.12xlarge:4 | ||
- r7i.12xlarge:5 | ||
- r7iz.12xlarge:6 | ||
SpotFleetTypes: | ||
- r5.12xlarge:1 | ||
- r5d.12xlarge:2 | ||
- r6i.12xlarge:3 | ||
- r6id.12xlarge:4 | ||
- r7i.12xlarge:5 | ||
- r7iz.12xlarge:6 | ||
EnableHyperthreading: false | ||
- ProfileName: amd24core350g | ||
NodeGroupName: amd24core350g | ||
MaxControllers: 10 | ||
InstanceTypes: | ||
- r5a.12xlarge:1 | ||
- r5ad.12xlarge:2 | ||
- r6a.12xlarge:3 | ||
- r7a.12xlarge:5 | ||
SpotFleetTypes: | ||
- r5a.12xlarge:1 | ||
- r5ad.12xlarge:2 | ||
- r6a.12xlarge:3 | ||
- r7a.12xlarge:5 | ||
EnableHyperthreading: false | ||
Pools: | ||
- PoolName: intel-24core-350G | ||
ProfileName: intel24core350G | ||
ImageName: res-demo-pc-3-10-1-rhel8-x86 | ||
PoolSize: 10 | ||
CPUs: 24 | ||
MinMemory: 350000 | ||
MaxMemory: 350000 | ||
- PoolName: amd-24core-350G | ||
ProfileName: amd24core350G | ||
ImageName: res-demo-pc-3-10-1-rhel8-x86 | ||
PoolSize: 10 | ||
CPUs: 24 | ||
MinMemory: 350000 | ||
MaxMemory: 350000 | ||
``` | ||
|
||
### Create XIO Profiles | ||
|
||
In the EMS GUI copy the existing az1 profile to the profiles that you configured. | ||
The name is all that matters. | ||
The deployment will update the profile automatically from your configuration. | ||
|
||
|
||
### Create the Application Environment | ||
|
||
In the EMS GUI copy the **slurm** Application Environment to a new environment that is the same | ||
name as your ParallelCluster cluster. | ||
The deployment will update the application environment from your configuration. | ||
|
||
### Create an XIO ParallelCluster AMI | ||
|
||
Launch an instance using the base AMI for your OS. | ||
For example, launch an instance with a base RHEL 8 or Rocky 8 AMI. | ||
|
||
Mount the ParallelCluster NFS file system at /opt/slurm. | ||
|
||
Run the ansible playbook to configure the instance for XIO. | ||
|
||
``` | ||
/opt/slurm/config/bin/xio-compute-node-ami-configure.sh | ||
``` | ||
|
||
Do any additional configuration that you require such as configuring file system mounts and installing | ||
packages. | ||
|
||
Create an AMI from the instance and wait for it to become available. | ||
|
||
### Update the cluster with the XIO Iconfiguration | ||
|
||
Update the cluster with the XIO configuration. | ||
|
||
This will update the profiles and environment on the EMS server and configure the cluster for XIO. | ||
The only remaining step before you can submit jobs is to create the XIO VM image. | ||
|
||
### Create an XIO Image from the XIO ParallelCluster AMI | ||
|
||
Connect to the head node and create the XIO Image from the AMI you created. | ||
The IMAGE-NAME should be the same that you configured in the Pools. | ||
|
||
``` | ||
/opt/slurm/etc/exostellar/parse_helper.sh -a <AMI-ID1> -i <IMAGE-NAME> | ||
``` | ||
|
||
### Test launching an XIO VM | ||
|
||
Connect to the head node and test launching a VM. | ||
The pool, profile, and image_name should be from your configuration. | ||
The host name doesn't matter. | ||
|
||
``` | ||
/opt/slurm/etc/exostellar/teste_creasteVm.sh --pool <pool> --profile <profile> -i <image name> -h <host> | ||
``` | ||
|
||
### Run a test job using Slurm | ||
|
||
``` | ||
srun --pty -p xio- | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.