Add exostellar infrastructure optimizer playbook
Configure XIO

Resolves #226
cartalla committed Sep 17, 2024
1 parent 55517db commit c052232
Showing 15 changed files with 957 additions and 0 deletions.
33 changes: 33 additions & 0 deletions docs/exostellar-infrastructure-optimizer.md
@@ -0,0 +1,33 @@
# Exostellar Infrastructure Optimizer

[Exostellar Infrastructure Optimizer](https://exostellar.io/infrastructureoptimizer-technical-information/) (XIO) runs applications in virtual machines (VMs) on EC2 instances and dynamically relocates the VMs between instances based on availability and cost.
Long-running, stateful jobs cannot normally be run on spot instances because they can't be restarted after a spot termination.
XIO reduces this risk by predicting spot terminations and dynamically relocating the VM to an on-demand instance.
When spot capacity becomes available again, the VM can be migrated back to a spot instance.
This allows you to save up to 90% over on-demand pricing by running on spot when capacity is available.
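
As a rough illustration of the savings arithmetic, the sketch below computes a blended hourly cost for a VM that spends part of its runtime on spot and the rest on on-demand. The function name, the prices, and the spot fraction are hypothetical placeholders, not Exostellar or AWS figures.

```python
def blended_hourly_cost(on_demand_price: float,
                        spot_discount: float,
                        spot_fraction: float) -> float:
    """Average hourly cost when `spot_fraction` of the runtime is on spot.

    All inputs are hypothetical. `spot_discount` is the fractional discount
    of spot relative to on-demand (0.9 means spot costs 10% of on-demand).
    """
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    spot_price = on_demand_price * (1.0 - spot_discount)
    return spot_fraction * spot_price + (1.0 - spot_fraction) * on_demand_price


# Hypothetical example: $2.00/hour on-demand, 50% spot discount,
# and half of the runtime spent on spot.
cost = blended_hourly_cost(2.0, 0.5, 0.5)
```

The upper bound on savings is only reached when the whole job runs on spot at the maximum discount; any time spent on on-demand after a relocation reduces it proportionally.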

## XIO Configuration

Refer to [Exostellar's documentation](https://docs.exostellar.io/latest/Latest/HPC-User/getting-started-installation) to make sure you have the latest instructions.

### Create IAM permissions stack

[Create the EC2 instance profiles](https://docs.exostellar.io/latest/Latest/HPC-User/getting-ready-prerequisites#GettingReady:Prerequisites-EC2InstanceProfiles).

* Download the CloudFormation template
* Create a stack using the template

### Install the Management Server

[Install the management server](https://docs.exostellar.io/latest/Latest/HPC-User/installing-management-server).

For the shared security group ID, use the SlurmLoginNodeSGId so that the management server has access to the Slurm head node.

### Configure Slurm

```
export MGMT_SERVER=10.4.130.5
export SLURM_CONF_DIR=/opt/slurm/res-eda-pc-3-10-1-rhel8-x86/etc
"I2Nsb3VkLWNvbmZpZwpydW5jbWQ6CiAgLSBbc2gsIC1jLCAibWtkaXIgLXAgL3hjb21wdXRlIl0KICAtIFtzaCwgLWMsICJtb3VudCAxNzIuMzEuMjQuNToveGNvbXB1dGUgL3hjb21wdXRlIl0KICAtIFtzaCwgLWMsICJta2RpciAtcCAvaG9tZS9zbHVybSJdCiAgLSBbc2gsIC1jLCAibW91bnQgMTcyLjMxLjI0LjU6L2hvbWUvc2x1cm0gL2hvbWUvc2x1cm0iXQogIC0gW3NoLCAtYywgInJtIC1yZiAvZXRjL3NsdXJtIl0KICAtIFtzaCwgLWMsICJsbiAtcyAveGNvbXB1dGUvc2x1cm0vIC9ldGMvc2x1cm0iXQogIC0gW3NoLCAtYywgImNwIC94Y29tcHV0ZS9zbHVybS9tdW5nZS5rZXkgL2V0Yy9tdW5nZS9tdW5nZS5rZXkiXQogIC0gW3NoLCAtYywgInN5c3RlbWN0bCByZXN0YXJ0IG11bmdlIl0KICAjIEFMV0FZUyBMQVNUIQogIC0gWwogICAgICBzaCwKICAgICAgLWMsCiAgICAgICJlY2hvIFhTUE9UX05PREVOQU1FID4gL3Zhci9ydW4vbm9kZW5hbWU7IHNjb250cm9sIHVwZGF0ZSBub2RlbmFtZT1YU1BPVF9OT0RFTkFNRSBub2RlYWRkcj1gaG9zdG5hbWUgLUlgIiwKICAgIF0KCg=="
```
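
The quoted string in the block above appears to be base64-encoded cloud-init user data: its first bytes decode to `#cloud-config` followed by `runcmd:`. A minimal sketch for inspecting such a string before using it is shown below; the helper name `decode_user_data` is hypothetical, and the sample input is only the short prefix of the string above.

```python
import base64


def decode_user_data(encoded: str) -> str:
    """Decode a base64 string (optionally wrapped in double quotes) to text."""
    cleaned = encoded.strip().strip('"')
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(cleaned, validate=True).decode("utf-8")


# Short prefix of the string above, used only as a demonstration input.
sample = '"I2Nsb3VkLWNvbmZpZwpydW5jbWQ6"'
decoded = decode_user_data(sample)
print(decoded)
```

Decoding the full string the same way shows the mount points and IP addresses baked into the user data, which is worth checking against your own environment.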
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -9,6 +9,7 @@ nav:
- 'config.md'
- 'res_integration.md'
- 'soca_integration.md'
- 'exostellar-infrastructure-optimizer.md'
- 'custom-amis.md'
- 'run_jobs.md'
- 'job_preemption.md'
7 changes: 7 additions & 0 deletions source/cdk/cdk_slurm_stack.py
@@ -467,6 +467,11 @@ def check_config(self):
            logger.error(f"Configured SlurmCtl instance type ({slurmctl_instance_type}) has {slurmctl_memory_in_gb} GB and needs at least 4.")
            config_errors += 1

        if 'Xio' in self.config['slurm']:
            if self.config['slurm']['ParallelClusterConfig']['Architecture'] != 'x86_64':
                logger.error(f"Xio is only supported on x86_64 architecture, not {self.config['slurm']['ParallelClusterConfig']['Architecture']}")
                config_errors += 1

        if config_errors:
            exit(1)
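
The architecture check in this hunk can be exercised in isolation. The sketch below is a standalone approximation, not the stack's actual method: the function name `count_xio_config_errors` is hypothetical, and it takes a plain dict where the original reads `self.config`.

```python
import logging

logger = logging.getLogger(__name__)


def count_xio_config_errors(config: dict) -> int:
    """Return the number of Xio-related configuration errors found."""
    errors = 0
    slurm = config.get('slurm', {})
    if 'Xio' in slurm:
        architecture = slurm.get('ParallelClusterConfig', {}).get('Architecture')
        if architecture != 'x86_64':
            # The f prefix is required for {architecture} to be interpolated.
            logger.error(f"Xio is only supported on x86_64 architecture, not {architecture}")
            errors += 1
    return errors
```

Returning a count instead of calling `exit(1)` directly keeps the check testable and lets the caller report every configuration error in one pass, which matches how `config_errors` is accumulated in the hunk.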

@@ -1993,6 +1998,8 @@ def get_instance_template_vars(self, instance_role):
            instance_template_vars['slurmrestd_socket_dir'] = '/opt/slurm/com'
            instance_template_vars['slurmrestd_socket'] = f"{instance_template_vars['slurmrestd_socket_dir']}/slurmrestd.socket"
            instance_template_vars['slurmrestd_uid'] = self.config['slurm']['SlurmCtl']['SlurmrestdUid']
            if 'Xio' in self.config['slurm']:
                instance_template_vars['xio_mgt_ip'] = self.config['slurm']['Xio']['ManagementServerIp']
        elif instance_role == 'ParallelClusterSubmitter':
            instance_template_vars['slurm_version'] = get_SLURM_VERSION(self.config)
            instance_template_vars['parallel_cluster_munge_version'] = get_PARALLEL_CLUSTER_MUNGE_VERSION(self.config)
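
The `xio_mgt_ip` variable set in this hunk is what the head node playbook in this commit tests with `when: xio_mgt_ip is defined`. A minimal sketch of that relationship, using a hypothetical standalone function rather than the stack's method:

```python
def xio_template_vars(config: dict) -> dict:
    """Return the Xio-related template variables for the head node."""
    template_vars = {}
    xio_config = config.get('slurm', {}).get('Xio')
    if xio_config is not None:
        # Direct indexing raises KeyError if ManagementServerIp is absent,
        # which mirrors the indexing used in the hunk above.
        template_vars['xio_mgt_ip'] = xio_config['ManagementServerIp']
    return template_vars


# Without an Xio section the variable is never set, so a role guarded by
# `when: xio_mgt_ip is defined` would be skipped.
assert 'xio_mgt_ip' not in xio_template_vars({'slurm': {}})
```

Because the config schema in this commit marks `ManagementServerIp` as optional, an `Xio` section without it would presumably raise a `KeyError` here; it may be worth either requiring the key or guarding the lookup.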
5 changes: 5 additions & 0 deletions source/cdk/config_schema.py
@@ -752,6 +752,11 @@ def get_config_schema(config):
                Optional('Partition', default='onprem'): str,
            }
        },
        Optional('Xio'): {
            Optional('ManagementServerStackName'): str,
            Optional('ManagementServerImageId'): str,
            Optional('ManagementServerIp'): str
        },
        Optional('SlurmUid', default=401): int,
        Optional('storage'): {
            #
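
To show what the new `Xio` schema entry accepts, the sketch below re-implements the same rule in plain Python instead of the `schema` package used by the source file: all three keys are optional and, when present, must be strings. The helper name `validate_xio_section` is hypothetical, and rejecting unknown keys is an assumption about how the schema is applied.

```python
XIO_KEYS = (
    'ManagementServerStackName',
    'ManagementServerImageId',
    'ManagementServerIp',
)


def validate_xio_section(xio: dict) -> list:
    """Return a list of problems; an empty list means the section is acceptable."""
    problems = []
    for key, value in xio.items():
        if key not in XIO_KEYS:
            problems.append(f"unexpected key: {key}")
        elif not isinstance(value, str):
            problems.append(f"{key} must be a string, got {type(value).__name__}")
    return problems


# A section containing only the management server IP passes.
assert validate_xio_section({'ManagementServerIp': '10.4.130.5'}) == []
```

Note that an empty `Xio` section also passes this check, since every key is optional.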
2 changes: 2 additions & 0 deletions source/resources/playbooks/ParallelClusterHeadNode.yml
@@ -6,5 +6,7 @@
  roles:
    - all
    - ParallelClusterHeadNode
    - role: exostellar_infrastructure_optimizer
      when: xio_mgt_ip is defined
    - security_updates
    - bug_fixes
@@ -0,0 +1,16 @@
exostellar_infrastructure_optimizer
=========

Configure Exostellar Infrastructure Optimizer (XIO).

This role runs on the Slurm head node and uploads configuration files to the XIO management server using curl commands.

Requirements
------------

Requires root permissions so that it can install the packages required by Slurm.

Role Variables
--------------

* cluster_name
* xio_mgt_ip