Add exostellar infrastructure optimizer playbook
Configure XIO

Resolves #226
cartalla committed Sep 26, 2024
1 parent 4a3c78c commit 75e6722
Showing 20 changed files with 1,356 additions and 1 deletion.
5 changes: 5 additions & 0 deletions docs/deployment-prerequisites.md
@@ -229,6 +229,11 @@ when you create the file system.
If the file system already exists, then attach the appropriate security group to the network interfaces of
the file systems.

## Create Exostellar Management Server

If you're going to use Exostellar Infrastructure Optimizer (XIO), then you will need to deploy the Exostellar management server.
See the [XIO page](../exostellar-infrastructure-optimizer) for details.

## Create Configuration File

Before you deploy a cluster you need to create a configuration file.
33 changes: 33 additions & 0 deletions docs/exostellar-infrastructure-optimizer.md
@@ -0,0 +1,33 @@
# Exostellar Infrastructure Optimizer

[Exostellar Infrastructure Optimizer](https://exostellar.io/infrastructureoptimizer-technical-information/) (XIO) runs applications in virtual machines (VMs) on EC2 instances and dynamically relocates the VMs between instances based on availability and cost.
Long-running, stateful jobs cannot normally be run on spot instances because they can't be restarted after a spot termination.
XIO reduces this risk by predicting spot terminations and dynamically relocating the VM to an on-demand instance.
When spot capacity becomes available again, the VM can be migrated back to a spot instance.
This allows you to save up to 90% over on-demand pricing by running on spot when capacity is available.

## XIO Configuration

Refer to [Exostellar's documentation](https://docs.exostellar.io/latest/Latest/HPC-User/getting-started-installation) to make sure you have the latest instructions.

### Create IAM permissions stack

[Create the EC2 instances profiles](https://docs.exostellar.io/latest/Latest/HPC-User/getting-ready-prerequisites#GettingReady:Prerequisites-EC2InstanceProfiles).

* Download the CloudFormation template
* Create a stack using the template, as sketched below
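
A minimal boto3 sketch of these two steps, assuming the downloaded template has been saved locally; the file name, stack name, and region here are placeholders, not values from the Exostellar docs:

```python
import boto3

# Create the IAM permissions stack from the downloaded Exostellar template.
# A Capabilities acknowledgement is required because the template creates IAM resources.
cfn = boto3.client('cloudformation', region_name='us-east-1')
with open('exostellar-permissions.yaml') as f:  # placeholder file name
    template_body = f.read()
cfn.create_stack(
    StackName='exostellar-permissions',  # placeholder stack name
    TemplateBody=template_body,
    Capabilities=['CAPABILITY_NAMED_IAM'],
)
```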

### Install the Management Server

[Install the management server](https://docs.exostellar.io/latest/Latest/HPC-User/installing-management-server).

For the shared security group ID, use the SlurmLoginNodeSGId so that the management server has access to the Slurm head node.
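
If you need to look that value up, here is a sketch, assuming your cluster's CloudFormation stack exposes the security group as an output named `SlurmLoginNodeSGId` (the stack name and region are placeholders):

```python
import boto3

# Find the SlurmLoginNodeSGId output of the cluster stack so it can be
# entered as the shared security group ID during the EMS install.
cfn = boto3.client('cloudformation', region_name='us-east-1')
outputs = cfn.describe_stacks(StackName='my-cluster')['Stacks'][0]['Outputs']
sg_id = next(o['OutputValue'] for o in outputs if o['OutputKey'] == 'SlurmLoginNodeSGId')
print(sg_id)
```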

### Configure Slurm

Set the management server IP and Slurm configuration directory for your environment (the values below are examples):

```bash
export MGMT_SERVER=10.4.130.5
export SLURM_CONF_DIR=/opt/slurm/res-eda-pc-3-10-1-rhel8-x86/etc
# Base64-encoded cloud-init user data for the XIO VMs (decoded below for
# reference). The USER_DATA variable name is illustrative; the original
# snippet left the string unassigned.
export USER_DATA="I2Nsb3VkLWNvbmZpZwpydW5jbWQ6CiAgLSBbc2gsIC1jLCAibWtkaXIgLXAgL3hjb21wdXRlIl0KICAtIFtzaCwgLWMsICJtb3VudCAxNzIuMzEuMjQuNToveGNvbXB1dGUgL3hjb21wdXRlIl0KICAtIFtzaCwgLWMsICJta2RpciAtcCAvaG9tZS9zbHVybSJdCiAgLSBbc2gsIC1jLCAibW91bnQgMTcyLjMxLjI0LjU6L2hvbWUvc2x1cm0gL2hvbWUvc2x1cm0iXQogIC0gW3NoLCAtYywgInJtIC1yZiAvZXRjL3NsdXJtIl0KICAtIFtzaCwgLWMsICJsbiAtcyAveGNvbXB1dGUvc2x1cm0vIC9ldGMvc2x1cm0iXQogIC0gW3NoLCAtYywgImNwIC94Y29tcHV0ZS9zbHVybS9tdW5nZS5rZXkgL2V0Yy9tdW5nZS9tdW5nZS5rZXkiXQogIC0gW3NoLCAtYywgInN5c3RlbWN0bCByZXN0YXJ0IG11bmdlIl0KICAjIEFMV0FZUyBMQVNUIQogIC0gWwogICAgICBzaCwKICAgICAgLWMsCiAgICAgICJlY2hvIFhTUE9UX05PREVOQU1FID4gL3Zhci9ydW4vbm9kZW5hbWU7IHNjb250cm9sIHVwZGF0ZSBub2RlbmFtZT1YU1BPVF9OT0RFTkFNRSBub2RlYWRkcj1gaG9zdG5hbWUgLUlgIiwKICAgIF0KCg=="
```
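
For reference, the base64 string decodes (e.g. with `base64 -d`) to the following cloud-init user data. It mounts the shared `/xcompute` and `/home/slurm` file systems, points `/etc/slurm` at the shared Slurm configuration, installs the munge key, and finally registers the VM's address with the Slurm controller. The NFS server IP is specific to this example environment:

```yaml
#cloud-config
runcmd:
  - [sh, -c, "mkdir -p /xcompute"]
  - [sh, -c, "mount 172.31.24.5:/xcompute /xcompute"]
  - [sh, -c, "mkdir -p /home/slurm"]
  - [sh, -c, "mount 172.31.24.5:/home/slurm /home/slurm"]
  - [sh, -c, "rm -rf /etc/slurm"]
  - [sh, -c, "ln -s /xcompute/slurm/ /etc/slurm"]
  - [sh, -c, "cp /xcompute/slurm/munge.key /etc/munge/munge.key"]
  - [sh, -c, "systemctl restart munge"]
  # ALWAYS LAST!
  - [
      sh,
      -c,
      "echo XSPOT_NODENAME > /var/run/nodename; scontrol update nodename=XSPOT_NODENAME nodeaddr=`hostname -I`",
    ]
```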
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -9,6 +9,7 @@ nav:
  - 'config.md'
  - 'res_integration.md'
  - 'soca_integration.md'
  - 'exostellar-infrastructure-optimizer.md'
  - 'custom-amis.md'
  - 'run_jobs.md'
  - 'job_preemption.md'
133 changes: 133 additions & 0 deletions source/cdk/cdk_slurm_stack.py
@@ -255,6 +255,9 @@ def check_config(self):
        if 'RESStackName' in self.config:
            self.update_config_for_res()

        if 'Xio' in self.config['slurm']:
            self.update_config_for_exostellar()

        if 'ErrorSnsTopicArn' not in self.config:
            logger.warning(f"ErrorSnsTopicArn not set. Provide error-sns-topic-arn on the command line or ErrorSnsTopicArn in the config file to get error notifications.")

@@ -467,6 +470,11 @@ def check_config(self):
logger.error(f"Configured SlurmCtl instance type ({slurmctl_instance_type}) has {slurmctl_memory_in_gb} GB and needs at least 4.")
config_errors += 1

if 'Xio' in self.config['slurm']:
if self.config['slurm']['ParallelClusterConfig']['Architecture'] != 'x86_64':
logger.error("Xio is only supported on x86_64 architecture, not {self.config['slurm']['ParallelClusterConfig']['Architecture']}")
config_errors += 1

if config_errors:
exit(1)

@@ -783,6 +791,121 @@ def update_config_for_res(self):
        self.config['slurm']['InstanceConfig']['AdditionalSecurityGroups'].append(res_home_mount_sg_id)
        logger.info(f"Added slurm/InstanceConfig/AdditionalSecurityGroups={res_home_mount_sg_id}")

    def update_config_for_exostellar(self):
        '''
        Update the config with information from the Exostellar Management Server (EMS) stack.
        Add the Exostellar and login node security groups.
        Configure the controller and worker identity roles and the EMS IP address.
        '''
        logger.info(f"Updating configuration for Exostellar")
        ems_stack_name = self.config['slurm']['Xio']['ManagementServerStackName']
        logger.info(f" stack: {ems_stack_name}")

        # Verify that the EMS stack exists and is in a stable status.
        cloudformation_client = boto3.client('cloudformation', region_name=self.config['Region'])
        ems_stack_name_found = False
        stack_statuses = {}
        stack_dicts = {}
        list_stacks_paginator = cloudformation_client.get_paginator('list_stacks')
        list_stacks_kwargs = {
            'StackStatusFilter': [
                'CREATE_COMPLETE',
                'ROLLBACK_COMPLETE',
                'UPDATE_COMPLETE',
                'UPDATE_ROLLBACK_COMPLETE',
                'IMPORT_COMPLETE',
                'IMPORT_ROLLBACK_COMPLETE'
            ]
        }
        for list_stacks_response in list_stacks_paginator.paginate(**list_stacks_kwargs):
            for stack_dict in list_stacks_response["StackSummaries"]:
                stack_name = stack_dict['StackName']
                if stack_name == ems_stack_name:
                    ems_stack_name_found = True
                    # Don't break here so that all of the stack names are collected for the error message below.
                stack_status = stack_dict['StackStatus']
                stack_statuses[stack_name] = stack_status
                stack_dicts[stack_name] = stack_dict
        if not ems_stack_name_found:
            message = f"CloudFormation EMS stack named {ems_stack_name} not found. Existing stacks:"
            for stack_name in sorted(stack_statuses):
                message += f"\n {stack_name:32}: status={stack_statuses[stack_name]}"
            logger.error(message)
            exit(1)

        # Get values from stack parameters
        stack_parameters = cloudformation_client.describe_stacks(StackName=ems_stack_name)['Stacks'][0]['Parameters']
        ems_vpc_id = None
        ems_subnet_id = None
        for stack_parameter_dict in stack_parameters:
            if stack_parameter_dict['ParameterKey'] == 'VPCId':
                ems_vpc_id = stack_parameter_dict['ParameterValue']
            elif stack_parameter_dict['ParameterKey'] == 'SubnetId':
                ems_subnet_id = stack_parameter_dict['ParameterValue']
        if not ems_vpc_id:
            logger.error(f"VPCId parameter not found in {ems_stack_name} EMS stack.")
            exit(1)
        if not ems_subnet_id:
            logger.error(f"SubnetId parameter not found in {ems_stack_name} EMS stack.")
            exit(1)
        if self.config['VpcId'] != ems_vpc_id:
            logger.error(f"Config file VpcId={self.config['VpcId']} is not the same as EMS {ems_stack_name} VPCId={ems_vpc_id}.")
            exit(1)

        # Get values from stack resources
        exostellar_role = None
        exostellar_instance_profile = None
        exostellar_security_group = None
        list_stack_resources_paginator = cloudformation_client.get_paginator('list_stack_resources')
        for stack_resource_summaries in list_stack_resources_paginator.paginate(StackName=ems_stack_name):
            for stack_resource_summary_dict in stack_resource_summaries['StackResourceSummaries']:
                if stack_resource_summary_dict['LogicalResourceId'] == 'ExostellarRole':
                    exostellar_role = stack_resource_summary_dict['PhysicalResourceId']
                elif stack_resource_summary_dict['LogicalResourceId'] == 'ExostellarInstanceProfile':
                    exostellar_instance_profile = stack_resource_summary_dict['PhysicalResourceId']
                elif stack_resource_summary_dict['LogicalResourceId'] == 'ExostellarSecurityGroup':
                    exostellar_security_group = stack_resource_summary_dict['PhysicalResourceId']
                if exostellar_role and exostellar_instance_profile and exostellar_security_group:
                    break
            if exostellar_role and exostellar_instance_profile and exostellar_security_group:
                break
        if not exostellar_role:
            logger.error(f"ExostellarRole resource not found in {ems_stack_name} EMS stack")
            exit(1)
        if not exostellar_instance_profile:
            logger.error(f"ExostellarInstanceProfile resource not found in {ems_stack_name} EMS stack")
            exit(1)
        if not exostellar_security_group:
            logger.error(f"ExostellarSecurityGroup resource not found in {ems_stack_name} EMS stack")
            exit(1)
        self.config['slurm']['Xio']['ControllerIdentityRole'] = f"arn:{Aws.PARTITION}:iam::{Aws.ACCOUNT_ID}:instance-profile/{exostellar_instance_profile}"
        self.config['slurm']['Xio']['WorkerIdentityRole'] = f"arn:{Aws.PARTITION}:iam::{Aws.ACCOUNT_ID}:instance-profile/{exostellar_instance_profile}"
        if 'ControllerSecurityGroupIds' not in self.config['slurm']['Xio']:
            self.config['slurm']['Xio']['ControllerSecurityGroupIds'] = []
        if 'WorkerSecurityGroupIds' not in self.config['slurm']['Xio']:
            self.config['slurm']['Xio']['WorkerSecurityGroupIds'] = []
        if exostellar_security_group not in self.config['slurm']['Xio']['ControllerSecurityGroupIds']:
            self.config['slurm']['Xio']['ControllerSecurityGroupIds'].append(exostellar_security_group)
        if exostellar_security_group not in self.config['slurm']['Xio']['WorkerSecurityGroupIds']:
            self.config['slurm']['Xio']['WorkerSecurityGroupIds'].append(exostellar_security_group)
        if self.slurm_login_node_sg_id:
            if self.slurm_login_node_sg_id not in self.config['slurm']['Xio']['ControllerSecurityGroupIds']:
                self.config['slurm']['Xio']['ControllerSecurityGroupIds'].append(self.slurm_login_node_sg_id)
            if self.slurm_login_node_sg_id not in self.config['slurm']['Xio']['WorkerSecurityGroupIds']:
                self.config['slurm']['Xio']['WorkerSecurityGroupIds'].append(self.slurm_login_node_sg_id)

        # Get values from stack outputs
        ems_ip_address = None
        stack_output_dicts = cloudformation_client.describe_stacks(StackName=ems_stack_name)['Stacks'][0]['Outputs']
        for stack_output_dict in stack_output_dicts:
            if stack_output_dict['OutputKey'] == '2ExostellarMgmtServerPrivateIP':
                ems_ip_address = stack_output_dict['OutputValue']
        if not ems_ip_address:
            logger.error(f"2ExostellarMgmtServerPrivateIP output not found in {ems_stack_name} EMS stack.")
            exit(1)
        self.config['slurm']['Xio']['ManagementServerIp'] = ems_ip_address

    def create_parallel_cluster_assets(self):
        # Create a secure hash of all of the assets so that changes can be easily detected to trigger cluster updates.
        self.assets_hash = sha512()
@@ -1993,6 +2116,16 @@ def get_instance_template_vars(self, instance_role):
            instance_template_vars['slurmrestd_socket_dir'] = '/opt/slurm/com'
            instance_template_vars['slurmrestd_socket'] = f"{instance_template_vars['slurmrestd_socket_dir']}/slurmrestd.socket"
            instance_template_vars['slurmrestd_uid'] = self.config['slurm']['SlurmCtl']['SlurmrestdUid']
            if 'Xio' in self.config['slurm']:
                instance_template_vars['xio_mgt_ip'] = self.config['slurm']['Xio']['ManagementServerIp']
                instance_template_vars['xio_availability_zone'] = self.config['slurm']['Xio']['AvailabilityZone']
                instance_template_vars['xio_controller_identity_role'] = self.config['slurm']['Xio']['ControllerIdentityRole']
                instance_template_vars['xio_controller_security_group_ids'] = self.config['slurm']['Xio']['ControllerSecurityGroupIds']
                instance_template_vars['subnet_id'] = self.config['SubnetId']
                instance_template_vars['xio_controller_image_id'] = self.config['slurm']['Xio']['ControllerImageId']
                instance_template_vars['xio_worker_identity_role'] = self.config['slurm']['Xio']['WorkerIdentityRole']
                instance_template_vars['xio_worker_security_group_ids'] = self.config['slurm']['Xio']['WorkerSecurityGroupIds']
                instance_template_vars['xio_worker_image_id'] = self.config['slurm']['Xio']['WorkerImageId']
        elif instance_role == 'ParallelClusterSubmitter':
            instance_template_vars['slurm_version'] = get_SLURM_VERSION(self.config)
            instance_template_vars['parallel_cluster_munge_version'] = get_PARALLEL_CLUSTER_MUNGE_VERSION(self.config)
12 changes: 12 additions & 0 deletions source/cdk/config_schema.py
@@ -752,6 +752,18 @@ def get_config_schema(config):
                Optional('Partition', default='onprem'): str,
                }
            },
            Optional('Xio'): {
                Optional('ManagementServerStackName'): str,
                Optional('ManagementServerImageId'): str,
                Optional('ManagementServerIp'): str,
                Optional('AvailabilityZone'): str,
                Optional('ControllerIdentityRole'): str,
                Optional('ControllerSecurityGroupIds'): [ str ],
                Optional('ControllerImageId'): str,
                Optional('WorkerIdentityRole'): str,
                Optional('WorkerSecurityGroupIds'): [ str ],
                Optional('WorkerImageId'): str
            },
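            # Example slurm/Xio section from a cluster config file that this
            # schema accepts (all values below are hypothetical):
            #   Xio:
            #     ManagementServerStackName: exostellar-ems
            #     AvailabilityZone: us-east-1a
            #     ControllerImageId: ami-0123456789abcdef0
            #     WorkerImageId: ami-0abcdef1234567890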
            Optional('SlurmUid', default=401): int,
            Optional('storage'): {
                #
2 changes: 2 additions & 0 deletions source/resources/playbooks/ParallelClusterHeadNode.yml
@@ -6,5 +6,7 @@
  roles:
    - all
    - ParallelClusterHeadNode
    - role: exostellar_infrastructure_optimizer
      when: xio_mgt_ip is defined
    - security_updates
    - bug_fixes
@@ -0,0 +1,16 @@
exostellar_infrastructure_optimizer
=========

Configure Exostellar Infrastructure Optimizer (XIO).

This is run on the Slurm head node and uploads configuration files to the XIO management server using curl commands.

Requirements
------------

Requires root permissions so that it can install the packages required by Slurm.

Role Variables
--------------
cluster_name
xio_mgt_ip
