Description

This module performs the following tasks:

  • create an instance template from which execute points will be created
  • create a managed instance group (MIG) for execute points
  • create a Toolkit runner to configure the autoscaler to scale the MIG

It is expected to be used with the htcondor-install and htcondor-setup modules.

Known limitations

This module may be used multiple times in a blueprint to create sets of execute points in an HTCondor pool. If it is used more than once, the setting name_prefix must be set to a value that is unique across all uses of the htcondor-execute-point module. Otherwise, terraform apply will likely fail with an error similar to the one shown below.

Error: Invalid value for variable

  on modules/embedded/community/modules/scheduler/htcondor-access-point/main.tf line 136, in module "startup_script":
 136:   runners = local.all_runners
    ├────────────────
    │ var.runners is list of map of string with 5 elements

All startup-script runners must have a unique destination.
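
As a minimal sketch of satisfying this constraint, the blueprint fragment below gives each use of the module its own name_prefix. The prefix values (ondemand, spot) are illustrative assumptions; the module IDs and use lists mirror the example later in this document.

```yaml
# Two uses of the module, each with a distinct name_prefix.
# The prefix values are hypothetical; any values work as long as they differ.
- id: htcondor_execute_point
  source: community/modules/compute/htcondor-execute-point
  use:
  - network1
  - htcondor_secrets
  - htcondor_setup
  - htcondor_cm
  settings:
    name_prefix: ondemand

- id: htcondor_execute_point_spot
  source: community/modules/compute/htcondor-execute-point
  use:
  - network1
  - htcondor_secrets
  - htcondor_setup
  - htcondor_cm
  settings:
    name_prefix: spot
    spot: true
```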

How to configure jobs to select execute points

HTCondor access points provisioned by the Toolkit are specially configured to honor an attribute named RequireId in each Job ClassAd. This value must be set to the ID of a MIG created by an instance of this module. The htcondor-access-point module includes a setting var.default_mig_id that sets this value automatically to the ID of the chosen MIG of execute points. If this setting is left unset, each job must specify +RequireId explicitly. In all cases, the default value can be overridden in the job submit file as shown below:

universe       = vanilla
executable     = /bin/echo
arguments      = "Hello, World!"
output         = out.$(ClusterId).$(ProcId)
error          = err.$(ClusterId).$(ProcId)
log            = log.$(ClusterId).$(ProcId)
request_cpus   = 1
request_memory = 100MB
+RequireId     = "htcondor-pool-ep-mig"
queue
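
To change the default for all jobs rather than per job, var.default_mig_id on the access point can point at a different execute point module. The fragment below is a sketch that assumes a second module with the ID htcondor_execute_point_spot, as in the example in the next section; other access point settings are omitted for brevity.

```yaml
- id: htcondor_access
  source: community/modules/scheduler/htcondor-access-point
  use:
  - network1
  - htcondor_secrets
  - htcondor_setup
  - htcondor_cm
  - htcondor_execute_point
  - htcondor_execute_point_spot
  settings:
    # Jobs that do not set +RequireId default to the Spot MIG (assumed module ID).
    default_mig_id: $(htcondor_execute_point_spot.mig_id)
```

Jobs can still select the other MIG by setting +RequireId in the submit file.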

Example

A full example can be found in the examples README.

The following code snippet creates a pool with 2 sets of HTCondor execute points, one using On-demand pricing and the other using Spot pricing. They use a startup script and network created in previous steps.

- id: htcondor_execute_point
  source: community/modules/compute/htcondor-execute-point
  use:
  - network1
  - htcondor_secrets
  - htcondor_setup
  - htcondor_cm
  settings:
    instance_image:
      project: $(vars.project_id)
      family: $(vars.new_image_family)
    min_idle: 2

- id: htcondor_execute_point_spot
  source: community/modules/compute/htcondor-execute-point
  use:
  - network1
  - htcondor_secrets
  - htcondor_setup
  - htcondor_cm
  settings:
    instance_image:
      project: $(vars.project_id)
      family: $(vars.new_image_family)
    spot: true

- id: htcondor_access
  source: community/modules/scheduler/htcondor-access-point
  use:
  - network1
  - htcondor_secrets
  - htcondor_setup
  - htcondor_cm
  - htcondor_execute_point
  - htcondor_execute_point_spot
  settings:
    default_mig_id: $(htcondor_execute_point.mig_id)
    enable_public_ips: true
    instance_image:
      project: $(vars.project_id)
      family: $(vars.new_image_family)
  outputs:
  - access_point_ips
  - access_point_name
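
The example above sets min_idle without further explanation. As a sketch of how the sizing inputs listed under Inputs fit together, the fragment below caps the pool at 20 VMs and asks for 2 idle VMs; the numbers are arbitrary illustrations, not recommendations.

```yaml
- id: htcondor_execute_point
  source: community/modules/compute/htcondor-execute-point
  use:
  - network1
  - htcondor_secrets
  - htcondor_setup
  - htcondor_cm
  settings:
    max_size: 20  # upper bound on the execute point pool (default: 5)
    min_idle: 2   # idle VMs requested (default: 0); not guaranteed once max_size is reached
```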

Support

HTCondor is maintained by the Center for High Throughput Computing at the University of Wisconsin-Madison. Support for HTCondor is available via:

Behavior of Managed Instance Group (MIG)

Regional MIGs are used to provision Execute Points. By default, VMs will be provisioned in any of the zones available in that region; however, they can be constrained to run in fewer zones (or a single zone) using var.zones.
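
A minimal sketch of constraining the zones follows. The zone names are placeholders chosen for illustration and are assumed to belong to the region in which the module is deployed.

```yaml
- id: htcondor_execute_point
  source: community/modules/compute/htcondor-execute-point
  use:
  - network1
  - htcondor_secrets
  - htcondor_setup
  - htcondor_cm
  settings:
    # Hypothetical zones; replace with zones in your deployment's region.
    zones:
    - us-central1-a
    - us-central1-c
```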

When the configuration of an Execute Point is changed, the MIG can be configured to replace the VM using a "proactive" or "opportunistic" policy. By default, the policy is set to opportunistic. In practice, this means that Execute Points will NOT be automatically replaced by Terraform when changes to the instance template / HTCondor configuration are made. We recommend leaving this at the default value as it will allow the HTCondor autoscaler to replace VMs when they become idle without disrupting running jobs.

However, if desired, var.update_policy can be set to "PROACTIVE" to enable automatic replacement. This will disrupt running jobs and send them back to the queue. Alternatively, one can leave the setting at the default value of "OPPORTUNISTIC", in which case VMs are replaced when:

  • an update is issued intentionally via Cloud Console or using gcloud (below)
  • VMs become unhealthy or are otherwise automatically replaced (e.g. regular Google Cloud maintenance)

For example, to manually update all instances in a MIG:

gcloud compute instance-groups managed update-instances \
   <<NAME-OF-MIG>> --all-instances --region <<REGION>> \
   --project <<PROJECT_ID>> --minimal-action replace
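
If disrupting running jobs is acceptable, the proactive policy described above can be selected in the blueprint instead of issuing manual updates. A sketch, with the module ID and use list taken from the example earlier in this document:

```yaml
- id: htcondor_execute_point
  source: community/modules/compute/htcondor-execute-point
  use:
  - network1
  - htcondor_secrets
  - htcondor_setup
  - htcondor_cm
  settings:
    # Replace VMs as soon as the configuration changes (default: OPPORTUNISTIC).
    # Running jobs on replaced VMs are sent back to the queue.
    update_policy: PROACTIVE
```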

Known Issues

When using OS Login with "external users" (outside of the Google Cloud organization), Docker universe jobs will fail and cause the Docker daemon to crash. This stems from the use of POSIX user IDs (uid) outside the range supported by Docker. Please consider disabling OS Login if this atypical situation applies.

vars:
  # add setting below to existing deployment variables
  enable_oslogin: DISABLE

License

Copyright 2022 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.1 |
| google | >= 4.0 |

Providers

| Name | Version |
|------|---------|
| google | >= 4.0 |

Modules

| Name | Source | Version |
|------|--------|---------|
| execute_point_instance_template | terraform-google-modules/vm/google//modules/instance_template | 10.1.1 |
| mig | terraform-google-modules/vm/google//modules/mig | 10.1.1 |
| startup_script | github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script | v1.35.0&depth=1 |

Resources

| Name | Type |
|------|------|
| google_storage_bucket_object.execute_config | resource |
| google_compute_image.htcondor | data source |
| google_compute_zones.available | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| central_manager_ips | List of IP addresses of HTCondor Central Managers | list(string) | n/a | yes |
| deployment_name | Cluster Toolkit deployment name. HTCondor cloud resource names will include this value. | string | n/a | yes |
| disk_size_gb | Boot disk size in GB | number | 100 | no |
| disk_type | Disk type for template | string | "pd-balanced" | no |
| distribution_policy_target_shape | Target shape across zones for instance group managing execute points | string | "ANY" | no |
| enable_oslogin | Enable or Disable OS Login with "ENABLE" or "DISABLE". Set to "INHERIT" to inherit project OS Login setting. | string | "ENABLE" | no |
| enable_shielded_vm | Enable the Shielded VM configuration (var.shielded_instance_config). | bool | false | no |
| execute_point_runner | A list of Toolkit runners for configuring an HTCondor execute point | list(map(string)) | [] | no |
| execute_point_service_account_email | Service account for HTCondor execute point (e-mail format) | string | n/a | yes |
| guest_accelerator | List of the type and count of accelerator cards attached to the instance. | list(object({ type = string, count = number })) | [] | no |
| htcondor_bucket_name | Name of HTCondor configuration bucket | string | n/a | yes |
| instance_image | HTCondor execute point VM image. Expected fields: name (the name of the image; mutually exclusive with family), family (the image family to use; mutually exclusive with name), project (the project where the image is hosted). | map(string) | { "family": "hpc-rocky-linux-8", "project": "cloud-hpc-image-public" } | no |
| labels | Labels to add to HTCondor execute points | map(string) | n/a | yes |
| machine_type | Machine type to use for HTCondor execute points | string | "n2-standard-4" | no |
| max_size | Maximum size of the HTCondor execute point pool. | number | 5 | no |
| metadata | Metadata to add to HTCondor execute points | map(string) | {} | no |
| min_idle | Minimum number of idle VMs in the HTCondor pool (if the pool reaches var.max_size, this minimum is not guaranteed); set to ensure jobs begin running more quickly. | number | 0 | no |
| name_prefix | Name prefix given to hostnames in this group of execute points; must be unique across all instances of this module | string | n/a | yes |
| network_self_link | The self link of the network HTCondor execute points will join | string | "default" | no |
| network_storage | An array of network attached storage mounts to be configured | list(object({ server_ip = string, remote_mount = string, local_mount = string, fs_type = string, mount_options = string, client_install_runner = map(string), mount_runner = map(string) })) | [] | no |
| project_id | Project in which the HTCondor execute points will be created | string | n/a | yes |
| region | The region in which HTCondor execute points will be created | string | n/a | yes |
| service_account_scopes | Scopes by which to limit the service account attached to execute points. | set(string) | [ "https://www.googleapis.com/auth/cloud-platform" ] | no |
| shielded_instance_config | Shielded VM configuration for the instance (must set var.enable_shielded_vm) | object({ enable_secure_boot = bool, enable_vtpm = bool, enable_integrity_monitoring = bool }) | { "enable_integrity_monitoring": true, "enable_secure_boot": true, "enable_vtpm": true } | no |
| spot | Provision VMs using discounted Spot pricing, allowing for preemption | bool | false | no |
| subnetwork_self_link | The self link of the subnetwork HTCondor execute points will join | string | null | no |
| target_size | Initial size of the HTCondor execute point pool; set to null (default) to avoid Terraform management of size. | number | null | no |
| update_policy | Replacement policy for Execute Point Managed Instance Group ("PROACTIVE" to replace immediately or "OPPORTUNISTIC" to replace upon instance power cycle) | string | "OPPORTUNISTIC" | no |
| windows_startup_ps1 | Startup script to run at boot-time for Windows-based HTCondor execute points | list(string) | [] | no |
| zones | Zone(s) in which execute points may be created. If not supplied, will default to all zones in var.region. | list(string) | [] | no |

Outputs

| Name | Description |
|------|-------------|
| autoscaler_runner | Toolkit runner to configure the HTCondor autoscaler |
| mig_id | ID of the managed instance group containing the execute points |