This module performs the following tasks:
- create an instance template from which execute points will be created
- create a managed instance group (MIG) for execute points
- create a Toolkit runner to configure the autoscaler to scale the MIG
It is expected to be used with the htcondor-install and htcondor-setup modules.

This module may be used multiple times in a blueprint to create multiple sets of execute points in an HTCondor pool. If it is used more than once, the setting name_prefix must be set to a value that is unique across all uses of the htcondor-execute-point module. If this constraint is not followed, terraform apply will likely fail with an error similar to the one shown below; the remedy is sketched after it.
```text
Error: Invalid value for variable

  on modules/embedded/community/modules/scheduler/htcondor-access-point/main.tf line 136, in module "startup_script":
 136:   runners = local.all_runners
     ├────────────────
     │ var.runners is list of map of string with 5 elements

All startup-script runners must have a unique destination.
```
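The remedy is to give each instance of the module its own name_prefix. A minimal sketch is shown below; the values grp1 and spot are illustrative, and any values unique within the blueprint will do:

```yaml
  - id: htcondor_execute_point
    source: community/modules/compute/htcondor-execute-point
    settings:
      name_prefix: grp1

  - id: htcondor_execute_point_spot
    source: community/modules/compute/htcondor-execute-point
    settings:
      name_prefix: spot
```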
HTCondor access points provisioned by the Toolkit are specially configured to honor an attribute named RequireId in each Job ClassAd. This value must be set to the ID of a MIG created by an instance of this module. The htcondor-access-point module includes a setting, var.default_mig_id, that will set this value automatically to the MIG ID corresponding to that module's execute points. If this setting is left unset, each job must set +RequireId explicitly. In all cases, the default can be overridden in the job submit file, as shown below:
```text
universe       = vanilla
executable     = /bin/echo
arguments      = "Hello, World!"
output         = out.$(ClusterId).$(ProcId)
error          = err.$(ClusterId).$(ProcId)
log            = log.$(ClusterId).$(ProcId)
request_cpus   = 1
request_memory = 100MB
+RequireId     = "htcondor-pool-ep-mig"
queue
```
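To submit this job and confirm its RequireId routing attribute, standard HTCondor commands can be used. This is a sketch; job.sub is a hypothetical file containing the submit description above:

```shell
# submit the job to the access point's queue
condor_submit job.sub

# print each queued job's cluster ID and its RequireId attribute
condor_q -af ClusterId RequireId
```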
A full example can be found in the examples README.
The following code snippet creates a pool with two sets of HTCondor execute points, one using on-demand pricing and the other using Spot pricing. Both use a startup script and network created in previous steps.
```yaml
  - id: htcondor_execute_point
    source: community/modules/compute/htcondor-execute-point
    use:
    - network1
    - htcondor_secrets
    - htcondor_setup
    - htcondor_cm
    settings:
      name_prefix: grp1  # must be unique across uses of this module
      instance_image:
        project: $(vars.project_id)
        family: $(vars.new_image_family)
      min_idle: 2

  - id: htcondor_execute_point_spot
    source: community/modules/compute/htcondor-execute-point
    use:
    - network1
    - htcondor_secrets
    - htcondor_setup
    - htcondor_cm
    settings:
      name_prefix: spot  # must be unique across uses of this module
      instance_image:
        project: $(vars.project_id)
        family: $(vars.new_image_family)
      spot: true

  - id: htcondor_access
    source: community/modules/scheduler/htcondor-access-point
    use:
    - network1
    - htcondor_secrets
    - htcondor_setup
    - htcondor_cm
    - htcondor_execute_point
    - htcondor_execute_point_spot
    settings:
      default_mig_id: $(htcondor_execute_point.mig_id)
      enable_public_ips: true
      instance_image:
        project: $(vars.project_id)
        family: $(vars.new_image_family)
    outputs:
    - access_point_ips
    - access_point_name
```
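Assuming the modules above are part of a complete blueprint (hypothetically named htcondor-pool.yaml), the deployment can be created and provisioned with the Toolkit CLI:

```shell
# generate the deployment folder from the blueprint
./ghpc create htcondor-pool.yaml

# provision the deployment
./ghpc deploy <<DEPLOYMENT_NAME>>
```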
HTCondor is maintained by the Center for High Throughput Computing at the University of Wisconsin-Madison, which provides support resources including discussion lists, documentation, and an issue tracker.
Regional MIGs are used to provision execute points. By default, VMs will be provisioned in any of the zones available in that region; however, they can be constrained to fewer zones (or a single zone) using var.zones.
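For example, a minimal sketch that constrains execute points to a pair of zones (the zone names are illustrative):

```yaml
    settings:
      zones:
      - us-central1-a
      - us-central1-c
```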
When the configuration of an execute point is changed, the MIG can be configured to replace the VM using a "proactive" or "opportunistic" policy. By default, the policy is set to opportunistic. In practice, this means that execute points will NOT be automatically replaced by Terraform when changes are made to the instance template or HTCondor configuration. We recommend leaving this at the default value, as it allows the HTCondor autoscaler to replace VMs when they become idle without disrupting running jobs.

However, if desired, var.update_policy can be set to "PROACTIVE" to enable automatic replacement, as sketched below. This will disrupt running jobs and send them back to the queue. Alternatively, one can leave the setting at the default value of "OPPORTUNISTIC", in which case VMs are updated only:

- intentionally, by issuing an update via the Cloud Console or using gcloud (shown below)
- when VMs become unhealthy or are otherwise automatically replaced (e.g., during regular Google Cloud maintenance)
For example, to manually update all instances in a MIG:
```shell
gcloud compute instance-groups managed update-instances \
    <<NAME-OF-MIG>> --all-instances --region <<REGION>> \
    --project <<PROJECT_ID>> --minimal-action replace
```
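Alternatively, a sketch of enabling proactive replacement from within a blueprint; note again that this will disrupt running jobs:

```yaml
    settings:
      update_policy: PROACTIVE
```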
When using OS Login with "external users" (users from outside the Google Cloud organization), Docker universe jobs will fail and cause the Docker daemon to crash. This stems from the use of POSIX user IDs (UIDs) outside the range supported by Docker. Please consider disabling OS Login if this atypical situation applies:
```yaml
vars:
  # add setting below to existing deployment variables
  enable_oslogin: DISABLE
```
Copyright 2022 Google LLC
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Requirements

Name | Version |
---|---|
terraform | >= 1.1 |
google | >= 4.0 |
Providers

Name | Version |
---|---|
google | >= 4.0 |
Modules

Name | Source | Version |
---|---|---|
execute_point_instance_template | terraform-google-modules/vm/google//modules/instance_template | 10.1.1 |
mig | terraform-google-modules/vm/google//modules/mig | 10.1.1 |
startup_script | github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script | v1.35.0&depth=1 |
Resources

Name | Type |
---|---|
google_storage_bucket_object.execute_config | resource |
google_compute_image.htcondor | data source |
google_compute_zones.available | data source |
Inputs

Name | Description | Type | Default | Required |
---|---|---|---|---|
central_manager_ips | List of IP addresses of HTCondor central managers | list(string) | n/a | yes |
deployment_name | Cluster Toolkit deployment name. HTCondor cloud resource names will include this value. | string | n/a | yes |
disk_size_gb | Boot disk size in GB | number | 100 | no |
disk_type | Disk type for template | string | "pd-balanced" | no |
distribution_policy_target_shape | Target shape across zones for the instance group managing execute points | string | "ANY" | no |
enable_oslogin | Enable or disable OS Login with "ENABLE" or "DISABLE". Set to "INHERIT" to inherit the project OS Login setting. | string | "ENABLE" | no |
enable_shielded_vm | Enable the Shielded VM configuration (var.shielded_instance_config). | bool | false | no |
execute_point_runner | A list of Toolkit runners for configuring an HTCondor execute point | list(map(string)) | [] | no |
execute_point_service_account_email | Service account for HTCondor execute point (e-mail format) | string | n/a | yes |
guest_accelerator | List of the type and count of accelerator cards attached to the instance. | list(object({...})) | [] | no |
htcondor_bucket_name | Name of HTCondor configuration bucket | string | n/a | yes |
instance_image | HTCondor execute point VM image. Expected fields: name (the name of the image; mutually exclusive with family), family (the image family to use; mutually exclusive with name), project (the project where the image is hosted). | map(string) | {...} | no |
labels | Labels to add to HTCondor execute points | map(string) | n/a | yes |
machine_type | Machine type to use for HTCondor execute points | string | "n2-standard-4" | no |
max_size | Maximum size of the HTCondor execute point pool. | number | 5 | no |
metadata | Metadata to add to HTCondor execute points | map(string) | {} | no |
min_idle | Minimum number of idle VMs in the HTCondor pool (if the pool reaches var.max_size, this minimum is not guaranteed); set to ensure jobs begin running more quickly. | number | 0 | no |
name_prefix | Name prefix given to hostnames in this group of execute points; must be unique across all instances of this module | string | n/a | yes |
network_self_link | The self link of the network HTCondor execute points will join | string | "default" | no |
network_storage | An array of network attached storage mounts to be configured | list(object({...})) | [] | no |
project_id | Project in which the HTCondor execute points will be created | string | n/a | yes |
region | The region in which HTCondor execute points will be created | string | n/a | yes |
service_account_scopes | Scopes by which to limit the service account attached to the execute points. | set(string) | [...] | no |
shielded_instance_config | Shielded VM configuration for the instance (must set var.enable_shielded_vm) | object({...}) | {...} | no |
spot | Provision VMs using discounted Spot pricing, allowing for preemption | bool | false | no |
subnetwork_self_link | The self link of the subnetwork HTCondor execute points will join | string | null | no |
target_size | Initial size of the HTCondor execute point pool; set to null (default) to avoid Terraform management of size. | number | null | no |
update_policy | Replacement policy for the execute point managed instance group ("PROACTIVE" to replace immediately or "OPPORTUNISTIC" to replace upon instance power cycle) | string | "OPPORTUNISTIC" | no |
windows_startup_ps1 | Startup script to run at boot time for Windows-based HTCondor execute points | list(string) | [] | no |
zones | Zone(s) in which execute points may be created. If not supplied, defaults to all zones in var.region. | list(string) | [] | no |
Outputs

Name | Description |
---|---|
autoscaler_runner | Toolkit runner to configure the HTCondor autoscaler |
mig_id | ID of the managed instance group containing the execute points |