Add a3u-gke-gcs blueprint #3454

Open · wants to merge 3 commits into base: develop

Conversation

@samskillman (Collaborator) commented Dec 21, 2024

In addition to the blueprint, which gives an opinionated way to mount buckets for training and checkpointing, I modified the gke-persistent-volume module so it can use the mount_options specified in network_storage. It previously just hardcoded implicit-dirs.
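
For illustration only, a minimal sketch of how a bucket might be described through network_storage with explicit mount options; the module id, bucket name, and option values below are placeholders, not taken from this PR:

  - id: training-bucket                     # hypothetical module id
    source: modules/file-system/pre-existing-network-storage
    settings:
      remote_mount: my-training-bucket      # placeholder bucket name
      local_mount: /data
      fs_type: gcsfuse
      # mount_options is now passed through by gke-persistent-volume instead of
      # being hardcoded to implicit-dirs; option values here are illustrative.
      mount_options: implicit-dirs,metadata-cache:ttl-secs:-1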

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@samskillman samskillman added the release-key-new-features Added to release notes under the "Key New Features" heading. label Dec 21, 2024
@samskillman samskillman requested a review from cboneti December 21, 2024 07:46
# CIDR block containing the IP of the machine calling terraform and kubectl.
# The value can be more specific if the IPs that will run kubectl are known,
# e.g. the local system running Terraform or a remote node.
authorized_cidr: 0.0.0.0/0
Contributor:
Should this be kept as the default value, instead of the traditional <your-ip-address>/32 used in other blueprints?

Member:
I feel keeping it as /32 gives users a more consistent experience; they can always set it to 0.0.0.0/0 to open it up (maybe add a comment in the blueprint noting this?).
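
For reference, the suggested default plus an explanatory comment could look roughly like this (the address is a placeholder):

  # CIDR containing the IP of the machine(s) that will run terraform and kubectl,
  # e.g. <your-ip-address>/32. Set to 0.0.0.0/0 only if you intentionally want
  # to allow access from any address.
  authorized_cidr: <your-ip-address>/32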


# Install Kueue, Jobset, and NCCL installer
- id: workload-manager-install
source: github.com/GoogleCloudPlatform/cluster-toolkit.git//modules/management/kubectl-apply?ref=e0c690b
Contributor:
Kueue v0.10.0 is the recommended solution for A3 Ultra GA. Please consider modifying this piece in line with that in the latest blueprint.
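
As a rough illustration of that suggestion, the kubectl-apply settings might look like the following; the field names and pinned versions are assumptions based on other blueprints, not part of this PR:

  - id: workload-manager-install
    source: modules/management/kubectl-apply
    use: [gke-cluster]                # assumed id of the cluster module
    settings:
      kueue:
        install: true
        version: v0.10.0              # version recommended in this review
      jobset:
        install: true
        version: v0.7.2               # illustrative JobSet version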

resources:
- name: "nvidia.com/gpu"
  nominalQuota: ${num_gpus}

Contributor:
In line with using Kueue v0.10.0 as the recommended solution, please modify this configuration file to be similar to the tas-queues.yaml file.

Also, have we tried using this configuration to provision a cluster and schedule a workload?
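
A rough sketch of the direction being suggested, i.e. a topology-aware Kueue configuration in the spirit of tas-queues.yaml; the API versions, kinds, and node label names below are assumptions and should be checked against the actual file:

apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
  name: gke-default
spec:
  levels:
  - nodeLabel: cloud.google.com/gce-topology-block       # assumed topology labels
  - nodeLabel: cloud.google.com/gce-topology-subblock
  - nodeLabel: kubernetes.io/hostname
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a3-ultra
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-h200-141gb  # assumed accelerator label value
  topologyName: gke-default
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: a3-ultra
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a3-ultra
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: ${num_gpus}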

subnet_ip: 192.168.64.0/18

- id: gke-a3-ultra-rdma-net
source: github.com/GoogleCloudPlatform/cluster-toolkit.git//community/modules/network/rdma-vpc?ref=98c49fe
Contributor:
See the latest PR #3456. If this needs to be on the main branch, we will remove the refs and use "versioned blueprints" to refer to the latest develop branch.

# CIDR block containing the IP of the machine calling terraform and kubectl.
# The value can be more specific if the IPs that will run kubectl are known,
# e.g. the local system running Terraform or a remote node.
authorized_cidr: 0.0.0.0/0
Member:
I feel keeping it as /32 gives users a more consistent experience; they can always set it to 0.0.0.0/0 to open it up (maybe add a comment in the blueprint noting this?).

training_bucket_name: # Name of bucket that holds training data
checkpoint_bucket_name: # Name of bucket used for checkpoints
system_node_pool_disk_size_gb: 200
a3ultra_node_pool_disk_size_gb: 100
Member:
100 GB seems too low for AI/ML images; this could lead to out-of-disk issues and pods getting stuck pulling images.

Member:
We have seen this kind of issue with 200 GB before and ended up using 500 GB.
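
A sketch of the adjustment being suggested here; the 500 GB value comes from the reviewers' experience above, not from this PR:

  system_node_pool_disk_size_gb: 200
  a3ultra_node_pool_disk_size_gb: 500   # larger boot disk to avoid pods stuck pulling large AI/ML images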

* **Cloud Storage Fuse Integration:** Enables seamless access to GCS buckets
from within your containers using the Cloud Storage Fuse CSI Driver. Cloud
Storage Fuse is configured to utilize the 12 TB of Local SSD
* **Hierarchical Namespace Buckets:** Leverages GCS buckets with Hierarchical
Member:
GCS HNS is only supported by GKE 1.31 and later.
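
If the blueprint needs to guarantee HNS support, pinning the cluster to a 1.31+ release could look roughly like this; the version_prefix setting name is an assumption about the gke-cluster module and the value is illustrative:

  - id: gke-cluster
    source: modules/scheduler/gke-cluster
    settings:
      version_prefix: "1.31."   # assumed setting; ensures a GKE version with GCS HNS support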
