Add a3u-gke-gcs blueprint #3454

Open · wants to merge 3 commits into base: develop

Conversation

@samskillman (Collaborator) commented Dec 21, 2024

In addition to the blueprint, which gives an opinionated way to mount buckets for training and checkpointing, I modified the gke-persistent-volume module so it can use the mount_options specified in network_storage. It previously just hardcoded implicit-dirs.
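
For illustration only, a minimal sketch of how a bucket might be described through network_storage with explicit mount options; the module id, bucket name, and option values below are placeholders, not taken from this PR:

  - id: training-bucket                     # hypothetical module id
    source: modules/file-system/pre-existing-network-storage
    settings:
      remote_mount: my-training-bucket      # placeholder bucket name
      local_mount: /data
      fs_type: gcsfuse
      # mount_options is now passed through by gke-persistent-volume instead of
      # being hardcoded to implicit-dirs; option values here are illustrative.
      mount_options: implicit-dirs,metadata-cache:ttl-secs:-1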

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@samskillman samskillman added the release-key-new-features Added to release notes under the "Key New Features" heading. label Dec 21, 2024
@samskillman samskillman requested a review from cboneti December 21, 2024 07:46
# CIDR block containing the IP of the machine calling terraform and kubectl.
# The value can be more specific if the IPs that will run kubectl are known,
# e.g. the local system running Terraform or a remote node.
authorized_cidr: 0.0.0.0/0
Contributor:
Should this be kept as the default value, instead of the traditional <your-ip-address>/32 used in other blueprints?

Member:
I feel keeping it as /32 gives users a more consistent experience; they can always set it to 0.0.0.0/0 to open it up (maybe add a comment in the blueprint noting this?).
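
For reference, the suggested default plus an explanatory comment could look roughly like this (the address is a placeholder):

  # CIDR containing the IP of the machine(s) that will run terraform and kubectl,
  # e.g. <your-ip-address>/32. Set to 0.0.0.0/0 only if you intentionally want
  # to allow access from any address.
  authorized_cidr: <your-ip-address>/32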


# Install Kueue, Jobset, and NCCL installer
- id: workload-manager-install
source: github.com/GoogleCloudPlatform/cluster-toolkit.git//modules/management/kubectl-apply?ref=e0c690b
Contributor:
Kueue v0.10.0 is the recommended solution for A3 Ultra GA. Please consider modifying this piece in line with that in the latest blueprint.
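
As a rough illustration of that suggestion, the kubectl-apply settings might look like the following; the field names and pinned versions are assumptions based on other blueprints, not part of this PR:

  - id: workload-manager-install
    source: modules/management/kubectl-apply
    use: [gke-cluster]                # assumed id of the cluster module
    settings:
      kueue:
        install: true
        version: v0.10.0              # version recommended in this review
      jobset:
        install: true
        version: v0.7.2               # illustrative JobSet version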

resources:
- name: "nvidia.com/gpu"
  nominalQuota: ${num_gpus}

Contributor:
In line with using Kueue v0.10.0 as the recommended solution, please modify this configuration file to be similar to the tas-queues.yaml file.

Also, have we tried using this configuration to provision a cluster and schedule a workload?
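
A rough sketch of the direction being suggested, i.e. a topology-aware Kueue configuration in the spirit of tas-queues.yaml; the API versions, kinds, and node label names below are assumptions and should be checked against the actual file:

apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
  name: gke-default
spec:
  levels:
  - nodeLabel: cloud.google.com/gce-topology-block       # assumed topology labels
  - nodeLabel: cloud.google.com/gce-topology-subblock
  - nodeLabel: kubernetes.io/hostname
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a3-ultra
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-h200-141gb  # assumed accelerator label value
  topologyName: gke-default
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: a3-ultra
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a3-ultra
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: ${num_gpus}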

subnet_ip: 192.168.64.0/18

- id: gke-a3-ultra-rdma-net
source: github.com/GoogleCloudPlatform/cluster-toolkit.git//community/modules/network/rdma-vpc?ref=98c49fe
Contributor:
See the latest PR #3456. If this needs to be on the main branch, we will remove the refs and use "versioned blueprints" to refer to the latest develop branch.

# CIDR block containing the IP of the machine calling terraform and kubectl.
# The value can be more specific if the IPs that will run kubectl are known,
# e.g. the local system running Terraform or a remote node.
authorized_cidr: 0.0.0.0/0
Member:
I feel keeping it as /32 gives users a more consistent experience; they can always set it to 0.0.0.0/0 to open it up (maybe add a comment in the blueprint noting this?).

training_bucket_name: # Name of bucket that holds training data
checkpoint_bucket_name: # Name of bucket used for checkpoints
system_node_pool_disk_size_gb: 200
a3ultra_node_pool_disk_size_gb: 100
Member:
100 GB seems too low for AI/ML images; this could lead to out-of-disk issues and pods getting stuck pulling images.

Member:
We have seen this kind of issue with 200 GB before and ended up using 500 GB.
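
A sketch of the adjustment being suggested here; the 500 GB value comes from the reviewers' experience above, not from this PR:

  system_node_pool_disk_size_gb: 200
  a3ultra_node_pool_disk_size_gb: 500   # larger boot disk to avoid pods stuck pulling large AI/ML images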

* **Cloud Storage Fuse Integration:** Enables seamless access to GCS buckets
from within your containers using the Cloud Storage Fuse CSI Driver. Cloud
Storage Fuse is configured to utilize the 12 TB of Local SSD
* **Hierarchical Namespace Buckets:** Leverages GCS buckets with Hierarchical
Member:
GCS HNS is only supported by GKE 1.31 and later.
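
If the blueprint needs to guarantee HNS support, pinning the cluster to a 1.31+ release could look roughly like this; the version_prefix setting name is an assumption about the gke-cluster module and the value is illustrative:

  - id: gke-cluster
    source: modules/scheduler/gke-cluster
    settings:
      version_prefix: "1.31."   # assumed setting; ensures a GKE version with GCS HNS support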
