set hostNetwork to True for TPUGKEJob #641

Open. Wants to merge 3 commits into base: main.


@samos123 samos123 commented Aug 8, 2024

This helped increase v5e-512 performance from 50% MFU to 58% MFU. The gains could be more significant when going beyond 2 slices.

The majority of Google TPU benchmarks run on GKE use hostNetwork=true. XPK also sets this by default.

The performance increase comes from bypassing container networking. Setting hostNetwork=true lets the pod use the host NICs directly, without first traversing the container NIC and Linux network bridges.
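
As a rough illustration of the setting described above, here is a minimal sketch of enabling host networking on a Kubernetes pod spec expressed as a Python dict. The helper name `with_host_network` and the example container are hypothetical and not taken from this PR. `hostNetwork` and `dnsPolicy` are standard Kubernetes pod spec fields, but the `ClusterFirstWithHostNet` value is an assumption, since the excerpt here does not show which DNS policy the PR sets.

```python
from typing import Any, Dict


def with_host_network(pod_spec: Dict[str, Any]) -> Dict[str, Any]:
    """Returns a shallow copy of `pod_spec` with host networking enabled.

    Hypothetical helper for illustration; not the code from this PR.
    """
    spec = dict(pod_spec)
    # Share the node's network namespace instead of using a pod-level one.
    spec["hostNetwork"] = True
    # Assumed value: this policy lets host-network pods keep resolving cluster DNS names.
    spec["dnsPolicy"] = "ClusterFirstWithHostNet"
    return spec


# Hypothetical minimal pod spec.
base_spec = {"containers": [{"name": "trainer", "image": "example/image:latest"}]}
host_spec = with_host_network(base_spec)
```

The helper copies the input rather than mutating it, so the same base spec can still be used for pods that must stay on the container network.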

@samos123 samos123 marked this pull request as ready for review August 8, 2024 19:51
@ruomingp ruomingp left a comment

Thanks!

volumes=volumes,
)

# hostNetwork True and dnsPolicy do not work with Workload Identity and GCS Fuse.
I suppose we should check here that Workload Identity is not being used, not just GCS Fuse. As synced offline, I will do more testing to see whether this is necessary before merging.

@amcw7777 amcw7777 Aug 12, 2024

Can I have more context on why hostNetwork and dnsPolicy do not work with Workload Identity?

Contributor Author

GKE Workload Identity requires forcing metadata traffic to a specific pod. If you set hostNetwork: true, that is no longer possible, so Workload Identity doesn't work for that pod.

Note that your GKE clusters can keep using Workload Identity. The limitation is only that it won't work for any pod that uses hostNetwork: true. All your other pods, which use hostNetwork: false, can continue to use Workload Identity just as before.
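
The check discussed in this thread could look roughly like the sketch below: turn on host networking only for pods that need neither GCS Fuse nor Workload Identity. This is an assumption-laden illustration, not the code in this PR. The function name `maybe_enable_host_network` and its `uses_gcsfuse` and `uses_workload_identity` flags are hypothetical, and the `ClusterFirstWithHostNet` DNS policy is an assumed value.

```python
from typing import Any, Dict


def maybe_enable_host_network(
    pod_spec: Dict[str, Any],
    *,
    uses_gcsfuse: bool,
    uses_workload_identity: bool,
) -> Dict[str, Any]:
    """Returns a copy of `pod_spec`, enabling host networking only when it is safe.

    Hypothetical helper: the flags are assumed to be known by the caller.
    """
    spec = dict(pod_spec)
    if uses_gcsfuse or uses_workload_identity:
        # Per the discussion above, these features do not work with hostNetwork: true,
        # so leave the pod on the default container network.
        return spec
    spec["hostNetwork"] = True
    # Assumed DNS policy for host-network pods.
    spec["dnsPolicy"] = "ClusterFirstWithHostNet"
    return spec


# Hypothetical pod spec; one pod needs Workload Identity and one does not.
pod = {"containers": [{"name": "trainer"}]}
fast_pod = maybe_enable_host_network(pod, uses_gcsfuse=False, uses_workload_identity=False)
wi_pod = maybe_enable_host_network(pod, uses_gcsfuse=False, uses_workload_identity=True)
```

Deciding per pod matches the point made above: the cluster as a whole keeps Workload Identity, and only the pods that opt into host networking give it up.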

5 participants