How to support NVLink between several k8s pods? #347
Comments
The plugin doesn't control that. As long as you have access to the GPUs and fabricmanager is running on your host, they should be able to communicate over NVLink.
With k8s-device-plugin, when we allocate a GPU to a pod, inside the pod we can only see that one allocated GPU and none of the other GPUs on the same host (using nvidia-smi). So I think the allocated GPU can't communicate with the other GPUs because there is no topology info inside the pod. Is there any solution to add topology info to the pod?
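One way to make all NVLink peers visible inside a single container is to request every GPU on the node in one pod rather than splitting them across pods. Here is a minimal, hypothetical sketch (the pod name, image, and command are illustrative; `nvidia.com/gpu` is the resource name exposed by k8s-device-plugin):

```yaml
# Hypothetical pod spec: request all 8 GPUs in one pod so every
# NVLink peer is visible to nvidia-smi inside the container.
apiVersion: v1
kind: Pod
metadata:
  name: nvlink-all-gpus              # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:23.10-py3   # example image
      command: ["nvidia-smi", "topo", "-m"]     # print the GPU/NVLink topology matrix
      resources:
        limits:
          nvidia.com/gpu: 8          # all GPUs on the node go to this pod
```

Inside this pod, `nvidia-smi topo -m` should show `NV#` entries between GPU pairs if NVLink is available, since the container now sees the full set of devices.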
Following this thread - the device plugin seems to expose only a single GPU to a given pod, so pods can only communicate via …
Hi, enroot/pyxis on Slurm seems to have a workaround that addresses this NCCL challenge - is there any plan to adopt it for k8s? Thanks!
Hi, @pokerc. Have you solved this problem? |
Hi there, we are using a DGX Station A100 cluster and training AI models on k8s with the pytorch-operator from Kubeflow. When we start a PyTorch job, we get 8 pods (1 master, 7 workers) and use PyTorch DDP in our code. But we find that NVLink between the 8 pods is not working (all 8 pods are on the same node), so we want to know whether k8s-device-plugin can enable NVLink between the pods, or whether there is any other k8s plugin that can help us do that.
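A common workaround for the single-node case described above is to run one pod with all 8 GPUs and let `torchrun` spawn the 8 DDP processes inside it, so NCCL can take the NVLink/P2P path between local ranks instead of crossing pod boundaries. A hedged sketch of a Kubeflow PyTorchJob doing this (the job name, image, and `train.py` script are hypothetical):

```yaml
# Hypothetical PyTorchJob: instead of 8 single-GPU pods, run one pod
# with all 8 GPUs and launch 8 DDP processes inside it via torchrun.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ddp-single-pod               # hypothetical name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:23.10-py3   # example image
              command:
                - torchrun
                - --standalone
                - --nproc_per_node=8   # one DDP process per local GPU
                - train.py             # hypothetical training script
              resources:
                limits:
                  nvidia.com/gpu: 8    # all GPUs in a single pod
```

The trade-off is that this gives up per-process pod isolation, but all local ranks then share one device namespace and NCCL can discover the NVLink topology directly.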