
how to support nvlink between several k8s pods? #347

Closed
pokerc opened this issue Nov 23, 2022 · 5 comments

Comments

pokerc commented Nov 23, 2022

Hi there, we are running a DGX Station A100 cluster and training AI models on k8s with the pytorch-operator from kubeflow. When we start a PyTorch job we get 8 pods (1 master, 7 workers) and use PyTorch DDP in our code. However, we find that NVLink between the 8 pods is not working (all 8 pods are on the same node), so we want to know whether k8s-device-plugin can enable NVLink between the pods, or whether there is any other k8s plugin that can help us do that?
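For context, a minimal sketch of the kind of per-pod DDP job described above (illustrative, not taken from the issue). It assumes the kubeflow pytorch-operator injects the usual MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE environment variables into each pod, and that each pod sees exactly one GPU.

```python
# Minimal sketch of the per-pod DDP setup described above (illustrative, not
# from the issue). Assumes the kubeflow pytorch-operator injects MASTER_ADDR,
# MASTER_PORT, RANK and WORLD_SIZE, and that each pod sees exactly one GPU.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # With a single GPU visible per pod, the local device is always cuda:0.
    torch.cuda.set_device(0)

    # NCCL chooses its transport (NVLink/P2P, shared memory, or sockets)
    # based on what it can discover from inside the container.
    dist.init_process_group(backend="nccl")

    model = torch.nn.Linear(1024, 1024).cuda()
    ddp_model = DDP(model, device_ids=[0])

    x = torch.randn(32, 1024, device="cuda")
    ddp_model(x).sum().backward()  # gradients are all-reduced across the 8 pods

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```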


klueska commented Nov 23, 2022

The plugin doesn’t control that. So long as you have access to the GPUs and you have fabricmanager running on your host, they should be able to communicate using NVLink.
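One way to check this from wherever the job actually runs (a sketch, not part of the plugin): `nvidia-smi topo -m` prints the GPU interconnect matrix (NV# entries indicate NVLink), and torch can report whether peer access is possible between the devices it can see.

```python
# Sketch: check whether NVLink/P2P is visible from this container or host.
import subprocess
import torch

# GPU interconnect topology as seen from here (NV# entries mean NVLink).
print(subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True).stdout)

# Peer-to-peer capability between visible GPUs (only meaningful if more
# than one GPU is visible in this container).
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            print(f"GPU{i} -> GPU{j} peer access:",
                  torch.cuda.can_device_access_peer(i, j))
```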


pokerc commented Nov 28, 2022

> The plugin doesn’t control that. So long as you have access to the GPUs and you have fabricmanager running on your host, they should be able to communicate using NVLink.

With k8s-device-plugin, when we allocate a GPU to a pod, inside the pod we can only see the one GPU that has been allocated and can't see the other GPUs on the same host (using nvidia-smi). So I think the allocated GPU can't communicate with the other GPUs because there is no topology info in the pod. Is there any solution to add the topology info into the pod?
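For what it's worth, a sketch of the arrangement the answer above implies: the device plugin exposes exactly the GPUs a container requests, so a single pod that requests all 8 GPUs sees all of them (and their NVLink topology), whereas eight single-GPU pods each see only their own device. The example below uses the official `kubernetes` Python client; the image, names, and entrypoint are illustrative.

```python
# Illustrative sketch (not from the thread): request all 8 GPUs in one pod so
# that the full NVLink topology is visible inside that pod.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

container = client.V1Container(
    name="trainer",
    image="nvcr.io/nvidia/pytorch:22.10-py3",   # illustrative image tag
    command=["python", "train.py"],             # illustrative entrypoint
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "8"}          # all 8 GPUs visible in this pod
    ),
)

pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="ddp-all-gpus"),
    spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```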

@shaowei-su

Following thread - the device plugin seems to only expose a single GPU's info to a given pod, and thus pods can only communicate via NET/Socket/0.
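A sketch of how that can be confirmed: NCCL reports which transport it selects for each channel (P2P/NVLink, shared memory, or sockets, which is where lines like `via NET/Socket/0` come from) when NCCL_DEBUG is set to INFO before the process group is initialized.

```python
# Sketch: turn on NCCL's own logging to see which transport it selects.
# Run inside the job pods; requires the usual env:// rendezvous variables.
import os

os.environ["NCCL_DEBUG"] = "INFO"             # log ring/transport setup
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # optional: reduce log volume

import torch.distributed as dist

dist.init_process_group(backend="nccl")
# The NCCL log then shows, per channel, whether peers are reached via
# P2P (NVLink), SHM (shared memory), or NET/Socket.
```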

@lipovsek-aws

Hi, enroot/pyxis on Slurm seems to have a workaround that addresses this NCCL challenge - is there any plan to adopt it for k8s? Thanks!

pokerc closed this as completed Sep 6, 2023
@freelizhun

Hi, @pokerc. Have you solved this problem?
