
how to support nvlink between several k8s pods? #347

Closed
pokerc opened this issue Nov 23, 2022 · 5 comments

Comments

pokerc commented Nov 23, 2022

Hi there, we are running a DGX Station A100 cluster and training AI models on k8s with the pytorch-operator from kubeflow. When we start a PyTorch job we get 8 pods (1 master, 7 workers) and use PyTorch DDP in our code. However, we find that NVLink between the 8 pods is not working (all 8 pods are on the same node), so we want to know whether k8s-device-plugin can enable NVLink between the pods, or whether there is any other k8s plugin that can help us do that?
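For context, a minimal sketch of the kind of per-pod DDP job described above (illustrative, not taken from the issue). It assumes the kubeflow pytorch-operator injects the usual MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE environment variables into each pod, and that each pod sees exactly one GPU.

```python
# Minimal sketch of the per-pod DDP setup described above (illustrative, not
# from the issue). Assumes the kubeflow pytorch-operator injects MASTER_ADDR,
# MASTER_PORT, RANK and WORLD_SIZE, and that each pod sees exactly one GPU.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # With a single GPU visible per pod, the local device is always cuda:0.
    torch.cuda.set_device(0)

    # NCCL chooses its transport (NVLink/P2P, shared memory, or sockets)
    # based on what it can discover from inside the container.
    dist.init_process_group(backend="nccl")

    model = torch.nn.Linear(1024, 1024).cuda()
    ddp_model = DDP(model, device_ids=[0])

    x = torch.randn(32, 1024, device="cuda")
    ddp_model(x).sum().backward()  # gradients are all-reduced across the 8 pods

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```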


klueska commented Nov 23, 2022

The plugin doesn’t control that. So long as you have access to the GPUs and you have fabricmanager running on your host, they should be able to communicate using NVLink.
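One way to check this from wherever the job actually runs (a sketch, not part of the plugin): `nvidia-smi topo -m` prints the GPU interconnect matrix (NV# entries indicate NVLink), and torch can report whether peer access is possible between the devices it can see.

```python
# Sketch: check whether NVLink/P2P is visible from this container or host.
import subprocess
import torch

# GPU interconnect topology as seen from here (NV# entries mean NVLink).
print(subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True).stdout)

# Peer-to-peer capability between visible GPUs (only meaningful if more
# than one GPU is visible in this container).
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            print(f"GPU{i} -> GPU{j} peer access:",
                  torch.cuda.can_device_access_peer(i, j))
```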


pokerc commented Nov 28, 2022

> The plugin doesn’t control that. So long as you have access to the GPUs and you have fabricmanager running on your host, they should be able to communicate using NVLink.

With k8s-device-plugin, when we allocate a GPU to a pod, inside the pod we can only see the one GPU that has been allocated and can't see the other GPUs on the same host (using nvidia-smi). So I think the allocated GPU can't communicate with the other GPUs because there is no topology info in the pod. Is there any solution to add the topology info into the pod?
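For what it's worth, a sketch of the arrangement the answer above implies: the device plugin exposes exactly the GPUs a container requests, so a single pod that requests all 8 GPUs sees all of them (and their NVLink topology), whereas eight single-GPU pods each see only their own device. The example below uses the official `kubernetes` Python client; the image, names, and entrypoint are illustrative.

```python
# Illustrative sketch (not from the thread): request all 8 GPUs in one pod so
# that the full NVLink topology is visible inside that pod.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

container = client.V1Container(
    name="trainer",
    image="nvcr.io/nvidia/pytorch:22.10-py3",   # illustrative image tag
    command=["python", "train.py"],             # illustrative entrypoint
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "8"}          # all 8 GPUs visible in this pod
    ),
)

pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="ddp-all-gpus"),
    spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```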

@shaowei-su

Following thread - the device plugin seems to only expose a single GPU's info to a given pod, and thus pods can only communicate via NET/Socket/0.
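A sketch of how that can be confirmed: NCCL reports which transport it selects for each channel (P2P/NVLink, shared memory, or sockets, which is where lines like `via NET/Socket/0` come from) when NCCL_DEBUG is set to INFO before the process group is initialized.

```python
# Sketch: turn on NCCL's own logging to see which transport it selects.
# Run inside the job pods; requires the usual env:// rendezvous variables.
import os

os.environ["NCCL_DEBUG"] = "INFO"             # log ring/transport setup
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # optional: reduce log volume

import torch.distributed as dist

dist.init_process_group(backend="nccl")
# The NCCL log then shows, per channel, whether peers are reached via
# P2P (NVLink), SHM (shared memory), or NET/Socket.
```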

@lipovsek-aws

Hi, enroot/pyxis on Slurm seems to have a workaround that addresses this NCCL challenge - is there any plan to adopt it for k8s? Thanks!

pokerc closed this as completed Sep 6, 2023
@freelizhun

Hi, @pokerc. Have you solved this problem?
