PyTorch Distributed + NVIDIA DGX Support #4393
Unanswered
dhfromkorea asked this question in Q&A (1 comment, 2 replies)
-
Saturn Cloud has done some work in this space. Maybe that is a good starting point? https://www.saturncloud.io/s/combining-dask-and-pytorch-for-better-faster-transfer-learning/ cc @skirmer @jameslamb (who may have more advice here)
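If I understand that post correctly, the general pattern is to launch the torch.distributed process group from tasks pinned to individual Dask workers, one per GPU. Here is a minimal sketch of that idea (not their exact code; the scheduler address, master hostname, and port are placeholders, and the real training loop is elided):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from dask.distributed import Client

def train(rank, world_size, master_addr, master_port=23456):
    # Each Dask worker joins the same torch.distributed process group.
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{master_addr}:{master_port}",
        rank=rank,
        world_size=world_size,
    )
    # Assumes one GPU visible per Dask worker (e.g. via CUDA_VISIBLE_DEVICES).
    device = torch.device("cuda", 0)
    model = DDP(torch.nn.Linear(10, 1).to(device), device_ids=[0])
    # ... real training loop goes here ...
    dist.destroy_process_group()
    return rank

client = Client("tcp://scheduler-address:8786")      # placeholder scheduler address
workers = list(client.scheduler_info()["workers"])   # one entry per Dask worker
futures = [
    client.submit(train, rank, len(workers), "master-node-hostname",
                  workers=[addr], pure=False)         # pin one rank to each worker
    for rank, addr in enumerate(workers)
]
client.gather(futures)
```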
-
Hello Dask community, thank you so much for making such an inspiring project.
I am trying to understand the feasibility of using Dask to manage PyTorch distributed training across multiple GPUs and nodes on an NVIDIA DGX-1 system. The plan is to use Prefect to define our ML workflows and let Dask execute and monitor the jobs running PyTorch distributed code.
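Concretely, the split I have in mind looks roughly like the sketch below: Prefect defines and monitors the workflow, while the tasks execute on an existing Dask scheduler (the address is a placeholder, and Prefect's executor import path has moved between versions):

```python
from prefect import Flow, task
from prefect.executors import DaskExecutor  # `prefect.engine.executors` in Prefect < 0.14

@task
def preprocess():
    # Placeholder: produce whatever the training job needs.
    return "path/to/prepared-data"

@task
def launch_distributed_training(data_path):
    # Placeholder: this task would kick off the PyTorch distributed job
    # (process group setup, DDP training loop, etc.).
    pass

with Flow("pytorch-distributed-on-dgx") as flow:
    launch_distributed_training(preprocess())

# Prefect defines and monitors the workflow; an existing Dask scheduler
# (placeholder address) executes the tasks.
flow.run(executor=DaskExecutor(address="tcp://scheduler-address:8786"))
```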
Here are my newbie questions:
1. Would handling a PyTorch distributed workload be feasible with Dask as the job scheduler, SLURM as the resource/job manager, and NVIDIA DGX-1 machines as the underlying compute nodes? (A rough sketch of the cluster setup I have in mind is at the end of this post.)
2. If it is feasible only in a limited capacity, what concrete challenges should I expect, and is there any advice on how to overcome them?
3. Could you share the most relevant, state-of-the-art example of PyTorch distributed being used with Dask? I see there is a PyTorch integration under https://ml.dask.org/index.html, but I am not sure whether dask-ml is the relevant piece, or whether Skorch is required to use PyTorch distributed with Dask.
Any advice or pointers would be really great. For context, I read the 2019 blog post about UCX integration (which seemed to suggest that multi-node workloads on DGX were not quite there yet) and a recent article from the NVIDIA team that seems to suggest there may now be a viable path forward.
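For question 1, here is a rough sketch of the kind of Dask-on-SLURM setup I am imagining, using dask-jobqueue's SLURMCluster to request DGX nodes (the partition name, resource sizes, and GPU directive are placeholders, and the keyword for extra SLURM directives differs between dask-jobqueue versions):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Partition name, resources, and walltime are placeholders; a DGX-1 has
# 8 GPUs, so one Dask worker process per GPU is a natural split.
cluster = SLURMCluster(
    queue="dgx",                 # SLURM partition (placeholder)
    cores=40,
    memory="512GB",
    processes=8,                 # one Dask worker per GPU
    walltime="04:00:00",
    job_extra_directives=["--gres=gpu:8"],  # `job_extra` in older dask-jobqueue releases
)
cluster.scale(jobs=2)            # e.g. request two DGX-1 nodes
client = Client(cluster)
```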