PyTorch Distributed + NVIDIA DGX Support #4393
Unanswered
dhfromkorea asked this question in Q&A (1 comment, 2 replies)
-
Saturn Cloud has done some work in this space. Maybe that is a good starting point? https://www.saturncloud.io/s/combining-dask-and-pytorch-for-better-faster-transfer-learning/ cc @skirmer @jameslamb (who may have more advice here)
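If I understand that post correctly, the general pattern is to launch the torch.distributed process group from tasks pinned to individual Dask workers, one per GPU. Here is a minimal sketch of that idea (not their exact code; the scheduler address, master hostname, and port are placeholders, and the real training loop is elided):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from dask.distributed import Client

def train(rank, world_size, master_addr, master_port=23456):
    # Each Dask worker joins the same torch.distributed process group.
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{master_addr}:{master_port}",
        rank=rank,
        world_size=world_size,
    )
    # Assumes one GPU visible per Dask worker (e.g. via CUDA_VISIBLE_DEVICES).
    device = torch.device("cuda", 0)
    model = DDP(torch.nn.Linear(10, 1).to(device), device_ids=[0])
    # ... real training loop goes here ...
    dist.destroy_process_group()
    return rank

client = Client("tcp://scheduler-address:8786")      # placeholder scheduler address
workers = list(client.scheduler_info()["workers"])   # one entry per Dask worker
futures = [
    client.submit(train, rank, len(workers), "master-node-hostname",
                  workers=[addr], pure=False)         # pin one rank to each worker
    for rank, addr in enumerate(workers)
]
client.gather(futures)
```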
-
Hello Dask community, thank you so much for making such an inspiring project.
I am trying to understand the feasibility of using Dask to manage PyTorch distributed training across multiple GPUs and nodes on an NVIDIA DGX-1 system. The plan is to use Prefect to define our ML workflows and let Dask execute and monitor the jobs running PyTorch distributed code.
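Concretely, the split I have in mind looks roughly like the sketch below: Prefect defines and monitors the workflow, while the tasks execute on an existing Dask scheduler (the address is a placeholder, and Prefect's executor import path has moved between versions):

```python
from prefect import Flow, task
from prefect.executors import DaskExecutor  # `prefect.engine.executors` in Prefect < 0.14

@task
def preprocess():
    # Placeholder: produce whatever the training job needs.
    return "path/to/prepared-data"

@task
def launch_distributed_training(data_path):
    # Placeholder: this task would kick off the PyTorch distributed job
    # (process group setup, DDP training loop, etc.).
    pass

with Flow("pytorch-distributed-on-dgx") as flow:
    launch_distributed_training(preprocess())

# Prefect defines and monitors the workflow; an existing Dask scheduler
# (placeholder address) executes the tasks.
flow.run(executor=DaskExecutor(address="tcp://scheduler-address:8786"))
```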
Here are my newbie questions:
1. Would handling a PyTorch distributed workload be feasible with Dask as the job scheduler, SLURM as the resource/job manager, and NVIDIA DGX-1 machines as the underlying compute nodes? (A rough sketch of the cluster setup I have in mind is at the end of this post.)
2. If it is feasible only in a limited capacity, what concrete challenges should I expect, and is there any advice on how to overcome them?
3. Could you share the most relevant, state-of-the-art example of PyTorch distributed being used with Dask? I see there is a PyTorch integration under https://ml.dask.org/index.html, but I am not sure whether dask-ml is the relevant piece, or whether Skorch is required to use PyTorch distributed with Dask.
Any advice or pointers would be really great. For context, I read the 2019 blog post about UCX integration (which seemed to suggest that multi-node workloads on DGX were not quite there yet) and a recent article from the NVIDIA team that seems to suggest there may now be a viable path forward.
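For question 1, here is a rough sketch of the kind of Dask-on-SLURM setup I am imagining, using dask-jobqueue's SLURMCluster to request DGX nodes (the partition name, resource sizes, and GPU directive are placeholders, and the keyword for extra SLURM directives differs between dask-jobqueue versions):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Partition name, resources, and walltime are placeholders; a DGX-1 has
# 8 GPUs, so one Dask worker process per GPU is a natural split.
cluster = SLURMCluster(
    queue="dgx",                 # SLURM partition (placeholder)
    cores=40,
    memory="512GB",
    processes=8,                 # one Dask worker per GPU
    walltime="04:00:00",
    job_extra_directives=["--gres=gpu:8"],  # `job_extra` in older dask-jobqueue releases
)
cluster.scale(jobs=2)            # e.g. request two DGX-1 nodes
client = Client(cluster)
```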