I'm attempting to run HSDP on 2 nodes, each with 2 A100s. I'm using a script similar to multinode_trainer.slurm.
Currently, I've set the following:
When I train my model with only FSDP (data_parallel_shard_degree=4), I'm able to run. However, when I enable HSDP by setting data_parallel_replicate_degree=2, I get the following error followed by a Python seg fault:
[0] misc/ibvwrap.cc:169 NCCL WARN Call to ibv_reg_mr_iova2 failed with error Cannot allocate memory
If I additionally set export NCCL_IB_DISABLE=1, then I am able to train. However, I've read online that this can slow down communication significantly. When I tried disabling IB for the FSDP-only run, training time doubled.
Is this an issue with my IB setup? Why might it only happen for HSDP, but not FSDP? Thanks!
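To make the FSDP vs. HSDP difference concrete, here is a minimal sketch of how 4 ranks might be split into shard and replicate groups. It is plain Python and does not call PyTorch or NCCL. The helper name hsdp_groups, the row-major (replicate, shard) layout, and the assumption that ranks 0-1 sit on the first node and ranks 2-3 on the second are all illustrative guesses, not taken from the trainer. It only shows that HSDP uses more process groups than FSDP alone; whether that relates to the ibv_reg_mr_iova2 failure is not established here.

```python
def hsdp_groups(world_size, replicate_degree):
    """Return (shard_groups, replicate_groups) for a 2D data-parallel mesh.

    Hypothetical helper: assumes a row-major mesh where the outer
    dimension is replication and the inner dimension is sharding.
    """
    if world_size % replicate_degree != 0:
        raise ValueError("world_size must be divisible by replicate_degree")
    shard_degree = world_size // replicate_degree

    # mesh[r][s] is the global rank at replica r, shard s.
    mesh = [
        [r * shard_degree + s for s in range(shard_degree)]
        for r in range(replicate_degree)
    ]

    # Ranks in the same row shard parameters among themselves.
    shard_groups = [list(row) for row in mesh]
    # Ranks in the same column hold replicas of the same shard.
    replicate_groups = [
        [mesh[r][s] for r in range(replicate_degree)]
        for s in range(shard_degree)
    ]
    return shard_groups, replicate_groups


# FSDP only: one shard group spanning all 4 ranks (both nodes).
fsdp_shard, fsdp_replicate = hsdp_groups(world_size=4, replicate_degree=1)

# HSDP: 2 replicas x 2 shards.
hsdp_shard, hsdp_replicate = hsdp_groups(world_size=4, replicate_degree=2)

print("FSDP shard groups:    ", fsdp_shard)
print("HSDP shard groups:    ", hsdp_shard)
print("HSDP replicate groups:", hsdp_replicate)
```

Under these assumptions, the FSDP-only run has a single 4-rank group crossing both nodes, while the HSDP run has two intra-node shard groups plus two cross-node replicate groups.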