CUDA OOM when saving checkpoint (in consolidate_state_dict()) using OSS #973

crowsonkb · 2022-04-20T17:39:09Z

I am experiencing CUDA out of memory crashes when consolidating my optimizer state dict before saving it. I am training on 32 40GB A100s, four nodes with eight GPUs each, using PyTorch Lightning's 'ddp_sharded' strategy, which is OSS. I get the OOM crash in the middle of running consolidate_state_dict(). I have tried adding del statements, gc.collect() and torch.cuda.empty_cache() inside the loop to no avail. I am using a custom optimizer class, a modified AdamW that also saves an exponential moving average of the weights, and I need optimizer state sharding because the extra memory overhead for the EMA weights is so onerous. Here is the custom optimizer code: https://gist.github.com/crowsonkb/ea0ed1f6e88594046c72735f3cef1d05. I don't understand how I am running out of GPU memory partway through consolidate_state_dict() (I put in print statements and it got through 27 of 32 ranks) since it moves the tensors to CPU after each broadcast. I am using NCCL so it has to broadcast on GPU but it copies to CPU right afterwards.

Thank you,
Katherine Crowson

The text was updated successfully, but these errors were encountered:

crowsonkb · 2022-04-20T17:43:19Z

My fairscale version is 0.4.6, my PyTorch version is 1.11.0+cu113, and my PyTorch Lightning version is 1.6.1.

crowsonkb · 2022-04-21T18:32:18Z

It is the same as this issue AFAICT: huggingface/transformers#14542

crowsonkb · 2022-04-21T18:47:11Z

Except that I can't fix it by setting force_broadcast_object=True when I create the optimizer because it just hangs instead.

min-xu-ai · 2022-04-25T14:36:13Z

Do you have a full trace back for the OOM crash? Pasting it in a gist is fine.

aced125 · 2022-06-01T18:35:11Z

Getting this as well

aced125 · 2022-06-01T20:19:48Z

I fixed this by enabling force_broadcast_object=True in the fairscale.optim.OSS initialization. On V100s

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CUDA OOM when saving checkpoint (in consolidate_state_dict()) using OSS #973

CUDA OOM when saving checkpoint (in consolidate_state_dict()) using OSS #973

crowsonkb commented Apr 20, 2022

crowsonkb commented Apr 20, 2022

crowsonkb commented Apr 21, 2022

crowsonkb commented Apr 21, 2022

min-xu-ai commented Apr 25, 2022

aced125 commented Jun 1, 2022

aced125 commented Jun 1, 2022

CUDA OOM when saving checkpoint (in consolidate_state_dict()) using OSS #973

CUDA OOM when saving checkpoint (in consolidate_state_dict()) using OSS #973

Comments

crowsonkb commented Apr 20, 2022

crowsonkb commented Apr 20, 2022

crowsonkb commented Apr 21, 2022

crowsonkb commented Apr 21, 2022

min-xu-ai commented Apr 25, 2022

aced125 commented Jun 1, 2022

aced125 commented Jun 1, 2022