Automatic limiting of local batchsize bounds after OOM #90
This PR updates the upper limit on the local batchsize when training hits an OOM. The new upper limit is constrained to `LOCAL_BSZ_CUTOFF_PCT` of the current local batchsize. After setting the limit we have to take a quick checkpoint and restart, because a simple retry doesn't work: the PyTorch GPU memory allocator caches allocations, so merely reducing the current batchsize has little impact on the total allocated memory (plus caches) and leads to subsequent OOMs.
A new decorator `retry` is introduced to catch the OOM exception, since it is not visible from inside the dataloader. The train function should be decorated with `retry`, which retries the training loop (from the position saved before the restart) after limiting the batchsize of the current dataloader.

Fixes #40
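A rough sketch of how such a decorator could look, reusing the hypothetical `limit_local_bsz` helper from the sketch above and placeholder `save_checkpoint` / `request_restart` helpers; the actual checkpoint-restart mechanism in this PR may differ:

```python
import functools

def save_checkpoint():
    """Placeholder: snapshot model/optimizer/dataloader position to disk."""
    ...

def request_restart():
    """Placeholder: ask the launcher to restart this worker process."""
    ...

def retry(train_fn):
    """Sketch of the retry idea: the CUDA OOM surfaces at this level rather than
    inside the dataloader, so this is where the batchsize bound is lowered and a
    checkpoint-restart is triggered."""
    @functools.wraps(train_fn)
    def wrapper(dataloader, *args, **kwargs):
        try:
            return train_fn(dataloader, *args, **kwargs)
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise
            limit_local_bsz(dataloader)  # clamp the upper bound (see sketch above)
            save_checkpoint()            # a simple in-process retry is not enough:
            request_restart()            # cached GPU memory is only reclaimed by a
                                         # restart; training then resumes from the
                                         # checkpointed position with the new bound
    return wrapper

@retry
def train(dataloader):
    for batch in dataloader:
        ...  # forward / backward / optimizer step
```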