Automatic limiting of local batchsize bounds after OOM #90
This PR updates the upper limit on the local batchsize when training hits an OOM. The new upper limit is constrained to `LOCAL_BSZ_CUTOFF_PCT` of the current local batchsize. After setting the limit we have to take a quick checkpoint and restart, because a simple retry doesn't work: the PyTorch GPU memory allocator caches allocations, so merely reducing the current batchsize has little impact on the total allocated memory (plus caches) and leads to subsequent OOMs.
A new decorator `retry` is introduced to catch the OOM exception, since it is not visible from inside the dataloader. The train function should be decorated with `retry`, which retries the training loop (from the position saved before the restart) after limiting the batchsize of the current dataloader.

Fixes #40
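A rough sketch of how such a decorator could look, reusing the hypothetical `limit_local_bsz` helper from the sketch above and placeholder `save_checkpoint` / `request_restart` helpers; the actual checkpoint-restart mechanism in this PR may differ:

```python
import functools

def save_checkpoint():
    """Placeholder: snapshot model/optimizer/dataloader position to disk."""
    ...

def request_restart():
    """Placeholder: ask the launcher to restart this worker process."""
    ...

def retry(train_fn):
    """Sketch of the retry idea: the CUDA OOM surfaces at this level rather than
    inside the dataloader, so this is where the batchsize bound is lowered and a
    checkpoint-restart is triggered."""
    @functools.wraps(train_fn)
    def wrapper(dataloader, *args, **kwargs):
        try:
            return train_fn(dataloader, *args, **kwargs)
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise
            limit_local_bsz(dataloader)  # clamp the upper bound (see sketch above)
            save_checkpoint()            # a simple in-process retry is not enough:
            request_restart()            # cached GPU memory is only reclaimed by a
                                         # restart; training then resumes from the
                                         # checkpointed position with the new bound
    return wrapper

@retry
def train(dataloader):
    for batch in dataloader:
        ...  # forward / backward / optimizer step
```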