Describe the bug
I'd like to observe whether there is any substantial effect of using different batch sizes. It makes sense to use the exact same sampling order that was used for Pythia. To do this, the idea is to fix the same total number of tokens for every batch size and increase or decrease train-iters accordingly.
Double-checking the sampling order with utils/batch_viewer.py from Pythia, it seems that changing train_micro_batch_size_per_gpu while keeping train-iters the same does not affect the sampling order. Modifying train-iters based on train_micro_batch_size_per_gpu to keep the total number of tokens the same for each run, however, results in a different ordering.
This will be an issue if we want to train on the same 300B tokens as the original Pythia runs: changing train-iters changes the ordering, while keeping it fixed and only changing train_micro_batch_size_per_gpu does not result in the same number of tokens.
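For reference, here is the token accounting I have in mind: a minimal sketch assuming Pythia's 2048 sequence length and a 300B-token target (the batch sizes are just examples).

```python
# Token accounting sketch (illustrative, not taken from the neox code):
#   total_tokens = train_iters * global_batch_size * seq_len
SEQ_LEN = 2048          # Pythia sequence length
TARGET_TOKENS = 300e9   # ~300B tokens, as in the original Pythia runs

def train_iters_for(global_batch_size: int,
                    target_tokens: float = TARGET_TOKENS,
                    seq_len: int = SEQ_LEN) -> int:
    """Number of optimizer steps needed to consume `target_tokens` tokens."""
    return round(target_tokens / (global_batch_size * seq_len))

for bs in (512, 1024):
    print(f"global batch {bs:5d} -> train-iters ≈ {train_iters_for(bs):,}")
# global batch   512 -> train-iters ≈ 286,102
# global batch  1024 -> train-iters ≈ 143,051
```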
To Reproduce
1. Run Pythia's utils/batch_viewer.py with utils/dummy_config.yml adjusted for each batch size. I only dumped the first 2 steps for bs512 and the first step for bs1024.
2. Detokenize the resulting npy files and compare the text directly. You can also skip this step and compare the token IDs directly; a sketch of that comparison follows below.
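A minimal sketch of the comparison, assuming batch_viewer.py has dumped each step's token IDs as one .npy array per step; the file names and shapes here are hypothetical, so adjust them to whatever your dumps look like.

```python
import numpy as np

# Hypothetical paths to the per-step token dumps from the two runs.
bs512_step0 = np.load("dump_bs512/step0.npy")    # assumed shape: (512, seq_len)
bs512_step1 = np.load("dump_bs512/step1.npy")
bs1024_step0 = np.load("dump_bs1024/step0.npy")  # assumed shape: (1024, seq_len)

# If the sampling order were preserved across batch sizes, the single bs1024
# step should contain exactly the first two bs512 steps, in order.
expected = np.concatenate([bs512_step0, bs512_step1], axis=0)
print("token IDs identical:", np.array_equal(expected, bs1024_step0))
```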
Expected behavior
Changing train_micro_batch_size_per_gpu while adjusting train-iters to keep the total number of tokens fixed should produce the exact same sampling order as the original Pythia run.
Proposed solution
Not yet sure what the direct solution is; this might be an issue in how the dataset is loaded based on batch size.
Environment (please complete the following information):
GPUs:
Configs:
Here's a hack that should get around this: keep train-iters unchanged but modify lr_decay_iters. This will cause the LR decay schedule to act as if the training run were shorter; then you can deliberately crash the run once it has trained for the desired number of tokens.
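For concreteness, a minimal sketch of that workaround, again assuming a 2048 sequence length and a 300B-token target; none of the keys below are official neox flags, they just name the numbers you would plug into the config and the step at which to kill the run.

```python
# Workaround sketch: leave train-iters at the baseline value so the sampling
# order is untouched, shrink only the LR schedule, and stop the run by hand
# once the token budget has been consumed.
SEQ_LEN = 2048
TARGET_TOKENS = 300e9
BASELINE_TRAIN_ITERS = 143_000  # Pythia's step count at global batch size 1024

def workaround_settings(global_batch_size: int) -> dict:
    # Steps actually needed to consume the token budget at this batch size.
    effective_iters = round(TARGET_TOKENS / (global_batch_size * SEQ_LEN))
    # Only applicable when the effective run is no longer than the untouched
    # train-iters, i.e. the new global batch is at least as large as the baseline.
    assert effective_iters <= BASELINE_TRAIN_ITERS, "batch too small for this hack"
    return {
        "train_iters": BASELINE_TRAIN_ITERS,  # unchanged -> sampling order preserved
        "lr_decay_iters": effective_iters,    # LR decays as if the run were shorter
        "kill_run_at_iter": effective_iters,  # crash/stop the run manually here
    }

print(workaround_settings(global_batch_size=2048))
# {'train_iters': 143000, 'lr_decay_iters': 71526, 'kill_run_at_iter': 71526}
```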