Describe the bug
I'd like to observe whether there is any substantial effect of using different batch sizes. It makes sense to use the exact same sampling order that was used for Pythia. To do this, the idea is to fix the same total number of tokens for every batch size and increase or decrease train-iters accordingly.
Double-checking the sampling order with utils/batch_viewer.py from Pythia, it seems that changing train_micro_batch_size_per_gpu while keeping train-iters the same does not affect the sampling order. Modifying train-iters based on train_micro_batch_size_per_gpu to keep the total number of tokens the same for each run, however, results in a different ordering.
This will be an issue if we want to train on the same 300B tokens as the original Pythia runs: changing train-iters changes the ordering, while keeping it fixed and only changing train_micro_batch_size_per_gpu does not result in the same number of tokens.
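For reference, here is the token accounting I have in mind: a minimal sketch assuming Pythia's 2048 sequence length and a 300B-token target (the batch sizes are just examples).

```python
# Token accounting sketch (illustrative, not taken from the neox code):
#   total_tokens = train_iters * global_batch_size * seq_len
SEQ_LEN = 2048          # Pythia sequence length
TARGET_TOKENS = 300e9   # ~300B tokens, as in the original Pythia runs

def train_iters_for(global_batch_size: int,
                    target_tokens: float = TARGET_TOKENS,
                    seq_len: int = SEQ_LEN) -> int:
    """Number of optimizer steps needed to consume `target_tokens` tokens."""
    return round(target_tokens / (global_batch_size * seq_len))

for bs in (512, 1024):
    print(f"global batch {bs:5d} -> train-iters ≈ {train_iters_for(bs):,}")
# global batch   512 -> train-iters ≈ 286,102
# global batch  1024 -> train-iters ≈ 143,051
```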
To Reproduce
1. Run Pythia's utils/batch_viewer.py with utils/dummy_config.yml adjusted for each batch size. I only dumped the first 2 steps for bs512 and the first step for bs1024.
2. Detokenize the resulting npy files and compare the text directly. You can also skip this step and compare the token IDs directly; a sketch of that comparison follows below.
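A minimal sketch of the comparison, assuming batch_viewer.py has dumped each step's token IDs as one .npy array per step; the file names and shapes here are hypothetical, so adjust them to whatever your dumps look like.

```python
import numpy as np

# Hypothetical paths to the per-step token dumps from the two runs.
bs512_step0 = np.load("dump_bs512/step0.npy")    # assumed shape: (512, seq_len)
bs512_step1 = np.load("dump_bs512/step1.npy")
bs1024_step0 = np.load("dump_bs1024/step0.npy")  # assumed shape: (1024, seq_len)

# If the sampling order were preserved across batch sizes, the single bs1024
# step should contain exactly the first two bs512 steps, in order.
expected = np.concatenate([bs512_step0, bs512_step1], axis=0)
print("token IDs identical:", np.array_equal(expected, bs1024_step0))
```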
Expected behavior
Changing train_micro_batch_size_per_gpu while adjusting train-iters to keep the total number of tokens fixed should produce the exact same sampling order as the original Pythia run.
Proposed solution
Not yet sure what the direct solution is; this might be an issue in how the dataset is loaded based on batch size.
Environment (please complete the following information):
GPUs:
Configs:
Here's a hack that should get around this: keep train-iters unchanged but modify lr_decay_iters. This will cause the LR decay schedule to act as if the training run were shorter; then you can deliberately crash the run once it has trained for the desired number of tokens.
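For concreteness, a minimal sketch of that workaround, again assuming a 2048 sequence length and a 300B-token target; none of the keys below are official neox flags, they just name the numbers you would plug into the config and the step at which to kill the run.

```python
# Workaround sketch: leave train-iters at the baseline value so the sampling
# order is untouched, shrink only the LR schedule, and stop the run by hand
# once the token budget has been consumed.
SEQ_LEN = 2048
TARGET_TOKENS = 300e9
BASELINE_TRAIN_ITERS = 143_000  # Pythia's step count at global batch size 1024

def workaround_settings(global_batch_size: int) -> dict:
    # Steps actually needed to consume the token budget at this batch size.
    effective_iters = round(TARGET_TOKENS / (global_batch_size * SEQ_LEN))
    # Only applicable when the effective run is no longer than the untouched
    # train-iters, i.e. the new global batch is at least as large as the baseline.
    assert effective_iters <= BASELINE_TRAIN_ITERS, "batch too small for this hack"
    return {
        "train_iters": BASELINE_TRAIN_ITERS,  # unchanged -> sampling order preserved
        "lr_decay_iters": effective_iters,    # LR decays as if the run were shorter
        "kill_run_at_iter": effective_iters,  # crash/stop the run manually here
    }

print(workaround_settings(global_batch_size=2048))
# {'train_iters': 143000, 'lr_decay_iters': 71526, 'kill_run_at_iter': 71526}
```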