Week 6: Training large models

Practice / homework

In this homework, you can choose one of 3 tasks to complete:

  • Option A (./homework_a.ipynb): memory-efficient training and inference - recommended if you have a single GPU
  • Option B (./homework_b.md): benchmarking ZeRO implementations - requires at least two GPUs and some RAM
  • Option C (./homework_c.md): write your own model parallelism - requires at least two GPUs

You can do more than one option, and we'll award bonus points for that, but doing two options will yield (much) less than 2x the points. If you're an enrolled student, please submit only the files that you changed (i.e. do not submit homework_b.md if you did option A or C).

We recommend options B and C if you have access to a computer with at least two GPUs. YSDA and HSE students can use either DataSphere or one of the GPU servers available for this course (recommended). If you are an online student, you can try registering for Kaggle kernels (they let you run on 2x T4) in a Jupyter-like interface; that said, implementing assignments B and C on Kaggle is more difficult than intended. For non-enrolled online students, we recommend option A unless you have access to some other multi-GPU hardware or are intentionally masochistic.

References

During the in-class practice, we also went over several PyTorch code examples that can come in handy when training large models:

Automatic tensor parallelism:

%pip install tensor_parallel
import tensor_parallel as tp

model = create_a_regular_pytorch_model()
# shard the model's weights across the listed devices; forward/backward then run on all of them
model = tp.tensor_parallel(model, ['cuda:0', 'cuda:1'])
outputs_as_usual = model(input_as_usual)

Note: tensor_parallel is one of the simplest ways to do this kind of distributed training, but not the fastest one. If you want to squeeze every last bit of performance, use DeepSpeed or similar specialized frameworks (see ./homework_b.md).
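
If you want a feel for what the DeepSpeed side of this looks like, here is a minimal, hedged sketch of wrapping the same model with ZeRO stage 2. The config values are placeholders, create_a_regular_pytorch_model / input_as_usual are the same stand-ins as above, and the exact config keys and initialize arguments may differ between DeepSpeed versions - treat this as a starting point, not a reference implementation:

# assumes `pip install deepspeed` and launching via the deepspeed launcher, e.g. `deepspeed train.py`
import deepspeed

ds_config = {
    "train_batch_size": 16,                          # placeholder value
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},               # stages 1/2/3 shard more and more optimizer/model state
}

model = create_a_regular_pytorch_model()             # same stand-in as above
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)

loss = model_engine(input_as_usual).norm()           # forward pass as usual (toy loss)
model_engine.backward(loss)                          # the engine handles gradient sharding/accumulation
model_engine.step()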

Gradient checkpointing:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint, checkpoint_sequential

class Checkpoint(nn.Sequential):
  # a block whose activations are recomputed during backward instead of being stored during forward
  def forward(self, *inputs):
    return checkpoint(super().forward, *inputs)

class Echo(nn.Module):
  def __init__(self, msg: str):
    super().__init__()
    self.msg = msg  # print this message during forward (for debugging)
  def forward(self, x):
    print("forward", self.msg)
    return x

model = nn.Sequential(
    Checkpoint(nn.Linear(1000, 1000), nn.ReLU(), Echo("layer1 done"),
               nn.Linear(1000, 1000), nn.ReLU(), Echo("layer2 done")),
    Checkpoint(nn.Linear(1000, 1000), nn.ReLU(), Echo("layer3 done"),
               nn.Linear(1000, 1000), nn.ReLU(), Echo("layer4 done")),
    nn.Linear(1000, 1000), nn.ReLU(), Echo("layer5 done"),
)

inputs = torch.randn(16, 1000, requires_grad=True)
# note: inputs must have requires_grad=True because checkpointing needs at least one input that requires grad to backprop through the recomputed blocks
outputs = model(inputs)
outputs.norm().backward()  # Echo layers will print in the following order: 1 2 3 4 5 3 4 1 2
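
For plain nn.Sequential models, the checkpoint_sequential helper imported above achieves the same effect without a custom wrapper class: it splits the sequence into a given number of segments and recomputes them during backward. A minimal sketch (the layer sizes and segment count here are arbitrary):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

layers = []
for _ in range(5):
    layers += [nn.Linear(1000, 1000), nn.ReLU()]
flat_model = nn.Sequential(*layers)                      # 10 layers, no manual Checkpoint blocks

inputs = torch.randn(16, 1000, requires_grad=True)
outputs = checkpoint_sequential(flat_model, 2, inputs)   # 2 segments, recomputed during backward
outputs.norm().backward()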