# Option B: a PhD in tuning DeepSpeed

Reminder: for this setup, you will need two (and preferably more) GPUs. If you're an external student, you can get access to 2x T4 on Kaggle, but it requires phone confirmation (at the time of writing). Option B specifically needs a lot of RAM, or a lot of pain. If you're limited to Kaggle's free T4s and you're not into masochism, we recommend that you choose a different option.

Task description: the main objective of this task is to run memory-efficient training algorithms in your multi-GPU setup and compile results in a small report.

You will need to compare two setups:

- Small model: fine-tune a Bloom-560m model for text classification
- Large model: either the Bloom-3b or the Bloom-7b1 version on the same task (or choose any other model with 2~8 billion parameters)

Train both models using the Adam optimizer. You can write your own training code or follow the official tutorial. We recommend that you use the small model for debugging, since larger models may not fit without DeepSpeed tricks.
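For orientation, here is a minimal single-GPU sketch of the small-model setup. It assumes the Hugging Face `bigscience/bloom-560m` checkpoint, plain `torch.optim.Adam`, and a toy binary-classification dataset made up purely for illustration; swap in your real task and dataloader.

```python
# Minimal sketch: fine-tune Bloom-560m for binary text classification with Adam.
# The dataset below is a toy stand-in; replace it with your actual task.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id  # required for sequence classification
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

texts = ["great movie", "terrible movie"] * 64       # toy data for illustration only
labels = [1, 0] * 64
enc = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"],
                                  torch.tensor(labels)),
                    batch_size=8, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
model.train()
for input_ids, attention_mask, batch_labels in loader:
    optimizer.zero_grad()
    outputs = model(input_ids=input_ids.to(device),
                    attention_mask=attention_mask.to(device),
                    labels=batch_labels.to(device))
    outputs.loss.backward()
    optimizer.step()
```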

First, you will need to install DeepSpeed and integrate it into your training code, and make sure you can train with some basic DeepSpeed config (1 point).
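A hedged sketch of what that integration can look like, assuming the training loop above and an illustrative hand-written config dict (the values are examples, not a recommendation; launch with `deepspeed --num_gpus=2 train.py`):

```python
# Sketch: wrapping the model with DeepSpeed (assumes `model` and `loader` from above).
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # a "basic" starting point; vary this below
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

model_engine.train()
for input_ids, attention_mask, batch_labels in loader:
    outputs = model_engine(input_ids=input_ids.to(model_engine.device),
                           attention_mask=attention_mask.to(model_engine.device),
                           labels=batch_labels.to(model_engine.device))
    model_engine.backward(outputs.loss)  # replaces loss.backward()
    model_engine.step()                  # replaces optimizer.step() + zero_grad()
```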

Once you have the basic code running, you will need to answer three questions, for both the small and the large model:

1. (3 points) How does ZeRO-2 (optimizer & gradient sharding) compare in speed to ZeRO-3 (full model sharding)? The config sketch after this list shows the relevant `zero_optimization` settings.
   - Does this difference depend on the training batch size?
2. (3 points) How does single-GPU offloading compare to training without offloading?
   - What happens if you offload only the optimizer state vs. the model as well?
   - Does it change if you combine offloading with data parallelism (ZeRO)?
3. (3 points) How does DeepSpeed compare with native alternatives (e.g. PyTorch FullyShardedDataParallel)?
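For questions 1 and 2, the experiments mostly amount to varying the `zero_optimization` section of the config. A sketch of the relevant fragments (illustrative values; see the DeepSpeed config documentation for the full set of knobs):

```python
# Illustrative "zero_optimization" fragments for the questions above.
# Pass one of these dicts as ds_config["zero_optimization"].

zero2 = {"stage": 2}                      # shard optimizer state and gradients (ZeRO-2)

zero3 = {"stage": 3}                      # additionally shard model parameters (ZeRO-3)

zero2_offload_optimizer = {               # ZeRO-2 + optimizer state offloaded to CPU RAM
    "stage": 2,
    "offload_optimizer": {"device": "cpu"},
}

zero3_offload_all = {                     # ZeRO-3 + optimizer state and parameters on CPU
    "stage": 3,
    "offload_optimizer": {"device": "cpu"},
    "offload_param": {"device": "cpu"},
}
```

For the single-GPU offloading comparison, you can run the same script with `--num_gpus=1`.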

Each question is worth 3 points. "Answering" a question requires benchmark code, an example run, and saved statistics, e.g. total training time.
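One possible way to collect those statistics (a sketch only; the helper name, field names, and output path are made up here):

```python
# Sketch: time a training run and save the statistics needed for the report.
import json
import time

import torch

def benchmark(train_fn, tag, output_path="stats.json"):
    """Run train_fn() once, record wall-clock time and peak GPU memory, append to a JSON file."""
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    train_fn()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    record = {
        "run": tag,  # e.g. "bloom-560m_zero2_bs8"
        "total_training_time_s": elapsed,
        "peak_gpu_memory_gb": torch.cuda.max_memory_allocated() / 2**30,
    }
    try:
        with open(output_path) as f:
            stats = json.load(f)
    except FileNotFoundError:
        stats = []
    stats.append(record)
    with open(output_path, "w") as f:
        json.dump(stats, f, indent=2)
```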