update llama7b_sparse_quantized example #2322
Merged: 6 commits, Jun 13, 2024

examples/llama7b_sparse_quantized/README.md (84 changes: 58 additions & 26 deletions)

# Creating a Sparse Quantized Llama7b Model

This example uses SparseML and Compressed-Tensors to create a Llama2-7b model with a
2:4 sparsity pattern and W4A16 post-training quantization (PTQ).
The model is calibrated and trained with the ultrachat200k dataset.
At least 75GB of GPU memory is required to run this example.

Follow the steps below, or run the full example end to end with
`python examples/llama7b_sparse_quantized/llama7b_sparse_w4a16.py`.

## Step 1: Select a model, dataset, and recipe
In this step, we select which model to use as a baseline for sparsification, a dataset to
use for calibration and finetuning, and a recipe.

Models can reference a local directory, a model on the Hugging Face Hub, or a stub in the SparseZoo.
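
For illustration, a model reference might take any of the following forms (the Hub id and local path below are hypothetical placeholders; the SparseZoo stub is the one used in this example):

```python
# Illustrative model references; only the SparseZoo stub is used later in this example.
model_stub = "zoo:llama2-7b-ultrachat200k_llama2_pretrain-base"  # SparseZoo stub
# model_stub = "meta-llama/Llama-2-7b-hf"                        # Hugging Face Hub id (placeholder)
# model_stub = "./my_local_llama2_checkpoint"                    # local directory (placeholder)
```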

Datasets can come from a local directory in a compatible format or from the Hugging Face Hub.

Recipes are YAML files that describe how a model should be optimized during or after training.
The recipe used for this flow is located in [2:4_w4a16_recipe.yaml](./2:4_w4a16_recipe.yaml).
It contains instructions to prune the model with SparseGPT to 50% sparsity in a 2:4 pattern
(2 weights out of every group of 4 are masked to 0), run one epoch of finetuning on
ultrachat200k to recover any accuracy lost during sparsification, and then quantize all linear
weights to 4 bits channel-wise in one shot using GPTQ.

```python
import torch
from sparseml.transformers import SparseAutoModelForCausalLM

# load the dense pretrained model from the SparseZoo
model_stub = "zoo:llama2-7b-ultrachat200k_llama2_pretrain-base"
model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype=torch.bfloat16, device_map="auto"
)

# dataset and the splits used for calibration and finetuning
dataset = "ultrachat-200k"
splits = {"calibration": "train_gen[:5%]", "train": "train_gen"}

recipe = "2:4_w4a16_recipe.yaml"
```

## Step 2: Run sparsification using `apply`
The `apply` function runs the entire staged recipe against our model and dataset in a single
call, saving a checkpoint of the model after each stage.
The hardcoded kwargs may be altered based on each model's needs.
After running, the sparsified model will be saved to `output_llama7b_2:4_w4a16_channel`.

```python
from sparseml.transformers import apply

output_dir = "output_llama7b_2:4_w4a16_channel"

apply(
    # NOTE from the PR review: most of the arguments below configure the finetuning stage;
    # a separate README covering quantization only (no training) is planned. This example
    # intentionally shows the full sparsity -> finetuning -> quantization flow.
    model=model,
    dataset=dataset,
    recipe=recipe,
    bf16=False,  # use full precision for training
    output_dir=output_dir,
    splits=splits,
    max_seq_length=512,
    num_calibration_samples=512,
    num_train_epochs=0.5,
    logging_steps=500,
    save_steps=5000,
    gradient_checkpointing=True,
    learning_rate=0.0001,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
)
```
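
As an optional sanity check (not part of the original example), you can reload the saved output and confirm the 2:4 mask was applied. The sketch below assumes the standard Hugging Face Llama module layout:

```python
# Hypothetical sanity check: roughly 50% of the values in each pruned linear layer
# should be zero after the 2:4 sparsification stage.
import torch
from sparseml.transformers import SparseAutoModelForCausalLM

check_model = SparseAutoModelForCausalLM.from_pretrained(
    "output_llama7b_2:4_w4a16_channel", torch_dtype=torch.bfloat16
)
weight = check_model.model.layers[0].self_attn.q_proj.weight
print(f"q_proj sparsity: {(weight == 0).float().mean().item():.1%}")  # expect ~50%
```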

## Step 3: Compression

The resulting model will be uncompressed. To save a final compressed copy of the model,
run the following:

```python
import torch
from sparseml.transformers import SparseAutoModelForCausalLM

compressed_output_dir = "output_llama7b_2:4_w4a16_channel_compressed"

# reload the uncompressed output from Step 2 and re-save it in compressed form
model = SparseAutoModelForCausalLM.from_pretrained(output_dir, torch_dtype=torch.bfloat16)
model.save_pretrained(compressed_output_dir, save_compressed=True)
```
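
To see what compression saves on disk, the two checkpoint directories can be compared with a small standard-library helper (a sketch; `dir_size_gb` is a hypothetical helper, not part of SparseML):

```python
# Sketch: compare on-disk checkpoint sizes before and after compression.
from pathlib import Path

def dir_size_gb(path: str) -> float:
    """Total size of all files under `path`, in gigabytes."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 1e9

print(f"uncompressed: {dir_size_gb(output_dir):.1f} GB")
print(f"compressed:   {dir_size_gb(compressed_output_dir):.1f} GB")
```
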
### Custom Quantization
The current repo supports multiple quantization techniques configured using a recipe. Supported strategies are `tensor`, `group` and `channel`.
The above recipe (`2:4_w4a16_recipe.yaml`) uses channel-wise quantization specified by `strategy: "channel"` in its config group.
To quantize per tensor, change the strategy from `channel` to `tensor`.
To use group-size quantization, change the strategy from `channel` to `group` and specify the
group size, for example `group_size: 128`.
A group-size quantization example is shown in [2:4_w4a16_group-128_recipe.yaml](./2:4_w4a16_group-128_recipe.yaml).