# Creating a Quantized Llama Model in One Shot

Quantizing a model to a lower precision can reduce memory usage and speed up inference. This example demonstrates how to use the SparseML API to quantize a Llama model from 16 bits to 4 bits and save it in the compressed-tensors format for inference with vLLM.

## Step 1: Select a model and dataset

For this example, we will use a TinyLlama model and the Open Platypus dataset; however, these can be swapped out for any Hugging Face-compatible model and dataset.
```python
model = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
dataset = "open_platypus"
```
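
As a sketch of what such a swap might look like, the values below use a larger Llama checkpoint and a different dataset. These specific names are assumptions for illustration: any Hugging Face model ID should work, but the dataset must be one SparseML recognizes.

```python
# Hypothetical alternative choices (not taken from this example):
# the model can be any Hugging Face causal-LM checkpoint, and the
# dataset name is assumed to be registered with SparseML.
model = "meta-llama/Llama-2-7b-hf"
dataset = "wikitext"
```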

## Step 2: Configure a `GPTQModifier`

Modifiers in SparseML are used to apply optimizations to models. In this example, we use a `GPTQModifier` to apply the GPTQ algorithm to our model, targeting all `Linear` layers for 4-bit weight quantization. These options may be swapped out for any valid `QuantizationScheme`.
```python
from sparseml.modifiers.quantization.gptq import GPTQModifier

# Apply GPTQ to every Linear layer: 4-bit weights, 16-bit activations
gptq = GPTQModifier(
    targets="Linear",
    scheme="W4A16"
)
```
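
To illustrate swapping in a different scheme, a minimal variation is sketched below. The `W8A16` preset name and the `ignore` argument are assumptions based on common SparseML quantization options, not taken from this example; check the SparseML documentation for the schemes your version supports.

```python
from sparseml.modifiers.quantization.gptq import GPTQModifier

# Hypothetical variation: 8-bit weights with 16-bit activations,
# skipping the output head. The preset name and `ignore` argument
# are assumptions; verify against the SparseML docs.
gptq_w8 = GPTQModifier(
    targets="Linear",
    scheme="W8A16",
    ignore=["lm_head"],
)
```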

## Step 3: One-shot compression

The `oneshot` API applies the created modifier to the target model and dataset. Setting `save_compressed` to `True` runs the model through compressed-tensors compression after the quantization is completed.
```python
from sparseml.transformers import oneshot

oneshot(
    model=model,
    dataset=dataset,
    recipe=gptq,
    save_compressed=True,
    output_dir="llama-compressed-example",
    overwrite_output_dir=True,
    # Calibration settings: 256 samples of up to 256 tokens each
    max_seq_length=256,
    num_calibration_samples=256,
)
```
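
Since the intro mentions inference with vLLM, here is a minimal sketch of loading the saved model. It assumes a vLLM build with compressed-tensors support; the prompt and sampling settings are illustrative only.

```python
from vllm import LLM, SamplingParams

# Load the compressed model saved by oneshot above; assumes a vLLM
# version that can read the compressed-tensors format.
llm = LLM(model="llama-compressed-example")

# Generate a short completion to sanity-check the quantized model.
outputs = llm.generate(
    ["Quantization reduces model size by"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```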