Add Spinquant benchmark results to README
tobiasvanderwerff committed Oct 17, 2024
1 parent 78bcd31 commit 0eb02fd
Showing 1 changed file with 24 additions and 2 deletions: torchao/prototype/spinquant/README.md
@@ -4,8 +4,30 @@ Re-implementation of SpinQuant based on the official code implementation (https:

## Usage

- Using this implementation with CUDA requires installing the Fast Hadamard Transform CUDA package, which can be done as follows:
+ For optimal performance on CUDA GPUs, install the Fast Hadamard Transform package:

```shell
pip install git+https://github.com/Dao-AILab/fast-hadamard-transform.git
```
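To sanity-check the install, the snippet below is a minimal sketch (the tensor shape and scale are illustrative, not required values) that applies the package's `hadamard_transform` function to a CUDA tensor; it assumes a CUDA device is available.

```python
# Minimal sketch: verify the fast-hadamard-transform install on a CUDA device.
# The tensor shape and scale here are illustrative, not required values.
import math

import torch
from fast_hadamard_transform import hadamard_transform

x = torch.randn(4, 512, dtype=torch.float16, device="cuda")
# scale = 1/sqrt(dim) makes the transform orthonormal.
y = hadamard_transform(x, scale=1.0 / math.sqrt(x.shape[-1]))
print(y.shape)  # torch.Size([4, 512])
```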

## Performance

See https://github.com/pytorch/ao/pull/983 for Wikitext benchmark results.

Without `torch.compile`:

| Configuration | Average tokens/sec | Average Bandwidth (GB/s) | Peak Memory Usage (GB) | Model Size (GB) |
|----------------|--------------------|--------------------------|------------------------|-----------------|
| Baseline | 27.33 | 361.21 | 13.62 | 13.21 |
| Spinquant (R4) | 23.01 | 304.10 | 14.24 | 13.22 |

With `torch.compile`:

| Configuration | Average tokens/sec | Average Bandwidth (GB/s) | Peak Memory Usage (GB) | Model Size (GB) |
|----------------------|--------------------|--------------------------|------------------------|-----------------|
| Baseline | 114.08 | 1507.58 | 13.88 | 13.21 |
| Spinquant (R4) | 109.59 | 1448.61 | 13.72 | 13.22 |
| Spinquant (R1+R2+R4) | 109.64 | 1449.28 | 14.90 | 13.22 |


NB: R1 and R2 are fused into the linear weights before inference, so they are not expected to add overhead at inference time.
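For intuition, here is a small, self-contained sketch (not the torchao implementation) of why a fused rotation is free at inference time: an orthogonal rotation `R` folded into the weight ahead of time cancels against rotated activations, so the forward pass remains a single matmul.

```python
# Simplified illustration (not the torchao code): folding an orthogonal
# rotation R into a linear weight offline leaves the layer output unchanged,
# so no extra matmul is needed at inference time.
import torch

torch.manual_seed(0)
d_in, d_out = 64, 32
x = torch.randn(8, d_in, dtype=torch.float64)
W = torch.randn(d_out, d_in, dtype=torch.float64)  # nn.Linear-style weight (out, in)

# Random orthogonal rotation: the Q factor of a QR decomposition.
R = torch.linalg.qr(torch.randn(d_in, d_in, dtype=torch.float64)).Q

y_ref = x @ W.T              # original layer output
W_fused = W @ R              # rotation fused into the weight ahead of time
y_rot = (x @ R) @ W_fused.T  # rotated activations x fused weight

print(torch.allclose(y_ref, y_rot))  # True: R cancels (R @ R.T == I)
```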
