[WIP] SmoothQuant using tensor subclassing #1030

Draft · wants to merge 38 commits into main
Conversation

@Xia-Weiwen (Collaborator) commented Oct 8, 2024

Still WIP

The implementation of SmoothQuant with tensor subclassing (AffineQuantizedTensor) is similar to that of AWQ, with the following differences:

  • SmoothQuant supports both static and dynamic quantization of activations, while AWQ only uses dynamic quantization
  • Matmul is computed in int8 instead of floating point (at least at the op level)
  • The smoothing factor is calculated differently from the equalization scales of AWQ (see the sketch below)
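
For reference, a minimal sketch of how a SmoothQuant-style smoothing factor can be derived from per-channel activation and weight maxima (the helper name and the default alpha are illustrative, not necessarily this PR's API):

import torch

def compute_smoothing_factor(act_abs_max: torch.Tensor,
                             weight: torch.Tensor,
                             alpha: float = 0.5) -> torch.Tensor:
    # Per input channel j: s_j = max|X_j|**alpha / max|W_j|**(1 - alpha).
    # Activations are divided by s and weights are multiplied by s,
    # shifting quantization difficulty from activations to weights.
    w_abs_max = weight.abs().amax(dim=0)  # weight shape: (out_features, in_features)
    eps = torch.finfo(weight.dtype).eps
    return (act_abs_max.clamp(min=eps).pow(alpha)
            / w_abs_max.clamp(min=eps).pow(1.0 - alpha))

AWQ, roughly speaking, instead searches for per-channel equalization scales that minimize the output error of the quantized layer, which is why the two flows compute these factors differently.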

pytorch-bot commented Oct 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1030

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit fa1144c with merge base d4b2f33, one new job failure was reported.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Oct 8, 2024
return insert_subclass


def save_smooth_quant_recipe(model: torch.nn.Module, save_path: str) -> Dict[str, torch.Tensor]:
Contributor:

Do we need this? Or is just saving the state_dict of the observed model enough?

Collaborator Author:

We want to have an API to modify (tune) quantization parameters, i.e. the recipe here. Do you have any concern about adding this API?

Contributor:

So the state_dict is supposed to be used by other APIs to tune quantization parameters? I think that's fine if you have this use case in mind. Is the model with SmoothQuantObservedLinear not serializable by itself?

Collaborator Author:

SmoothQuantObservedLinear is serializable. However, a recipe is more flexible for tuning parameters. Thanks.
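
For context, a rough usage sketch of the recipe workflow being discussed (the import path, the file name, and the load_smooth_quant_recipe counterpart are assumptions for illustration, not necessarily the final API):

# Hypothetical usage, assuming `observed_model` has already been calibrated
# with SmoothQuantObservedLinear modules inserted.
from torchao.prototype.smoothquant import save_smooth_quant_recipe  # import path assumed

recipe = save_smooth_quant_recipe(observed_model, "smoothquant_recipe.json")
# `recipe` is a Dict[str, torch.Tensor], presumably keyed by layer name; its
# entries (e.g. smoothing factors, activation scales) can be edited offline and
# then re-applied with a matching, hypothetical load_smooth_quant_recipe(model, path).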

@Xia-Weiwen (Collaborator Author) commented:
Hi @jerryzh168, I added a new tensor subclass LinearActivationScaleQuantizedTensor to support x -> x/scale -> quantize x for torch.compile.

If I use LinearActivationQuantizedTensor, the x/scale is done outside the class (by input_quant_func) and there is a Dynamo error about the scale during torch.compile. I guess it's because the scale tensors are not in the graph in this case. Putting the scale in the weight tensor solves the problem. And WeightTensorWithLinearActivationScaleMetadata does not quantize the activation.

Do you have any concerns about adding this new class? Thanks.

@jerryzh168 (Contributor) commented Oct 16, 2024

Hi @jerryzh168, I added a new tensor subclass LinearActivationScaleQuantizedTensor to support x -> x/scale -> quantize x for torch.compile.

If I use LinearActivationQuantizedTensor, the x/scale is done outside the class (by input_quant_func) and there is a Dynamo error about the scale during torch.compile. I guess it's because the scale tensors are not in the graph in this case. Putting the scale in the weight tensor solves the problem. And WeightTensorWithLinearActivationScaleMetadata does not quantize the activation.

Do you have any concerns about adding this new class? Thanks.

I think in this case we should be composing WeightTensorWithLinearActivationScaleMetadata and LinearActivationQuantizedTensor together, i.e.

weight = to_affine_quantized(float_weight, ...)
# this will quantize input
# use https://github.com/pytorch/ao/blob/c87cc9b7286a46e9dfc076fa2417eb9b64ccc807/torchao/quantization/weight_tensor_linear_activation_quantization.py#L13 for static quantization
weight = to_linear_activation_quantized_tensor(weight)  # dynamic quant
# this will do x / scale
weight = to_weight_tensor_with_linear_activation_scale_metadata(weight)

At dispatch time, we first unwrap the outermost tensor subclass, which will be WeightTensorWithLinearActivationScaleMetadata, so we'll apply the scale to the activation; then LinearActivationQuantizedTensor, which will quantize the activation; and then AffineQuantizedTensor.

Would this work?

The naming for the different tensor subclasses is a bit confusing right now, I think; we should clean it up a bit later.
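
To make the unwrap order concrete, here is a toy illustration (plain Python wrappers with an explicit linear method, not the actual torchao tensor-subclass dispatch) of how each layer handles its own step and then delegates inward:

import torch

class AffineQuantizedToy:
    # Innermost layer: holds the int8 weight and its scale, does the integer matmul.
    def __init__(self, int_weight, w_scale):
        self.int_weight, self.w_scale = int_weight, w_scale

    def linear(self, x_int8, x_scale):
        acc = x_int8.to(torch.int32) @ self.int_weight.to(torch.int32).t()
        return acc.to(torch.float32) * x_scale * self.w_scale

class ActivationQuantizedToy:
    # Middle layer: quantizes the (already smoothed) activation, then delegates.
    def __init__(self, inner):
        self.inner = inner

    def linear(self, x):
        x_scale = x.abs().amax().clamp(min=1e-8) / 127.0
        x_int8 = torch.clamp(torch.round(x / x_scale), -128, 127).to(torch.int8)
        return self.inner.linear(x_int8, x_scale)

class ActivationScaleToy:
    # Outermost layer: applies the SmoothQuant smoothing (x / s), then delegates.
    def __init__(self, inner, smoothing_factor):
        self.inner, self.s = inner, smoothing_factor

    def linear(self, x):
        return self.inner.linear(x / self.s)

The real subclasses reach the same ordering through tensor-subclass dispatch on F.linear rather than an explicit linear method.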

@Xia-Weiwen (Collaborator Author) commented:
I think in this case we should be composing WeightTensorWithLinearActivationScaleMetadata and LinearActivationQuantizedTensor together, i.e.

weight = to_affine_quantized(float_weight, ...)
# this will quantize input
# use https://github.com/pytorch/ao/blob/c87cc9b7286a46e9dfc076fa2417eb9b64ccc807/torchao/quantization/weight_tensor_linear_activation_quantization.py#L13 for static quantization
weight = to_linear_activation_quantized_tensor(weight)  # dynamic quant
# this will do x / scale
weight = to_weight_tensor_with_linear_activation_scale_metadata(weight)

At dispatch time, we first unwrap the outermost tensor subclass, which will be WeightTensorWithLinearActivationScaleMetadata, so we'll apply the scale to the activation; then LinearActivationQuantizedTensor, which will quantize the activation; and then AffineQuantizedTensor.

Would this work?

The naming for the different tensor subclasses is a bit confusing right now, I think; we should clean it up a bit later.

It works. Thanks

@Xia-Weiwen (Collaborator Author) commented Oct 18, 2024

Hi @jerryzh168, it's weird that if I add these lines https://github.com/pytorch/ao/blob/f595ed41b99685cc16fc480ca2218965bb812bed/torchao/kernel/intmm.py#L142C1-L146C1 to avoid overflow in float16, there is a failure in test_spinquant.py, even though that test does not use float16 at all. There is another failure with CUDA nightly, but its log cannot be loaded, and these failures cannot be reproduced in my local environment or on an AWS instance.
So I have to remove these lines and also remove the tests for float16. If users try to run in fp16, they will get overflow as well. Do you have any suggestions? Thanks.
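
For reference, a minimal sketch of the kind of float16-overflow guard being discussed, assuming a hypothetical standalone helper (the actual change lives inside torchao/kernel/intmm.py and may differ):

import torch

def scaled_int_matmul_fp16_safe(a_int8, b_int8, scales):
    # Hypothetical helper: accumulate in int32 and apply the scales in float32,
    # casting back to the caller's dtype (e.g. float16) only at the end, so the
    # intermediate products cannot exceed float16's maximum of ~65504.
    acc = a_int8.to(torch.int32) @ b_int8.to(torch.int32)
    out = acc.to(torch.float32) * scales.to(torch.float32)
    return out.to(scales.dtype)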

@jerryzh168 (Contributor) commented Oct 18, 2024

Hi @jerryzh168, it's weird that if I add these lines f595ed4/torchao/kernel/intmm.py#L142C1-L146C1 to avoid overflow in float16, there is a failure in test_spinquant.py, even though that test does not use float16 at all.

What is the test failure? Is it possible to do the dtype conversion before calling int_scaled_matmul, e.g. before
y_dot_scaled = int_scaled_matmul(tmp, w_vals_int8_t, x_scales.reshape(-1, 1)) in affine_quantized_tensor.py?

There is another failure with CUDA nightly, but its log cannot be loaded, and these failures cannot be reproduced in my local environment or on an AWS instance. So I have to remove these lines and also remove the tests for float16. If users try to run in fp16, they will get overflow as well. Do you have any suggestions? Thanks.

I just saw the error; it is a Triton error:


  E1018 18:45:40.023529 436 site-packages/torch/_inductor/runtime/triton_heuristics.py:475]     assert lhs.shape[1].value >= 32, "small blocks not supported!"
  E1018 18:45:40.023529 436 site-packages/torch/_inductor/runtime/triton_heuristics.py:475] AssertionError: small blocks not supported!

@Xia-Weiwen (Collaborator Author) commented Oct 19, 2024

What is the test failure? Is it possible to do the dtype conversion before calling int_scaled_matmul, e.g. before y_dot_scaled = int_scaled_matmul(tmp, w_vals_int8_t, x_scales.reshape(-1, 1)) in affine_quantized_tensor.py?

The error is that the results are not all close; one element exceeds the tolerance by a small amount. As for the dtype conversion, I didn't make such changes in affine_quantized_tensor.py 🤔


  E1018 18:45:40.023529 436 site-packages/torch/_inductor/runtime/triton_heuristics.py:475]     assert lhs.shape[1].value >= 32, "small blocks not supported!"
  E1018 18:45:40.023529 436 site-packages/torch/_inductor/runtime/triton_heuristics.py:475] AssertionError: small blocks not supported!

Thanks for the info. Did you see which test case failed?
