[WIP] SmoothQuant using tensor subclassing #1030

Draft · wants to merge 38 commits into main
Conversation

@Xia-Weiwen (Collaborator) commented Oct 8, 2024

Still WIP

The implementation of SmoothQuant with tensor subclassing (AffineQuantizedTensor) is similar to that of AWQ, with the following differences:

  • SmoothQuant supports both static and dynamic quantization of activations, while AWQ only uses dynamic quantization
  • Matmul is computed in int8 instead of floating point (at least at the op level)
  • The smoothing factor is calculated differently from the equalization scales of AWQ (see the sketch below)
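
For reference, a minimal sketch of how a SmoothQuant-style smoothing factor can be derived from per-channel activation and weight maxima (the helper name and the default alpha are illustrative, not necessarily this PR's API):

import torch

def compute_smoothing_factor(act_abs_max: torch.Tensor,
                             weight: torch.Tensor,
                             alpha: float = 0.5) -> torch.Tensor:
    # Per input channel j: s_j = max|X_j|**alpha / max|W_j|**(1 - alpha).
    # Activations are divided by s and weights are multiplied by s,
    # shifting quantization difficulty from activations to weights.
    w_abs_max = weight.abs().amax(dim=0)  # weight shape: (out_features, in_features)
    eps = torch.finfo(weight.dtype).eps
    return (act_abs_max.clamp(min=eps).pow(alpha)
            / w_abs_max.clamp(min=eps).pow(1.0 - alpha))

AWQ, roughly speaking, instead searches for per-channel equalization scales that minimize the output error of the quantized layer, which is why the two flows compute these factors differently.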

pytorch-bot commented Oct 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1030

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit fa1144c with merge base d4b2f33, one new job failure was reported.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Oct 8, 2024
return insert_subclass


def save_smooth_quant_recipe(model: torch.nn.Module, save_path: str) -> Dict[str, torch.Tensor]:
Contributor:

Do we need this? Or is just saving the state_dict of the observed model enough?

Collaborator Author:

We want to have an API to modify (tune) quantization parameters, i.e. the recipe here. Do you have any concern about adding this API?

Contributor:

So the state_dict is supposed to be used by other APIs to tune quantization parameters? I think that's fine if you have this use case in mind. Is the model with SmoothQuantObservedLinear not serializable by itself?

Collaborator Author:

SmoothQuantObservedLinear is serializable. However, a recipe is more flexible for tuning parameters. Thanks.
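
For context, a rough usage sketch of the recipe workflow being discussed (the import path, the file name, and the load_smooth_quant_recipe counterpart are assumptions for illustration, not necessarily the final API):

# Hypothetical usage, assuming `observed_model` has already been calibrated
# with SmoothQuantObservedLinear modules inserted.
from torchao.prototype.smoothquant import save_smooth_quant_recipe  # import path assumed

recipe = save_smooth_quant_recipe(observed_model, "smoothquant_recipe.json")
# `recipe` is a Dict[str, torch.Tensor], presumably keyed by layer name; its
# entries (e.g. smoothing factors, activation scales) can be edited offline and
# then re-applied with a matching, hypothetical load_smooth_quant_recipe(model, path).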

@Xia-Weiwen (Collaborator Author) commented:
Hi @jerryzh168, I added a new tensor subclass LinearActivationScaleQuantizedTensor to support x -> x/scale -> quantize x for torch.compile.

If I use LinearActivationQuantizedTensor, the x/scale is done outside the class (by input_quant_func) and there is a Dynamo error about the scale during torch.compile. I guess it's because the scale tensors are not in the graph in this case. Putting the scale in the weight tensor solves the problem. And WeightTensorWithLinearActivationScaleMetadata does not quantize the activation.

Do you have any concerns about adding this new class? Thanks.

@jerryzh168 (Contributor) commented Oct 16, 2024

Hi @jerryzh168, I added a new tensor subclass LinearActivationScaleQuantizedTensor to support x -> x/scale -> quantize x for torch.compile.

If I use LinearActivationQuantizedTensor, the x/scale is done outside the class (by input_quant_func) and there is a Dynamo error about the scale during torch.compile. I guess it's because the scale tensors are not in the graph in this case. Putting the scale in the weight tensor solves the problem. And WeightTensorWithLinearActivationScaleMetadata does not quantize the activation.

Do you have any concerns about adding this new class? Thanks.

I think in this case we should be composing WeightTensorWithLinearActivationScaleMetadata and LinearActivationQuantizedTensor together, i.e.

weight = to_affine_quantized(float_weight, ...)
# this will quantize input
# use https://github.com/pytorch/ao/blob/c87cc9b7286a46e9dfc076fa2417eb9b64ccc807/torchao/quantization/weight_tensor_linear_activation_quantization.py#L13 for static quantization
weight = to_linear_activation_quantized_tensor(weight)  # dynamic quant
# this will do x / scale
weight = to_weight_tensor_with_linear_activation_scale_metadata(weight)

At dispatch time, we first unwrap the outermost tensor subclass, which will be WeightTensorWithLinearActivationScaleMetadata, so we'll apply the scale to the activation; then LinearActivationQuantizedTensor, which will quantize the activation; and then AffineQuantizedTensor.

Would this work?

The naming for the different tensor subclasses is a bit confusing right now, I think; we should clean it up a bit later.
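
To make the unwrap order concrete, here is a toy illustration (plain Python wrappers with an explicit linear method, not the actual torchao tensor-subclass dispatch) of how each layer handles its own step and then delegates inward:

import torch

class AffineQuantizedToy:
    # Innermost layer: holds the int8 weight and its scale, does the integer matmul.
    def __init__(self, int_weight, w_scale):
        self.int_weight, self.w_scale = int_weight, w_scale

    def linear(self, x_int8, x_scale):
        acc = x_int8.to(torch.int32) @ self.int_weight.to(torch.int32).t()
        return acc.to(torch.float32) * x_scale * self.w_scale

class ActivationQuantizedToy:
    # Middle layer: quantizes the (already smoothed) activation, then delegates.
    def __init__(self, inner):
        self.inner = inner

    def linear(self, x):
        x_scale = x.abs().amax().clamp(min=1e-8) / 127.0
        x_int8 = torch.clamp(torch.round(x / x_scale), -128, 127).to(torch.int8)
        return self.inner.linear(x_int8, x_scale)

class ActivationScaleToy:
    # Outermost layer: applies the SmoothQuant smoothing (x / s), then delegates.
    def __init__(self, inner, smoothing_factor):
        self.inner, self.s = inner, smoothing_factor

    def linear(self, x):
        return self.inner.linear(x / self.s)

The real subclasses reach the same ordering through tensor-subclass dispatch on F.linear rather than an explicit linear method.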

@Xia-Weiwen (Collaborator Author) commented:
I think in this case we should be composing WeightTensorWithLinearActivationScaleMetadata and LinearActivationQuantizedTensor together, i.e.

weight = to_affine_quantized(float_weight, ...)
# this will quantize input
# use https://github.com/pytorch/ao/blob/c87cc9b7286a46e9dfc076fa2417eb9b64ccc807/torchao/quantization/weight_tensor_linear_activation_quantization.py#L13 for static quantization
weight = to_linear_activation_quantized_tensor(weight)  # dynamic quant
# this will do x / scale
weight = to_weight_tensor_with_linear_activation_scale_metadata(weight)

At dispatch time, we first unwrap the outermost tensor subclass, which will be WeightTensorWithLinearActivationScaleMetadata, so we'll apply the scale to the activation; then LinearActivationQuantizedTensor, which will quantize the activation; and then AffineQuantizedTensor.

Would this work?

The naming for the different tensor subclasses is a bit confusing right now, I think; we should clean it up a bit later.

It works. Thanks

@Xia-Weiwen (Collaborator Author) commented Oct 18, 2024

Hi @jerryzh168, it's weird that if I add these lines https://github.com/pytorch/ao/blob/f595ed41b99685cc16fc480ca2218965bb812bed/torchao/kernel/intmm.py#L142C1-L146C1 to avoid overflow in float16, there is a failure in test_spinquant.py, even though that test does not use float16 at all. There is another failure with CUDA nightly, but its log cannot be loaded, and these failures cannot be reproduced in my local environment or on an AWS instance.
So I have to remove these lines and also remove the tests for float16. If users try to run in fp16, they will get overflow as well. Do you have any suggestions? Thanks.
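
For reference, a minimal sketch of the kind of float16-overflow guard being discussed, assuming a hypothetical standalone helper (the actual change lives inside torchao/kernel/intmm.py and may differ):

import torch

def scaled_int_matmul_fp16_safe(a_int8, b_int8, scales):
    # Hypothetical helper: accumulate in int32 and apply the scales in float32,
    # casting back to the caller's dtype (e.g. float16) only at the end, so the
    # intermediate products cannot exceed float16's maximum of ~65504.
    acc = a_int8.to(torch.int32) @ b_int8.to(torch.int32)
    out = acc.to(torch.float32) * scales.to(torch.float32)
    return out.to(scales.dtype)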

@jerryzh168 (Contributor) commented Oct 18, 2024

Hi @jerryzh168, it's weird that if I add these lines f595ed4/torchao/kernel/intmm.py#L142C1-L146C1 to avoid overflow in float16, there is a failure in test_spinquant.py, even though that test does not use float16 at all.

What is the test failure? Is it possible to do the dtype conversion before calling int_scaled_matmul, e.g. before
y_dot_scaled = int_scaled_matmul(tmp, w_vals_int8_t, x_scales.reshape(-1, 1)) in affine_quantized_tensor.py?

There is another failure with CUDA nightly, but its log cannot be loaded, and these failures cannot be reproduced in my local environment or on an AWS instance. So I have to remove these lines and also remove the tests for float16. If users try to run in fp16, they will get overflow as well. Do you have any suggestions? Thanks.

I just saw the error; it is a Triton error:


  E1018 18:45:40.023529 436 site-packages/torch/_inductor/runtime/triton_heuristics.py:475]     assert lhs.shape[1].value >= 32, "small blocks not supported!"
  E1018 18:45:40.023529 436 site-packages/torch/_inductor/runtime/triton_heuristics.py:475] AssertionError: small blocks not supported!

@Xia-Weiwen (Collaborator Author) commented Oct 19, 2024

What is the test failure? Is it possible to do the dtype conversion before calling int_scaled_matmul, e.g. before y_dot_scaled = int_scaled_matmul(tmp, w_vals_int8_t, x_scales.reshape(-1, 1)) in affine_quantized_tensor.py?

The error is that the results are not all close; one element exceeds the tolerance by a small amount. As for the dtype conversion, I didn't make such changes in affine_quantized_tensor.py 🤔


  E1018 18:45:40.023529 436 site-packages/torch/_inductor/runtime/triton_heuristics.py:475]     assert lhs.shape[1].value >= 32, "small blocks not supported!"
  E1018 18:45:40.023529 436 site-packages/torch/_inductor/runtime/triton_heuristics.py:475] AssertionError: small blocks not supported!

Thanks for the info. Did you see which test case failed?
