Question: How to use quantized tensors? #1006
Well it's actually on purpose that we only have `matmul_t` there: operations are all carried in f32 except the matmul. Re quantized tensor usage, we have quite a bunch of examples now, see in the models directory, all the ones that start with `quantized`. Finally, for creating the quantized weights we use:

```
cargo run --example tensor-tools --release -- \
  quantize --quantization q4k \
  $HOME/tmp/mistral/pytorch_model-00001-of-00002.safetensors \
  $HOME/tmp/mistral/pytorch_model-00002-of-00002.safetensors \
  --out-file /tmp/model.gguf
```
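For anyone landing here, a minimal sketch (not from the thread) of loading one tensor from such a GGUF file and running it through `QMatMul`. The calls (`gguf_file::Content::read`, `Content::tensor`, `QMatMul::from_qtensor`) follow candle-core's quantized module in recent releases and may differ slightly in older ones; the tensor name is just a placeholder.

```rust
use candle_core::quantized::{gguf_file, QMatMul};
use candle_core::{Device, Module, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::Cpu;

    // Open the GGUF file produced by the tensor-tools command above.
    let mut file = std::fs::File::open("/tmp/model.gguf")?;
    let content = gguf_file::Content::read(&mut file)?;

    // Pull a single quantized tensor by name (placeholder name).
    let qweight = content.tensor(&mut file, "model.layer.weight", &device)?;
    let (out_dim, in_dim) = qweight.shape().dims2()?;

    // QMatMul is the one op that runs directly on the quantized data.
    let qmatmul = QMatMul::from_qtensor(qweight)?;

    // Computes x @ w^T with w kept in its quantized form.
    let x = Tensor::randn(0f32, 1f32, (1, in_dim), &device)?;
    let y = qmatmul.forward(&x)?;
    println!("output shape: {:?} (expected (1, {out_dim}))", y.shape());
    Ok(())
}
```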
Hi @LaurentMazare, that is interesting. I have 3 questions:
…
Thank you so much.
So what I meant by "operations are all carried in f32 except the matmul" is that there is no way to add to a quantized tensor; anything other than the matmul has to go through f32.
Ok, thanks for the clarification about the f32 operations. Can you please give an example of how I would add a quantized tensor to a normal tensor?
@EricLBuehler Nearly all quantization formats (e.g. q4k) are block-based, so there are no element-wise operations defined on the quantized data; to add to or otherwise modify a quantized tensor you first dequantize it back to f32.
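To make that concrete, here is a small sketch of the dequantize-then-operate pattern (my wording, not from the thread), assuming candle-core's current `QTensor::quantize` / `QTensor::dequantize` API:

```rust
use candle_core::quantized::{GgmlDType, QTensor};
use candle_core::{Device, Result, Tensor};

// There are no element-wise ops on QTensor itself, so "adding to" a quantized
// tensor means dequantizing back to f32 and using the normal Tensor ops.
fn add_to_quantized(qw: &QTensor, delta: &Tensor, device: &Device) -> Result<Tensor> {
    let w = qw.dequantize(device)?;
    w.add(delta)
}

fn main() -> Result<()> {
    let device = Device::Cpu;
    // The last dim must be a multiple of the block size (256 for the k-quants like q4k).
    let w = Tensor::randn(0f32, 1f32, (256, 256), &device)?;
    let qw = QTensor::quantize(&w, GgmlDType::Q4K)?;

    let delta = Tensor::randn(0f32, 1f32, (256, 256), &device)?;
    let sum = add_to_quantized(&qw, &delta, &device)?;
    println!("sum shape: {:?}", sum.shape());
    Ok(())
}
```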
That makes sense. What is the cost of dequantizing? Are there metrics on that?
I only measured it from the Python side, through the wrapper. For a (1024, 1024) matrix the wrapper needs ~1 ms; if you dequantize directly from the Rust side it will probably be faster, but you should probably cache the dequantized tensors somewhere. Maybe we could create some sort of benchmark to test de-/quantization performance?
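In the absence of such a benchmark, a rough way to measure the dequantization cost from the Rust side (just `std::time::Instant`, so take the numbers with a grain of salt; they depend on machine, dtype and matrix size):

```rust
use candle_core::quantized::{GgmlDType, QTensor};
use candle_core::{Device, Result, Tensor};
use std::time::Instant;

fn main() -> Result<()> {
    let device = Device::Cpu;
    // The (1024, 1024) case discussed above, quantized to q4k.
    let w = Tensor::randn(0f32, 1f32, (1024, 1024), &device)?;
    let qw = QTensor::quantize(&w, GgmlDType::Q4K)?;

    let iters: u32 = 100;
    let start = Instant::now();
    for _ in 0..iters {
        let _ = qw.dequantize(&device)?;
    }
    println!("avg dequantize time: {:?}", start.elapsed() / iters);
    Ok(())
}
```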
Interesting. This is probably too slow to do in a model layer (I wanted to do it in my LoRA layers). I was looking into implementing a quantized option for LoRA layers, although that seems difficult with this new information. Do you have any ideas about how to implement quantized LoRA layers if there is no sum operation?
Well I took a short look at …
Yes, that is what I do too. However, here, there is …
Oh alright, a …
Thank you, that makes sense. Where did you find that method signature for Candle? I could only find one for Vec.
It's a bit hidden in the …
Thank you - I did not see that! This resolves my questions, so I will now close the issue.
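To tie the LoRA part of the thread together, a hedged sketch of what a quantized-base LoRA linear could look like (the `QLoraLinear` name and structure are made up for illustration and are not candle-lora's actual code): the frozen weight is only ever touched through `QMatMul`, the small A/B adapters stay in f32, and the final sum happens on ordinary tensors, so no quantized add is needed.

```rust
use candle_core::quantized::{QMatMul, QTensor};
use candle_core::{Module, Result, Tensor};

// Hypothetical quantized LoRA linear layer (illustration only).
struct QLoraLinear {
    base: QMatMul, // frozen quantized weight, shape (out, in)
    a: Tensor,     // f32 LoRA adapter, shape (rank, in)
    b: Tensor,     // f32 LoRA adapter, shape (out, rank)
    scale: f64,
}

impl QLoraLinear {
    fn new(base_weight: QTensor, a: Tensor, b: Tensor, scale: f64) -> Result<Self> {
        Ok(Self { base: QMatMul::from_qtensor(base_weight)?, a, b, scale })
    }

    fn forward(&self, x: &Tensor) -> Result<Tensor> {
        // Quantized path: x @ W^T, computed directly on the quantized weight.
        let base_out = self.base.forward(x)?;
        // LoRA path entirely in f32: ((x @ A^T) @ B^T) * scale.
        let lora_out = x
            .matmul(&self.a.t()?)?
            .matmul(&self.b.t()?)?
            .affine(self.scale, 0.0)?;
        // Both outputs are regular f32 tensors, so the sum needs no quantized op.
        base_out.add(&lora_out)
    }
}
```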
This makes sense when you are loading a model in 32-bit. But what about when the model file is already quantized, to 16-bit for example? Are you simply expected to dequantize all the weights just to quantize them once more?
@NatanFreeman are you referring to 16-bit GGUF files? In that case (assuming you're not using …
Hello everybody,
I was looking through Candle's quantized tensor code when I noticed that there is only a matmul_t implemented for QuantizedType, and no other operations. Perhaps other operations could be added?
In addition, is there an example of using quantized tensors/converting them from normal tensors?
Thanks!
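On the "converting them from normal tensors" part, a short round-trip sketch assuming candle-core's current quantized API (`QTensor::quantize` into a GGML dtype, `dequantize` back to f32); signatures may differ in older candle versions:

```rust
use candle_core::quantized::{GgmlDType, QTensor};
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::Cpu;
    // An ordinary f32 tensor; the last dim should be a multiple of the
    // 256-element block size used by the k-quants such as q4k.
    let w = Tensor::randn(0f32, 1f32, (512, 512), &device)?;

    // Tensor -> QTensor.
    let qw = QTensor::quantize(&w, GgmlDType::Q4K)?;

    // QTensor -> Tensor, then look at the round-trip error.
    let w2 = qw.dequantize(&device)?;
    let err = w.sub(&w2)?.abs()?.mean_all()?;
    println!("mean abs quantization error: {err}");
    Ok(())
}
```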