Question: How to use quantized tensors? #1006
Well it's actually on purpose that we only have `matmul_t` there: operations are all carried in f32 except the matmul. Re quantized tensor usage, we have quite a bunch of examples now, see in the models directory, all the ones that start with `quantized`. Finally, for creating the quantized weights we use:

```
cargo run --example tensor-tools --release -- \
  quantize --quantization q4k \
  $HOME/tmp/mistral/pytorch_model-00001-of-00002.safetensors \
  $HOME/tmp/mistral/pytorch_model-00002-of-00002.safetensors \
  --out-file /tmp/model.gguf
```
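For anyone landing here, a minimal sketch (not from the thread) of loading one tensor from such a GGUF file and running it through `QMatMul`. The calls (`gguf_file::Content::read`, `Content::tensor`, `QMatMul::from_qtensor`) follow candle-core's quantized module in recent releases and may differ slightly in older ones; the tensor name is just a placeholder.

```rust
use candle_core::quantized::{gguf_file, QMatMul};
use candle_core::{Device, Module, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::Cpu;

    // Open the GGUF file produced by the tensor-tools command above.
    let mut file = std::fs::File::open("/tmp/model.gguf")?;
    let content = gguf_file::Content::read(&mut file)?;

    // Pull a single quantized tensor by name (placeholder name).
    let qweight = content.tensor(&mut file, "model.layer.weight", &device)?;
    let (out_dim, in_dim) = qweight.shape().dims2()?;

    // QMatMul is the one op that runs directly on the quantized data.
    let qmatmul = QMatMul::from_qtensor(qweight)?;

    // Computes x @ w^T with w kept in its quantized form.
    let x = Tensor::randn(0f32, 1f32, (1, in_dim), &device)?;
    let y = qmatmul.forward(&x)?;
    println!("output shape: {:?} (expected (1, {out_dim}))", y.shape());
    Ok(())
}
```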
Hi @LaurentMazare, that is interesting. I have 3 questions:
…
Thank you so much.
So what I meant by "operations are all carried in f32 except the matmul" is that there is no way to add to a quantized tensor; anything other than the matmul has to go through f32.
Ok, thanks for the clarification about the f32 operations. Can you please give an example of how I would add a quantized tensor to a normal tensor?
@EricLBuehler Nearly all quantization formats (e.g. q4k) are block-based, so there are no element-wise operations defined on the quantized data; to add to or otherwise modify a quantized tensor you first dequantize it back to f32.
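To make that concrete, here is a small sketch of the dequantize-then-operate pattern (my wording, not from the thread), assuming candle-core's current `QTensor::quantize` / `QTensor::dequantize` API:

```rust
use candle_core::quantized::{GgmlDType, QTensor};
use candle_core::{Device, Result, Tensor};

// There are no element-wise ops on QTensor itself, so "adding to" a quantized
// tensor means dequantizing back to f32 and using the normal Tensor ops.
fn add_to_quantized(qw: &QTensor, delta: &Tensor, device: &Device) -> Result<Tensor> {
    let w = qw.dequantize(device)?;
    w.add(delta)
}

fn main() -> Result<()> {
    let device = Device::Cpu;
    // The last dim must be a multiple of the block size (256 for the k-quants like q4k).
    let w = Tensor::randn(0f32, 1f32, (256, 256), &device)?;
    let qw = QTensor::quantize(&w, GgmlDType::Q4K)?;

    let delta = Tensor::randn(0f32, 1f32, (256, 256), &device)?;
    let sum = add_to_quantized(&qw, &delta, &device)?;
    println!("sum shape: {:?}", sum.shape());
    Ok(())
}
```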
That makes sense. What is the cost of dequantizing? Are there metrics on that?
I only measured it from the Python side, through the wrapper. For a (1024, 1024) matrix the wrapper needs ~1 ms; if you dequantize directly from the Rust side it will probably be faster, but you should probably cache the dequantized tensors somewhere. Maybe we could create some sort of benchmark to test de-/quantization performance?
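In the absence of such a benchmark, a rough way to measure the dequantization cost from the Rust side (just `std::time::Instant`, so take the numbers with a grain of salt; they depend on machine, dtype and matrix size):

```rust
use candle_core::quantized::{GgmlDType, QTensor};
use candle_core::{Device, Result, Tensor};
use std::time::Instant;

fn main() -> Result<()> {
    let device = Device::Cpu;
    // The (1024, 1024) case discussed above, quantized to q4k.
    let w = Tensor::randn(0f32, 1f32, (1024, 1024), &device)?;
    let qw = QTensor::quantize(&w, GgmlDType::Q4K)?;

    let iters: u32 = 100;
    let start = Instant::now();
    for _ in 0..iters {
        let _ = qw.dequantize(&device)?;
    }
    println!("avg dequantize time: {:?}", start.elapsed() / iters);
    Ok(())
}
```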
Interesting. This is probably too slow to do in a model layer (I wanted to do it in my LoRA layers). I was looking into implementing a quantized option for LoRA layers, although that seems difficult with this new information. Do you have any ideas about how to implement quantized LoRA layers if there is no sum operation?
Well I took a short look at …
Yes, that is what I do too. However, here, there is …
Oh alright, a …
Thank you, that makes sense. Where did you find that method signature for Candle? I could only find one for Vec.
It's a bit hidden in the …
Thank you - I did not see that! This resolves my questions, so I will now close the issue.
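To tie the LoRA part of the thread together, a hedged sketch of what a quantized-base LoRA linear could look like (the `QLoraLinear` name and structure are made up for illustration and are not candle-lora's actual code): the frozen weight is only ever touched through `QMatMul`, the small A/B adapters stay in f32, and the final sum happens on ordinary tensors, so no quantized add is needed.

```rust
use candle_core::quantized::{QMatMul, QTensor};
use candle_core::{Module, Result, Tensor};

// Hypothetical quantized LoRA linear layer (illustration only).
struct QLoraLinear {
    base: QMatMul, // frozen quantized weight, shape (out, in)
    a: Tensor,     // f32 LoRA adapter, shape (rank, in)
    b: Tensor,     // f32 LoRA adapter, shape (out, rank)
    scale: f64,
}

impl QLoraLinear {
    fn new(base_weight: QTensor, a: Tensor, b: Tensor, scale: f64) -> Result<Self> {
        Ok(Self { base: QMatMul::from_qtensor(base_weight)?, a, b, scale })
    }

    fn forward(&self, x: &Tensor) -> Result<Tensor> {
        // Quantized path: x @ W^T, computed directly on the quantized weight.
        let base_out = self.base.forward(x)?;
        // LoRA path entirely in f32: ((x @ A^T) @ B^T) * scale.
        let lora_out = x
            .matmul(&self.a.t()?)?
            .matmul(&self.b.t()?)?
            .affine(self.scale, 0.0)?;
        // Both outputs are regular f32 tensors, so the sum needs no quantized op.
        base_out.add(&lora_out)
    }
}
```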
This makes sense when you are loading a model in 32-bit. But what about when the model file is already quantized, to 16-bit for example? Are you simply expected to dequantize all the weights just to quantize them once more?
@NatanFreeman are you referring to 16-bit GGUF files? In that case (assuming you're not using …
Hello everybody,
I was looking through Candle's quantized tensor code when I noticed that there is only a matmul_t implemented for QuantizedType, and no other operations. Perhaps other operations could be added?
In addition, is there an example of using quantized tensors/converting them from normal tensors?
Thanks!
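On the "converting them from normal tensors" part, a short round-trip sketch assuming candle-core's current quantized API (`QTensor::quantize` into a GGML dtype, `dequantize` back to f32); signatures may differ in older candle versions:

```rust
use candle_core::quantized::{GgmlDType, QTensor};
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::Cpu;
    // An ordinary f32 tensor; the last dim should be a multiple of the
    // 256-element block size used by the k-quants such as q4k.
    let w = Tensor::randn(0f32, 1f32, (512, 512), &device)?;

    // Tensor -> QTensor.
    let qw = QTensor::quantize(&w, GgmlDType::Q4K)?;

    // QTensor -> Tensor, then look at the round-trip error.
    let w2 = qw.dequantize(&device)?;
    let err = w.sub(&w2)?.abs()?.mean_all()?;
    println!("mean abs quantization error: {err}");
    Ok(())
}
```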