
fix Half-Quadratic Quantization and Dequantization on CPU #873

Merged
merged 8 commits into EricLBuehler:master on Oct 28, 2024

Conversation

@haricot (Contributor) commented Oct 21, 2024

This confirms that test_bitpack was previously running only on non-CPU hardware, so the CPU path was never exercised. To address this, we can fix the CPU path by ensuring contiguous data slices, as sketched below.
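For illustration, a minimal sketch of what "ensuring contiguous data slices" means in this context, assuming a candle-style Tensor API (dequant_cpu_safe is a hypothetical name, not the actual function changed in this PR):

use candle_core::{Result, Tensor};

// Force a standard row-major layout before reading the raw CPU slice;
// a transposed or narrowed tensor's storage is not laid out contiguously.
fn dequant_cpu_safe(w_q: &Tensor) -> Result<Tensor> {
    // contiguous() is a no-op for tensors that already have a standard
    // layout, and copies the data into a fresh buffer otherwise.
    let w_q = w_q.contiguous()?;
    // ... bit-unpack and dequantize over w_q's now-contiguous storage ...
    Ok(w_q)
}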


github-actions bot commented Oct 21, 2024

Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                   12          105          104            0            1
 Python                 52         2280         1940           68          272
 TOML                   20          630          564            2           64
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               38         2803            0         2132          671
 |- BASH                 6          103          100            0            3
 |- JSON                 1           12           12            0            0
 |- Python               5           92           82            0           10
 |- Rust                 9          322          274            0           48
 |- TOML                 2           75           63            0           12
 (Total)                           3407          531         2132          744
-------------------------------------------------------------------------------
 Rust                  271        79722        71594         1674         6454
 |- Markdown           132         1361           25         1241           95
 (Total)                          81083        71619         2915         6549
===============================================================================
 Total                 404        86072        74643         3878         7551
===============================================================================
  

@haricot haricot changed the title Optimizing HQQ quantization on CPU Optimizing Half-Quadratic Quantization on CPU Oct 21, 2024
@EricLBuehler (Owner) left a comment

Hi @haricot! Thanks for the PR. Can you please update it so it also tests 8 bit quantization? Thanks!

@EricLBuehler (Owner) commented

@haricot were you planning on implementing HQQ for non-CUDA devices in this PR? The name seems to indicate so, I was just wondering!

@haricot (Contributor, Author) commented Oct 22, 2024

Hi @EricLBuehler!

My first goal was to make quantization work on my device; in fact, I could not quantize the models at all because I hit an OOM error.
With your models quantized on GPU, inference worked correctly with 8 GB of VRAM:

cargo run -r --features cuda -- --pa-gpu-mem-usage 0.5 -i plain -m '/local_model_path/' --dtype f16 --from-uqff /model/llm-hqq4.uqff

With a model quantized on CPU, inference on either CPU or GPU produced inconsistent text. After correcting that, I realized that CPU inference with the quantized models was still not optimal, and that this code path was originally written to be optimized on GPU.

There is also a small optimization: in the dequantize function, if the scales and zeros are in f32, it dequantizes to f32 even when another dtype is requested (mistralrs-quant/src/hqq/quantize.rs#L15). If we change this, it will use the correct dtype on both CPU and GPU:

let this = Self {
    w_q: quant_w,
    // Move zeros and scales to the target device and cast them to the
    // requested dtype instead of leaving them in f32.
    zeros: zero.to_device(device)?.to_dtype(dtype)?,
    scales: (1.0 / scale)?.to_device(device)?.to_dtype(dtype)?,
    bias: None,
    w_shape: input.shape().clone(),
    cfg,
};

This would mean either integrating scales for every supported dtype into the UQFF format (or only the specific ones needed), or, more simply, converting the dtypes dynamically depending on the dtype chosen at load time.
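To illustrate the dynamic option, a minimal sketch assuming a candle-style tensor API; dequantize_to is a hypothetical helper, not the crate's actual API:

use candle_core::{DType, Result, Tensor};

// Hypothetical sketch of the "dynamic" option: cast the stored scales and
// zero-points to the requested dtype at dequantize time, so the UQFF file
// does not need to carry scales in every possible dtype.
fn dequantize_to(w_q: &Tensor, scales: &Tensor, zeros: &Tensor, dtype: DType) -> Result<Tensor> {
    let scales = scales.to_dtype(dtype)?;
    let zeros = zeros.to_dtype(dtype)?;
    // w = (w_q - z) * s, computed entirely in the requested dtype.
    // (The stored scales are already the reciprocal, per the snippet above.)
    w_q.to_dtype(dtype)?
        .broadcast_sub(&zeros)?
        .broadcast_mul(&scales)
}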

@haricot haricot changed the title Optimizing Half-Quadratic Quantization on CPU fix Half-Quadratic Quantization and Dequantization on CPU Oct 22, 2024
@EricLBuehler (Owner) left a comment

Hi @haricot! All tests pass on CPU (the target of this PR) and the changes look good. Merging now, thanks for the contribution.

@EricLBuehler EricLBuehler merged commit 76b98e9 into EricLBuehler:master Oct 28, 2024
12 checks passed
Aveline67 pushed a commit to Aveline67/mistral.rs that referenced this pull request Nov 7, 2024
fix Half-Quadratic Quantization and Dequantization on CPU (EricLBuehler#873)

* test_bitpack cpu/cuda

* add test_bitpack 8 bit quantization cpu/cuda

* fix unnecessary nested cfg attributes

* fix alloc/init cpu dequantize hqq

* ensuring contiguous data slices

* Revert "ensuring contiguous data slices to see result in CI"

* code cleanup

* ensuring contiguous data slices