
fix Half-Quadratic Quantization and Dequantization on CPU #873

Merged
merged 8 commits into EricLBuehler:master on Oct 28, 2024

Conversation

@haricot (Contributor) commented Oct 21, 2024

This confirms that test_bitpack was previously running only on non-CPU hardware, so the CPU path was never exercised. To address this, we can fix the CPU path by ensuring contiguous data slices, as sketched below.
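For illustration, a minimal sketch of what "ensuring contiguous data slices" means in this context, assuming a candle-style Tensor API (dequant_cpu_safe is a hypothetical name, not the actual function changed in this PR):

use candle_core::{Result, Tensor};

// Force a standard row-major layout before reading the raw CPU slice;
// a transposed or narrowed tensor's storage is not laid out contiguously.
fn dequant_cpu_safe(w_q: &Tensor) -> Result<Tensor> {
    // contiguous() is a no-op for tensors that already have a standard
    // layout, and copies the data into a fresh buffer otherwise.
    let w_q = w_q.contiguous()?;
    // ... bit-unpack and dequantize over w_q's now-contiguous storage ...
    Ok(w_q)
}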


github-actions bot commented Oct 21, 2024

Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                   12          105          104            0            1
 Python                 52         2280         1940           68          272
 TOML                   20          630          564            2           64
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               38         2803            0         2132          671
 |- BASH                 6          103          100            0            3
 |- JSON                 1           12           12            0            0
 |- Python               5           92           82            0           10
 |- Rust                 9          322          274            0           48
 |- TOML                 2           75           63            0           12
 (Total)                           3407          531         2132          744
-------------------------------------------------------------------------------
 Rust                  271        79722        71594         1674         6454
 |- Markdown           132         1361           25         1241           95
 (Total)                          81083        71619         2915         6549
===============================================================================
 Total                 404        86072        74643         3878         7551
===============================================================================
  

@haricot haricot changed the title Optimizing HQQ quantization on CPU Optimizing Half-Quadratic Quantization on CPU Oct 21, 2024
@EricLBuehler (Owner) left a comment

Hi @haricot! Thanks for the PR. Can you please update it so it also tests 8 bit quantization? Thanks!

@EricLBuehler (Owner) commented

@haricot were you planning on implementing HQQ for non-CUDA devices in this PR? The name seems to indicate so, I was just wondering!

@haricot (Contributor, Author) commented Oct 22, 2024

Hi @EricLBuehler!

My first goal was to make quantization work on my device; in fact, I could not quantize the models at all because I hit an OOM error.
With your models quantized on GPU, inference worked correctly with 8 GB of VRAM:

cargo run -r --features cuda -- --pa-gpu-mem-usage 0.5 -i plain -m '/local_model_path/' --dtype f16 --from-uqff /model/llm-hqq4.uqff

With a model quantized on CPU, inference on either CPU or GPU produced inconsistent text. After correcting that, I realized that CPU inference with the quantized models was still not optimal, and that this code path was originally written to be optimized on GPU.

There is also a small optimization: in the dequantize function, if the scales and zeros are in f32, it dequantizes to f32 even when another dtype is requested (mistralrs-quant/src/hqq/quantize.rs#L15). If we change this, it will use the correct dtype on both CPU and GPU:

let this = Self {
    w_q: quant_w,
    // Move zeros and scales to the target device and cast them to the
    // requested dtype instead of leaving them in f32.
    zeros: zero.to_device(device)?.to_dtype(dtype)?,
    scales: (1.0 / scale)?.to_device(device)?.to_dtype(dtype)?,
    bias: None,
    w_shape: input.shape().clone(),
    cfg,
};

This would mean either integrating scales for every supported dtype into the UQFF format (or only the specific ones needed), or, more simply, converting the dtypes dynamically depending on the dtype chosen at load time.
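To illustrate the dynamic option, a minimal sketch assuming a candle-style tensor API; dequantize_to is a hypothetical helper, not the crate's actual API:

use candle_core::{DType, Result, Tensor};

// Hypothetical sketch of the "dynamic" option: cast the stored scales and
// zero-points to the requested dtype at dequantize time, so the UQFF file
// does not need to carry scales in every possible dtype.
fn dequantize_to(w_q: &Tensor, scales: &Tensor, zeros: &Tensor, dtype: DType) -> Result<Tensor> {
    let scales = scales.to_dtype(dtype)?;
    let zeros = zeros.to_dtype(dtype)?;
    // w = (w_q - z) * s, computed entirely in the requested dtype.
    // (The stored scales are already the reciprocal, per the snippet above.)
    w_q.to_dtype(dtype)?
        .broadcast_sub(&zeros)?
        .broadcast_mul(&scales)
}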

@haricot haricot changed the title Optimizing Half-Quadratic Quantization on CPU fix Half-Quadratic Quantization and Dequantization on CPU Oct 22, 2024
@EricLBuehler (Owner) left a comment

Hi @haricot! All tests pass on CPU (the target of this PR) and the changes look good. Merging now, thanks for the contribution.

@EricLBuehler EricLBuehler merged commit 76b98e9 into EricLBuehler:master Oct 28, 2024
12 checks passed
Aveline67 pushed a commit to Aveline67/mistral.rs that referenced this pull request Nov 7, 2024
fix Half-Quadratic Quantization and Dequantization on CPU (EricLBuehler#873)

* test_bitpack cpu/cuda

* add test_bitpack 8 bit quantization cpu/cuda

* fix unnecessary nested cfg attributes

* fix alloc/init cpu dequantize hqq

* ensuring contiguous data slices

* Revert "ensuring contiguous data slices to see result in CI"

* code cleanup

* ensuring contiguous data slices