AWQ Support #458

Open · wants to merge 2 commits into habana_main
Conversation

maktukmak

This PR enables loading AWQ quantized models and running weight-only quantized inference on HPU.

Currently, it works only for BF16 inference, because the torch.ops.hpu.convert_from_uint4 kernel does not support FP16.
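
For illustration, here is a minimal sketch of the dequantize-then-matmul path this kernel enables. The argument order of torch.ops.hpu.convert_from_uint4 and the packed tensor layouts are assumptions for illustration, not taken from the diff:

```python
import torch
import habana_frameworks.torch.core  # noqa: F401  (registers torch.ops.hpu on Gaudi)


def awq_matmul_bf16(x: torch.Tensor,
                    qweight: torch.Tensor,
                    scales: torch.Tensor,
                    qzeros: torch.Tensor) -> torch.Tensor:
    # Assumed signature: unpack the int32-packed uint4 weights and apply the
    # per-group scales and zero points, returning a dense weight tensor in
    # the requested dtype. BF16 only: the kernel does not support FP16.
    weight = torch.ops.hpu.convert_from_uint4(qweight, scales, qzeros,
                                              torch.bfloat16)
    return torch.matmul(x.to(torch.bfloat16), weight)
```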

Tested on TheBloke/Llama-2-70B-Chat-AWQ, and it worked.
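
For reference, a minimal usage sketch of the path this PR enables, using the standard vLLM entry point with the model tested above:

```python
from vllm import LLM, SamplingParams

# dtype must be bfloat16 for now: the uint4 conversion kernel lacks FP16 support.
llm = LLM(model="TheBloke/Llama-2-70B-Chat-AWQ",
          quantization="awq",
          dtype="bfloat16")

outputs = llm.generate(["Explain AWQ quantization in one sentence."],
                       SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```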

@@ -0,0 +1,222 @@
from typing import Any, Dict, List, Optional

michalkuligowski (Nov 6, 2024)

@kzawora-intel Shouldn't it be added to vllm-hpu-extension?
