Intel® Neural Compressor supports advanced large language model (LLM) quantization technologies, including SmoothQuant (SQ) and Weight-Only Quantization (WOQ),
and has verified a list of LLMs on the 4th Gen Intel® Xeon® Scalable Processor (code-named Sapphire Rapids) with PyTorch,
Intel® Extension for PyTorch, and Intel® Extension for Transformers.
This document publishes the specific recipes we achieved for popular LLMs, helping users quickly obtain an optimized LLM within 1% accuracy loss.
Notes:
- The quantization algorithms are provided by Intel® Neural Compressor, and the evaluation functions are provided by Intel® Extension for Transformers.
- The model list is continuously updated; expect to find more LLMs in the future.
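For illustration, below is a minimal sketch of applying SmoothQuant INT8 through the Intel® Neural Compressor 2.x `PostTrainingQuantConfig` API. The model name, alpha value, and calibration loader are placeholder assumptions, not the verified recipes; consult the recipe page for the per-model settings.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor import PostTrainingQuantConfig, quantization

model_name = "facebook/opt-1.3b"  # placeholder; any model from the table below
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Minimal duck-typed calibration dataloader: INC expects a `batch_size`
# attribute and an iterable of model inputs. Real recipes calibrate on a
# proper dataset; this single prompt is only for demonstration.
class CalibDataloader:
    batch_size = 1
    def __iter__(self):
        yield tokenizer("Intel Neural Compressor is", return_tensors="pt")["input_ids"]

# SmoothQuant migrates quantization difficulty from activations to weights;
# alpha controls the migration strength. 0.5 is a common starting point,
# but the verified per-model alpha lives in the recipe page.
conf = PostTrainingQuantConfig(
    backend="ipex",
    recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5}},
)
q_model = quantization.fit(model, conf, calib_dataloader=CalibDataloader())
q_model.save("./saved_sq_int8")
```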
| Models | SQ INT8 | WOQ INT8 | WOQ INT4 |
|---|---|---|---|
| EleutherAI/gpt-j-6b | ✔ | ✔ | ✔ |
| facebook/opt-1.3b | ✔ | ✔ | ✔ |
| facebook/opt-30b | ✔ | ✔ | ✔ |
| meta-llama/Llama-2-7b-hf | ✔ | ✔ | ✔ |
| meta-llama/Llama-2-13b-hf | ✔ | ✔ | ✔ |
| meta-llama/Llama-2-70b-hf | ✔ | ✔ | ✔ |
| tiiuae/falcon-40b | ✔ | ✔ | ✔ |
Detailed recipes can be found HERE.
Notes:
- This model list comes from Intel® Extension for PyTorch (IPEX).
- WOQ INT4 recipes will be published soon.
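Similarly, below is a minimal sketch of Weight-Only Quantization with the same INC 2.x API. The RTN algorithm, bit width, and group size here are illustrative assumptions, not the verified (or forthcoming) recipe settings.

```python
from transformers import AutoModelForCausalLM
from neural_compressor import PostTrainingQuantConfig, quantization

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")  # placeholder

# Weight-only quantization: only weights are quantized, activations stay in
# higher precision. RTN (round-to-nearest) needs no calibration data; the
# published recipes may use other algorithms and settings.
conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # apply to all matched ops
            "weight": {
                "bits": 8,         # set to 4 for WOQ INT4
                "group_size": 32,  # per-group scaling granularity; -1 = per-channel
                "scheme": "sym",
                "algorithm": "RTN",
            },
        },
    },
)
q_model = quantization.fit(model, conf)
q_model.save("./saved_woq_int8")
```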