Intel® Neural Compressor supports advanced large language model (LLM) quantization technologies, including SmoothQuant (SQ) and Weight-Only Quantization (WOQ),
and has verified a list of LLMs on the 4th Gen Intel® Xeon® Scalable Processor (code-named Sapphire Rapids) with PyTorch,
Intel® Extension for PyTorch, and Intel® Extension for Transformers.
This document publishes the specific recipes we achieved for popular LLMs, helping users quickly obtain an optimized LLM within 1% accuracy loss.
Notes:
- The quantization algorithms are provided by Intel® Neural Compressor, and the evaluation functions are provided by Intel® Extension for Transformers.
- The model list is continuously updated; expect to find more LLMs in the future.
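For illustration, below is a minimal sketch of applying SmoothQuant INT8 through the Intel® Neural Compressor 2.x `PostTrainingQuantConfig` API. The model name, alpha value, and calibration loader are placeholder assumptions, not the verified recipes; consult the recipe page for the per-model settings.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor import PostTrainingQuantConfig, quantization

model_name = "facebook/opt-1.3b"  # placeholder; any model from the table below
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Minimal duck-typed calibration dataloader: INC expects a `batch_size`
# attribute and an iterable of model inputs. Real recipes calibrate on a
# proper dataset; this single prompt is only for demonstration.
class CalibDataloader:
    batch_size = 1
    def __iter__(self):
        yield tokenizer("Intel Neural Compressor is", return_tensors="pt")["input_ids"]

# SmoothQuant migrates quantization difficulty from activations to weights;
# alpha controls the migration strength. 0.5 is a common starting point,
# but the verified per-model alpha lives in the recipe page.
conf = PostTrainingQuantConfig(
    backend="ipex",
    recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5}},
)
q_model = quantization.fit(model, conf, calib_dataloader=CalibDataloader())
q_model.save("./saved_sq_int8")
```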
| Models | SQ INT8 | WOQ INT8 | WOQ INT4 |
|---|---|---|---|
| EleutherAI/gpt-j-6b | ✔ | ✔ | ✔ |
| facebook/opt-1.3b | ✔ | ✔ | ✔ |
| facebook/opt-30b | ✔ | ✔ | ✔ |
| meta-llama/Llama-2-7b-hf | ✔ | ✔ | ✔ |
| meta-llama/Llama-2-13b-hf | ✔ | ✔ | ✔ |
| meta-llama/Llama-2-70b-hf | ✔ | ✔ | ✔ |
| tiiuae/falcon-40b | ✔ | ✔ | ✔ |
Detailed recipes can be found HERE.
Notes:
- This model list comes from Intel® Extension for PyTorch (IPEX).
- WOQ INT4 recipes will be published soon.
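Similarly, below is a minimal sketch of Weight-Only Quantization with the same INC 2.x API. The RTN algorithm, bit width, and group size here are illustrative assumptions, not the verified (or forthcoming) recipe settings.

```python
from transformers import AutoModelForCausalLM
from neural_compressor import PostTrainingQuantConfig, quantization

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")  # placeholder

# Weight-only quantization: only weights are quantized, activations stay in
# higher precision. RTN (round-to-nearest) needs no calibration data; the
# published recipes may use other algorithms and settings.
conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # apply to all matched ops
            "weight": {
                "bits": 8,         # set to 4 for WOQ INT4
                "group_size": 32,  # per-group scaling granularity; -1 = per-channel
                "scheme": "sym",
                "algorithm": "RTN",
            },
        },
    },
)
q_model = quantization.fit(model, conf)
q_model.save("./saved_woq_int8")
```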