This repository showcases how to load and run a 4-bit quantized Meta-Llama-3.1-8B-Instruct model for text generation with Hugging Face Transformers. Alongside the notebook, the repository includes a script that walks you through quantizing the model with the NF4 scheme introduced by QLoRA, as implemented in bitsandbytes. Both the inference and quantization steps are designed to run even on free-tier Google Colab GPUs.
The NF4-quantized model uses just under 6 GB of VRAM, making it feasible to load and run inference on free-tier Google Colab GPUs. This quantization significantly reduces resource requirements while largely preserving the model's performance.
Model card: fsaudm/Meta-Llama-3.1-8B-Instruct-NF4
Loading times on Colab:
- tokenizer: 3.87 seconds
- model: 221.83 seconds (download included)
To get started with text generation and quantization, install the required dependencies:
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U bitsandbytes
You'll need a Hugging Face access token to download the models; you can create one at https://huggingface.co/settings/tokens. Hugging Face hosts a vast collection of models, datasets, and tools for machine learning, so it's well worth exploring.
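The exact loading code is in the notebook; a minimal sketch along these lines should work. The quantization config is saved with the model on the Hub, so no extra BitsAndBytesConfig should be needed at load time (the token placeholder below is yours to replace):

from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer

login(token="hf_your_token_here")  # replace with your Hugging Face access token

model_id = "fsaudm/Meta-Llama-3.1-8B-Instruct-NF4"

# Load the tokenizer and the pre-quantized 4-bit model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Roughly 6 GB for the NF4 weights
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")

With the model and tokenizer loaded, generating text is as simple as: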
prompt = "Talk to me"
generate_response(prompt, model, tokenizer, max_length=100)
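generate_response is a small helper defined in the notebook. Its exact implementation may differ, but it is roughly equivalent to the following sketch:

import torch

def generate_response(prompt, model, tokenizer, max_length=100):
    # Tokenize the prompt and move the tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Generate up to max_length tokens without tracking gradients
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=max_length)
    # Decode the generated tokens back into text
    return tokenizer.decode(outputs[0], skip_special_tokens=True)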
The Llama-3.1 8B model was quantized to 4-bit precision with the NF4 data type introduced in the QLoRA paper, using the bitsandbytes implementation. The script lets you replicate the quantization on this and other models with the same 4-bit configuration. The quantization_config used:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # run compute in bfloat16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    llm_int8_enable_fp32_cpu_offload=True,  # allow offloading fp32 modules to CPU if needed
)
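A minimal sketch of how this config can be applied to quantize the base model yourself. Access to Meta's gated base repository is required, the push-to-Hub step is optional, and the target repo name is a placeholder:

from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Load the full-precision weights and quantize them on the fly
# using the quantization_config defined above
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Optionally push the quantized model to your own Hub repository
model.push_to_hub("your-username/Meta-Llama-3.1-8B-Instruct-NF4")
tokenizer.push_to_hub("your-username/Meta-Llama-3.1-8B-Instruct-NF4")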
You can explore this and other quantized models on the Hugging Face Hub.