This repo is being actively updated.
- Blast-Llama-4B is now available on Hugging Face! 🤗
- arXiv version is available!
- The paper is accepted to NeurIPS 2024.
The packages can be installed via

```shell
conda env create --file environment.yml
```

Additionally, install lm-evaluation-harness with the BLAST implementation via

```shell
cd lm-evaluation-harness
pip install -e .
```
Blast-Llama-4B is a Llama-7B model compressed by 50% via the procedure described below. The model can be loaded using the transformers library:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained("cwoolee/blast-llama-4B", trust_remote_code=True)
```
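To give a rough intuition for where the compression comes from, the sketch below builds a block-structured factorized matrix in the spirit of BLAST. Note this is a simplified assumption for illustration, not the paper's exact parameterization: each (i, j) block is formed from a left factor `A_i` shared across its block-row, a right factor `B_j` shared across its block-column, and a small block-specific coupling vector `s_ij`.

```python
import numpy as np

# Schematic sketch (an illustrative assumption, not the paper's exact
# parameterization): each (i, j) block of the matrix is
#     block(i, j) = A_i @ diag(s_ij) @ B_j,
# with A_i shared across block-row i and B_j shared across block-column j.
rng = np.random.default_rng(0)
b, blk, r = 4, 64, 8                     # 4x4 grid of 64x64 blocks, rank 8

A = rng.standard_normal((b, blk, r))     # one left factor per block-row
B = rng.standard_normal((b, r, blk))     # one right factor per block-column
S = rng.standard_normal((b, b, r))       # one coupling vector per block

rows = []
for i in range(b):
    row = [A[i] @ np.diag(S[i, j]) @ B[j] for j in range(b)]
    rows.append(np.hstack(row))
W = np.vstack(rows)                      # assembled 256x256 structured matrix

dense_params = (b * blk) ** 2            # parameters of the dense matrix
blast_params = A.size + B.size + S.size  # parameters of the factorization
print(W.shape, dense_params, blast_params)  # (256, 256) 65536 4224
```

Because the factors are shared across whole block-rows and block-columns, the parameter count drops far below that of the dense matrix.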
Run

```shell
bash ./scripts/decompose_llama.sh 0-31
```
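The decomposition step replaces dense weight matrices with compressed factorizations (the `0-31` argument presumably selects the layer range). As a much-simplified stand-in for what such a step does, here is plain truncated-SVD low-rank compression of a single weight matrix; the actual script fits BLAST factors rather than a vanilla SVD.

```python
import numpy as np

def truncated_svd_compress(W, rank):
    """Compress W into a rank-`rank` factorization U_r @ V_r (illustrative only)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # absorb singular values into the left factor
    V_r = Vt[:rank, :]
    return U_r, V_r

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))       # stand-in for one layer's weight
U_r, V_r = truncated_svd_compress(W, rank=32)

# Storage drops from 256*256 to 2*256*32 parameters (a 4x reduction).
original, compressed = W.size, U_r.size + V_r.size
print(original, compressed)  # 65536 16384
```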
Run

```shell
bash ./scripts/train_blast.sh
```

The script assumes that 4 GPUs are available. We re-trained the compressed Llama model for 400 steps on a subset of the SlimPajama dataset available here.
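After decomposition, the factorized weights are themselves trainable, and re-training recovers accuracy lost to compression. The toy sketch below (not the repo's training code, which uses a language-modeling loss on SlimPajama across 4 GPUs) runs 400 gradient-descent steps so that low-rank factors `U`, `V` better match a target matrix, illustrating the idea of re-training compressed parameters.

```python
import numpy as np

# Toy illustration (an assumption, not the repo's training loop): optimize the
# factors U, V of a compressed weight so U @ V approaches a target W, by
# gradient descent on the squared Frobenius error.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))        # target weight
U = 0.1 * rng.standard_normal((64, 8))   # trainable left factor
V = 0.1 * rng.standard_normal((8, 64))   # trainable right factor

lr = 1e-3
losses = []
for step in range(400):                  # mirrors the 400 re-training steps
    R = U @ V - W                        # residual
    losses.append((R ** 2).sum())        # squared Frobenius error
    U -= lr * (2 * R @ V.T)              # d(loss)/dU
    V -= lr * (2 * U.T @ R)              # d(loss)/dV

print(losses[0], losses[-1])             # the loss decreases over training
```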
Run

```shell
bash scripts/lm-eval-blast.sh
```
This repo is highly inspired by huggingface/transformers and EleutherAI/lm-evaluation-harness.

Please cite our paper if you find this repo or our paper useful:
```bibtex
@inproceedings{
lee2024blast,
title={{BLAST}: Block-Level Adaptive Structured Matrices for Efficient Deep Neural Network Inference},
author={Lee, Changwoo and Kwon, Soo Min and Qu, Qing and Kim, Hun-Seok},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
}
```