System Info
docker image: ghcr.io/predibase/lorax:07addea (the main image isn't working on the latest drivers)
device: Nvidia A100 80GB
models in use: meta-llama/Meta-Llama-3.1-8B-Instruct and hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
loras: fine-tuned with LLaMA-Factory
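For reference, the server was launched roughly along the lines of the standard LoRAX docker invocation below; the port, volume path, and the quantization note are assumptions on my part, not copied from the actual setup:

model=meta-llama/Meta-Llama-3.1-8B-Instruct
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/predibase/lorax:07addea \
    --model-id $model
# For the AWQ run, point --model-id at hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
# (and pass --quantize awq if the quantization isn't picked up automatically).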
Information
Docker
The CLI directly
Tasks
An officially supported command
My own modifications
Reproduction
When testing with a batch of 6 concurrent requests (see the benchmark sketch after this list):
Base model meta-llama/Meta-Llama-3.1-8B-Instruct takes 17-20 ms/token, i.e. ~55 tokens/sec
AWQ quantized model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 takes 23-27 ms/token, i.e. ~48 tokens/sec
With 3 loras (2 requests per lora), the base model above takes 38-46 ms/token, i.e. ~25 tokens/sec
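A minimal sketch of the kind of harness behind these numbers, assuming a LoRAX server on localhost:8080, a fixed token budget, and placeholder adapter names (the actual prompts and script differ):

import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://127.0.0.1:8080/generate"   # assumed local LoRAX endpoint
MAX_NEW_TOKENS = 256
PROMPT = "Write a short story about a robot."

# Base/AWQ run: no adapter. LoRA run: 2 requests per adapter (placeholder names).
ADAPTERS = [None] * 6
# ADAPTERS = ["lora-a", "lora-a", "lora-b", "lora-b", "lora-c", "lora-c"]

def one_request(adapter_id):
    params = {"max_new_tokens": MAX_NEW_TOKENS}
    if adapter_id is not None:
        params["adapter_id"] = adapter_id
    start = time.perf_counter()
    resp = requests.post(URL, json={"inputs": PROMPT, "parameters": params}, timeout=600)
    resp.raise_for_status()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=len(ADAPTERS)) as pool:
    latencies = list(pool.map(one_request, ADAPTERS))

# Rough per-request numbers, assuming every request generates the full token budget.
for t in latencies:
    ms_per_token = 1000 * t / MAX_NEW_TOKENS
    print(f"{ms_per_token:.1f} ms/token, {1000 / ms_per_token:.1f} tokens/sec")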
Expected behavior
The AWQ quantized model is slower than the base model:
I was expecting it to be at least somewhat faster than the base model, but it's actually slower. The memory footprint is smaller, though: the base model took 20.2 GB to load while the AWQ model took 12.2 GB (measured before the rest of the memory was reserved). For the same models (base, AWQ quantized), the throughput on sglang (with cuda graph, radix and the marlin_awq kernel) is 78 tokens/sec and 150 tokens/sec respectively, compared to 55 and 48 tokens/sec here.
Loras are almost twice as slow as the base model:
I was expecting it to be slower than the base model, but getting ~30 tokens/sec (when the base does ~60 tokens/sec) on an Nvidia A100 for Llama-3.1-8B-Instruct + lora was surprising. sglang also added lora support recently, but it doesn't support any optimizations yet and gives an abysmal ~10 tokens/sec with loras.