Your current environment
Offline inference results for Llama-3-8B with benchmark_latency.py, sweeping over 1, 2, and 4 cards:
And the optimum-habana results:
The results show that on a single card vLLM outperforms optimum-habana. With multi-card inference, however, the gain from tensor parallelism in vLLM is too small, so its performance falls behind optimum-habana.
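For reference, a minimal sketch of the kind of sweep described above, invoking vLLM's benchmarks/benchmark_latency.py once per tensor-parallel size. The input/output lengths and batch size here are assumptions for illustration, not the exact settings behind the numbers reported in this issue:

```python
# Sketch: sweep benchmark_latency.py over tensor-parallel sizes 1, 2, 4.
# Assumption: run from the vLLM repo root; lengths/batch size are illustrative.
import subprocess

for tp_size in (1, 2, 4):
    subprocess.run(
        [
            "python", "benchmarks/benchmark_latency.py",
            "--model", "meta-llama/Meta-Llama-3-8B",
            "--tensor-parallel-size", str(tp_size),
            "--input-len", "128",
            "--output-len", "128",
            "--batch-size", "1",
        ],
        check=True,  # stop the sweep if any run fails
    )
```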
How would you like to use vllm
I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.