What is the best way to switch between different LLMs (e.g. Orca-2 (AWQ) and Mistral-7B) on the same GPU (RTX 4090)?
I am calling the /v1/generate endpoint from external software. For each task, I want to run the LLM that I've identified as best for that task (coding is different from, say, email summaries or generating help text). However, the 4090 can't fit all of the models in its 24 GB of VRAM at once. A rough sketch of how we call the endpoint today is shown below.
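For context, this is roughly what our caller looks like (a minimal sketch; the exact request and response schema depends on the OpenLLM version, so treat the payload shape here as an assumption and check your server's /docs page):

```python
import requests

# Base URL and payload shape are assumptions for illustration --
# verify against the OpenAPI docs of your running OpenLLM server.
OPENLLM_URL = "http://localhost:3000/v1/generate"

def generate(prompt: str, **llm_config) -> dict:
    resp = requests.post(
        OPENLLM_URL,
        json={"prompt": prompt, "llm_config": llm_config},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()  # response schema also varies by version

# A coding task and an email-summary task would ideally hit different
# models, but only one of them fits in the 4090's 24 GB at a time.
```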
At the moment we're looking at either "manually" killing openllm and restarting it with a different LLM (openllm start), or else using Docker to do more or less the same thing; see the sketch after this paragraph. It seems wasteful to reload all the shards, or to stop and start Docker containers, just to force the GPU VRAM to be released when switching models.
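The "kill and restart" approach we're prototyping looks roughly like the sketch below (the model ID, port, and the /readyz readiness check are assumptions for illustration; adjust to however your openllm start invocation is configured):

```python
import subprocess
import time
from typing import Optional

import requests

# Rough sketch of the restart-based switching we would like to avoid.
current_proc: Optional[subprocess.Popen] = None

def switch_model(model_id: str) -> None:
    global current_proc
    if current_proc is not None:
        current_proc.terminate()   # frees the VRAM held by the old model
        current_proc.wait()
    current_proc = subprocess.Popen(["openllm", "start", model_id])
    # Wait until the new server answers before routing traffic to it;
    # shards are reloaded from disk on every switch, which is the slow part.
    while True:
        try:
            requests.get("http://localhost:3000/readyz", timeout=2)
            return
        except requests.RequestException:
            time.sleep(5)

switch_model("mistralai/Mistral-7B-Instruct-v0.1")
```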
Can someone point me in the right direction if this has already been solved?
Thank you,
Brad