What is the best way to switch between different LLMs (e.g. Orca-2 (AWQ) and Mistral-7B) on the same GPU (RTX 4090)?
I am calling the /v1/generate endpoint from external software. For each task, I want to run the LLM that I've identified as best for that task (coding is different from, say, email summaries or generating help text). However, the 4090 can't fit all of the models in its 24 GB of VRAM at once. A rough sketch of how we call the endpoint today is shown below.
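For context, this is roughly what our caller looks like (a minimal sketch; the exact request and response schema depends on the OpenLLM version, so treat the payload shape here as an assumption and check your server's /docs page):

```python
import requests

# Base URL and payload shape are assumptions for illustration --
# verify against the OpenAPI docs of your running OpenLLM server.
OPENLLM_URL = "http://localhost:3000/v1/generate"

def generate(prompt: str, **llm_config) -> dict:
    resp = requests.post(
        OPENLLM_URL,
        json={"prompt": prompt, "llm_config": llm_config},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()  # response schema also varies by version

# A coding task and an email-summary task would ideally hit different
# models, but only one of them fits in the 4090's 24 GB at a time.
```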
At the moment we're looking at either "manually" killing openllm and restarting it with a different LLM (openllm start), or else using Docker to do more or less the same thing; see the sketch after this paragraph. It seems wasteful to reload all the shards, or to stop and start Docker containers, just to force the GPU VRAM to be released when switching models.
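The "kill and restart" approach we're prototyping looks roughly like the sketch below (the model ID, port, and the /readyz readiness check are assumptions for illustration; adjust to however your openllm start invocation is configured):

```python
import subprocess
import time
from typing import Optional

import requests

# Rough sketch of the restart-based switching we would like to avoid.
current_proc: Optional[subprocess.Popen] = None

def switch_model(model_id: str) -> None:
    global current_proc
    if current_proc is not None:
        current_proc.terminate()   # frees the VRAM held by the old model
        current_proc.wait()
    current_proc = subprocess.Popen(["openllm", "start", model_id])
    # Wait until the new server answers before routing traffic to it;
    # shards are reloaded from disk on every switch, which is the slow part.
    while True:
        try:
            requests.get("http://localhost:3000/readyz", timeout=2)
            return
        except requests.RequestException:
            time.sleep(5)

switch_model("mistralai/Mistral-7B-Instruct-v0.1")
```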
Can someone point me in the right direction if this has already been solved?
Thank you,
Brad