
Add serving example for model multiplexing using Ray #663

Open
ratnopamc opened this issue Sep 25, 2024 · 0 comments
Add serving example of model multiplexing using Ray.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

What is the outcome that you are trying to reach?

Describe the solution you would like

Model multiplexing is a powerful technique that enables efficient inference serving for Generative AI models. By co-locating multiple models on the same GPU resources, model multiplexing optimizes hardware utilization and reduces inference latency.

Add a serving script using Ray and vLLM that demonstrates model multiplexing on GPUs.
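The core mechanism behind such a script — keeping a bounded set of models resident per replica and evicting the least recently used one — can be sketched in plain Python. This is only an illustration of the multiplexing idea, not Ray's API (in Ray Serve the loader would be an async method decorated with `@serve.multiplexed`, and the requested model id would come from `serve.get_multiplexed_model_id()`); all names below are hypothetical:

```python
from collections import OrderedDict


def load_model(model_id: str) -> str:
    # Hypothetical stand-in for an expensive model load
    # (e.g. constructing a vLLM engine for the given model id).
    return f"model:{model_id}"


class ModelMultiplexer:
    """Keep at most `max_models` models resident, evicting the least
    recently used one, so many models can share one replica/GPU."""

    def __init__(self, max_models: int = 2):
        self.max_models = max_models
        self._models: "OrderedDict[str, str]" = OrderedDict()

    def get_model(self, model_id: str) -> str:
        if model_id in self._models:
            # Cache hit: mark this model as most recently used.
            self._models.move_to_end(model_id)
        else:
            if len(self._models) >= self.max_models:
                # Cache full: evict the least recently used model.
                self._models.popitem(last=False)
            self._models[model_id] = load_model(model_id)
        return self._models[model_id]
```

With `max_models=2`, requesting models "a", "b", then "c" evicts "a", while a later request for "b" is served from the cache without a reload.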

Describe alternatives you have considered

Additional context

@ratnopamc ratnopamc self-assigned this Sep 25, 2024