
Add serving example for model multiplexing using Ray #663

Open
ratnopamc opened this issue Sep 25, 2024 · 0 comments
Add serving example of model multiplexing using Ray.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

What is the outcome that you are trying to reach?

Describe the solution you would like

Model multiplexing is a powerful technique that enables efficient inference serving for Generative AI models. By co-locating multiple models on the same GPU resources, model multiplexing optimizes hardware utilization and reduces inference latency.

Add a serving script using Ray and vLLM that demonstrates model multiplexing on GPUs.
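The core mechanism behind such a script — keeping a bounded set of models resident per replica and evicting the least recently used one — can be sketched in plain Python. This is only an illustration of the multiplexing idea, not Ray's API (in Ray Serve the loader would be an async method decorated with `@serve.multiplexed`, and the requested model id would come from `serve.get_multiplexed_model_id()`); all names below are hypothetical:

```python
from collections import OrderedDict


def load_model(model_id: str) -> str:
    # Hypothetical stand-in for an expensive model load
    # (e.g. constructing a vLLM engine for the given model id).
    return f"model:{model_id}"


class ModelMultiplexer:
    """Keep at most `max_models` models resident, evicting the least
    recently used one, so many models can share one replica/GPU."""

    def __init__(self, max_models: int = 2):
        self.max_models = max_models
        self._models: "OrderedDict[str, str]" = OrderedDict()

    def get_model(self, model_id: str) -> str:
        if model_id in self._models:
            # Cache hit: mark this model as most recently used.
            self._models.move_to_end(model_id)
        else:
            if len(self._models) >= self.max_models:
                # Cache full: evict the least recently used model.
                self._models.popitem(last=False)
            self._models[model_id] = load_model(model_id)
        return self._models[model_id]
```

With `max_models=2`, requesting models "a", "b", then "c" evicts "a", while a later request for "b" is served from the cache without a reload.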

Describe alternatives you have considered

Additional context

@ratnopamc ratnopamc self-assigned this Sep 25, 2024