support concurrent inference from multiple models #512
Comments
Thanks for the request. Having multiple models in a single engine simultaneously is something we are looking into now. Meanwhile, would having two engines, each with its own model, work for you?
Yes, that should work, assuming the device has enough resources. Is this possible today? Is there an example I can play with?
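For reference, here is a minimal sketch of that interim "two engines" approach, assuming the `@mlc-ai/web-llm` `CreateMLCEngine` API; the model IDs are illustrative, and the device needs enough GPU memory to hold both models at once:

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Illustrative model IDs; substitute any IDs from the prebuilt model list.
const MODEL_A = "Llama-3.1-8B-Instruct-q4f16_1-MLC";
const MODEL_B = "Phi-3.5-mini-instruct-q4f16_1-MLC";

async function main() {
  // One engine per model (loaded sequentially to keep initialization simple).
  const engineA = await CreateMLCEngine(MODEL_A);
  const engineB = await CreateMLCEngine(MODEL_B);

  // Each engine serves its own model, so the two requests run independently.
  const [replyA, replyB] = await Promise.all([
    engineA.chat.completions.create({
      messages: [{ role: "user", content: "Summarize WebGPU in one sentence." }],
    }),
    engineB.chat.completions.create({
      messages: [{ role: "user", content: "Summarize WebGPU in one sentence." }],
    }),
  ]);

  console.log("A:", replyA.choices[0].message.content);
  console.log("B:", replyB.choices[0].message.content);
}

main();
```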
Hi @mikestaub, from npm 0.2.60, a single engine can load multiple models, and the models can process requests concurrently. However, I have not tested the performance benefit (if any) of processing requests simultaneously as opposed to sequentially. Being able to load multiple models definitely brings convenience, though, making the engine behave more like a serving endpoint.
Note: each model can still only process one request at a time (i.e. concurrent batching is not supported). The two main related PRs are:
See web-llm-multi-models.mov
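To make the above concrete, a minimal sketch of the single-engine, multi-model usage, assuming the npm 0.2.60+ API where `CreateMLCEngine` accepts an array of model IDs and each request selects its model via the `model` field (model IDs are illustrative):

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Illustrative model IDs from the prebuilt list.
const MODEL_A = "Llama-3.1-8B-Instruct-q4f16_1-MLC";
const MODEL_B = "Phi-3.5-mini-instruct-q4f16_1-MLC";

async function main() {
  // A single engine loading both models (supported from npm 0.2.60).
  const engine = await CreateMLCEngine([MODEL_A, MODEL_B]);

  // Two requests routed to different models; the engine can process them
  // concurrently, but each model still handles only one request at a time.
  const [replyA, replyB] = await Promise.all([
    engine.chat.completions.create({
      model: MODEL_A,
      messages: [{ role: "user", content: "Explain WebGPU in one sentence." }],
    }),
    engine.chat.completions.create({
      model: MODEL_B,
      messages: [{ role: "user", content: "Explain WebGPU in one sentence." }],
    }),
  ]);

  console.log(MODEL_A, "->", replyA.choices[0].message.content);
  console.log(MODEL_B, "->", replyB.choices[0].message.content);
}

main();
```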
Closing this issue as completed. Feel free to reopen it or open new ones if issues arise!
I would like to stream the response from two different LLMs simultaneously
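A hedged sketch of that use case, combining the multi-model engine from the comments above with `stream: true`, assuming streaming returns an async iterable of OpenAI-style chunks (model IDs, prompt, and the delta callback are illustrative):

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const MODEL_A = "Llama-3.1-8B-Instruct-q4f16_1-MLC"; // illustrative
const MODEL_B = "Phi-3.5-mini-instruct-q4f16_1-MLC"; // illustrative

// Derive the engine type from the factory so the sketch does not depend on
// which engine type name the package exports.
type Engine = Awaited<ReturnType<typeof CreateMLCEngine>>;

// Stream one completion and forward each text delta to a callback.
async function streamFrom(
  engine: Engine,
  model: string,
  prompt: string,
  onDelta: (model: string, text: string) => void,
): Promise<void> {
  const chunks = await engine.chat.completions.create({
    model,
    stream: true,
    messages: [{ role: "user", content: prompt }],
  });
  for await (const chunk of chunks) {
    const delta = chunk.choices[0]?.delta?.content ?? "";
    if (delta) onDelta(model, delta);
  }
}

async function main() {
  const engine = await CreateMLCEngine([MODEL_A, MODEL_B]);
  const prompt = "Write a haiku about GPUs.";

  // Drive both streams at the same time; deltas from the two models
  // interleave as they arrive, so each can update its own piece of UI.
  await Promise.all([
    streamFrom(engine, MODEL_A, prompt, (m, t) => console.log(`[${m}] ${t}`)),
    streamFrom(engine, MODEL_B, prompt, (m, t) => console.log(`[${m}] ${t}`)),
  ]);
}

main();
```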