support concurrent inference from multiple models #512
Comments
Thanks for the request. Having multiple models in a single engine simultaneously is something we are looking into now. Meanwhile, would having two engines, each with its own model, work for you?
Yes, that should work, assuming the device has enough resources. Is this possible today? Is there an example I can play with?
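For reference, here is a minimal sketch of that interim "two engines" approach, assuming the `@mlc-ai/web-llm` `CreateMLCEngine` API; the model IDs are illustrative, and the device needs enough GPU memory to hold both models at once:

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Illustrative model IDs; substitute any IDs from the prebuilt model list.
const MODEL_A = "Llama-3.1-8B-Instruct-q4f16_1-MLC";
const MODEL_B = "Phi-3.5-mini-instruct-q4f16_1-MLC";

async function main() {
  // One engine per model (loaded sequentially to keep initialization simple).
  const engineA = await CreateMLCEngine(MODEL_A);
  const engineB = await CreateMLCEngine(MODEL_B);

  // Each engine serves its own model, so the two requests run independently.
  const [replyA, replyB] = await Promise.all([
    engineA.chat.completions.create({
      messages: [{ role: "user", content: "Summarize WebGPU in one sentence." }],
    }),
    engineB.chat.completions.create({
      messages: [{ role: "user", content: "Summarize WebGPU in one sentence." }],
    }),
  ]);

  console.log("A:", replyA.choices[0].message.content);
  console.log("B:", replyB.choices[0].message.content);
}

main();
```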
Hi @mikestaub, from npm 0.2.60, a single engine can load multiple models, and the models can process requests concurrently. However, I have not tested the performance benefit (if any) of processing requests simultaneously as opposed to sequentially. Being able to load multiple models definitely brings convenience, though, making the engine behave more like a serving endpoint.
Note: each model can still only process one request at a time (i.e. concurrent batching is not supported). The two main related PRs are:
See web-llm-multi-models.mov
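To make the above concrete, a minimal sketch of the single-engine, multi-model usage, assuming the npm 0.2.60+ API where `CreateMLCEngine` accepts an array of model IDs and each request selects its model via the `model` field (model IDs are illustrative):

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Illustrative model IDs from the prebuilt list.
const MODEL_A = "Llama-3.1-8B-Instruct-q4f16_1-MLC";
const MODEL_B = "Phi-3.5-mini-instruct-q4f16_1-MLC";

async function main() {
  // A single engine loading both models (supported from npm 0.2.60).
  const engine = await CreateMLCEngine([MODEL_A, MODEL_B]);

  // Two requests routed to different models; the engine can process them
  // concurrently, but each model still handles only one request at a time.
  const [replyA, replyB] = await Promise.all([
    engine.chat.completions.create({
      model: MODEL_A,
      messages: [{ role: "user", content: "Explain WebGPU in one sentence." }],
    }),
    engine.chat.completions.create({
      model: MODEL_B,
      messages: [{ role: "user", content: "Explain WebGPU in one sentence." }],
    }),
  ]);

  console.log(MODEL_A, "->", replyA.choices[0].message.content);
  console.log(MODEL_B, "->", replyB.choices[0].message.content);
}

main();
```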
Closing this issue as completed. Feel free to reopen it or open new ones if issues arise!
I would like to stream the response from two different LLMs simultaneously
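A hedged sketch of that use case, combining the multi-model engine from the comments above with `stream: true`, assuming streaming returns an async iterable of OpenAI-style chunks (model IDs, prompt, and the delta callback are illustrative):

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const MODEL_A = "Llama-3.1-8B-Instruct-q4f16_1-MLC"; // illustrative
const MODEL_B = "Phi-3.5-mini-instruct-q4f16_1-MLC"; // illustrative

// Derive the engine type from the factory so the sketch does not depend on
// which engine type name the package exports.
type Engine = Awaited<ReturnType<typeof CreateMLCEngine>>;

// Stream one completion and forward each text delta to a callback.
async function streamFrom(
  engine: Engine,
  model: string,
  prompt: string,
  onDelta: (model: string, text: string) => void,
): Promise<void> {
  const chunks = await engine.chat.completions.create({
    model,
    stream: true,
    messages: [{ role: "user", content: prompt }],
  });
  for await (const chunk of chunks) {
    const delta = chunk.choices[0]?.delta?.content ?? "";
    if (delta) onDelta(model, delta);
  }
}

async function main() {
  const engine = await CreateMLCEngine([MODEL_A, MODEL_B]);
  const prompt = "Write a haiku about GPUs.";

  // Drive both streams at the same time; deltas from the two models
  // interleave as they arrive, so each can update its own piece of UI.
  await Promise.all([
    streamFrom(engine, MODEL_A, prompt, (m, t) => console.log(`[${m}] ${t}`)),
    streamFrom(engine, MODEL_B, prompt, (m, t) => console.log(`[${m}] ${t}`)),
  ]);
}

main();
```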