[Feature] Add support for streaming LLM output #424
Comments
Problems:
1 - let's only stream to the frontend the last output, i.e. the one the user will read.
Take a look at this project; it's based on LangChain & FastAPI and might help you get some ideas.
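For reference, a rough sketch of what that approach could look like, assuming a FastAPI WebSocket endpoint and a LangChain async callback handler; the class and endpoint names here are hypothetical and not the Cat's actual API. Attaching the handler only to the LLM call that produces the final answer is what keeps intermediate chain output from being streamed to the frontend.

```python
from fastapi import FastAPI, WebSocket
from langchain.callbacks.base import AsyncCallbackHandler
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

app = FastAPI()


class WebSocketTokenHandler(AsyncCallbackHandler):
    """Push each newly generated token over the open WebSocket."""

    def __init__(self, websocket: WebSocket):
        self.websocket = websocket

    async def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Called by LangChain for every token when streaming=True
        await self.websocket.send_json({"type": "token", "content": token})


@app.websocket("/ws")
async def chat(websocket: WebSocket):
    await websocket.accept()
    user_message = await websocket.receive_text()

    # Attach the streaming handler only to the LLM that produces the
    # final answer, so intermediate chain output never reaches the user.
    llm = ChatOpenAI(streaming=True, callbacks=[WebSocketTokenHandler(websocket)])
    result = await llm.agenerate([[HumanMessage(content=user_message)]])

    # Send the complete message at the end so the client can finalize the chat bubble.
    await websocket.send_json(
        {"type": "final", "content": result.generations[0][0].text}
    )
```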
@AlessandroSpallina @KevinZhang19870314 I'm reading some code. Let's see how much of a refactor this could be.
This was solved with #492.
Is your feature request related to a problem? Please describe.
As a user who chats with the Cat using self-hosted LLMs, I would like a smoother experience than the one I have now. Self-hosted LLMs are generally slower than those hosted behind a premium subscription, and waiting tens of seconds before the first word appears kills the user experience and the user's patience. According to my inference timing tests, a single LLM inference on my hardware takes 8-19 seconds with llama 2 13b (4- to 8-bit quantization). As someone chatting with a bot, I would rather start reading the output as the LLM produces it than wait 19 seconds and then get the full block of text.
Describe the solution you'd like
Enable support for streaming output; LangChain already supports this option.
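As a minimal illustration that streaming is already exposed by LangChain (the model class and callback below are just examples, not a proposed implementation for the Cat):

```python
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

# streaming=True makes the model emit tokens as they are generated,
# and the stdout handler prints each one immediately.
llm = ChatOpenAI(
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()],
    temperature=0,
)

# The reply appears word by word instead of as one block after ~20 seconds.
llm([HumanMessage(content="Explain token streaming in one sentence.")])
```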
Describe alternatives you've considered
An alternative would be to use lighter models, but this would affect the LLM's performance, and I don't want that.