
[Feature] Add support for streaming LLM output #424

Closed
AlessandroSpallina opened this issue Aug 21, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@AlessandroSpallina
Member

Is your feature request related to a problem? Please describe.
As a user who wants to chat with the Cat using self-hosted LLMs, I would like a smoother experience than the one I have now. Self-hosted LLMs are generally slower than those behind a premium subscription, and waiting tens of seconds before the first word appears kills the user experience and the user's patience. According to my inference timing tests, a single LLM inference on my hardware takes 8-19 seconds with Llama 2 13B (4- to 8-bit quantization). As a person chatting with a bot, I would rather start reading the output as the LLM produces it than wait 19 seconds and then get the full block of text.

Describe the solution you'd like
Enable support for streaming output; LangChain already supports this option.

Describe alternatives you've considered
An alternative would be to use lighter models, but this would hurt the LLM's output quality, which I don't want.
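
For illustration, here is a minimal sketch of token-level streaming with LangChain's callback interface, as it worked in the langchain 0.0.x releases of that period. It assumes an OpenAI-compatible endpoint for the self-hosted model, and `WebSocketTokenHandler` is a hypothetical name, not part of the Cat's codebase:

```python
# Hedged sketch: stream tokens from a LangChain chat model via a callback.
from langchain.callbacks.base import BaseCallbackHandler
from langchain.chat_models import ChatOpenAI


class WebSocketTokenHandler(BaseCallbackHandler):
    """Forward each generated token to the client as soon as it arrives."""

    def __init__(self, send_token):
        self.send_token = send_token  # e.g. a websocket send function

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Called once per token when the LLM is created with streaming=True.
        self.send_token(token)


llm = ChatOpenAI(
    streaming=True,
    callbacks=[WebSocketTokenHandler(print)],  # print stands in for a real sink
    temperature=0,
)
llm.predict("Why is streaming output nicer for slow, self-hosted models?")
```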

@AlessandroSpallina AlessandroSpallina added the enhancement New feature or request label Aug 21, 2023
@pieroit
Member

pieroit commented Aug 21, 2023

Problems:
1 - the Cat contains an agent, so every reply may contain between 1 and 3 different generations. Which one would you stream, and how?
2 - not all LLMs and their APIs support streaming.

@AlessandroSpallina
Member Author

1 - let's stream to the frontend only the last output, i.e. the one the user will read.
2 - if streaming is not supported, and only in that case, the user sees the full block of text directly; otherwise the output is streamed.
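
A minimal sketch of this fallback behaviour, assuming a websocket connection to the frontend. Only the final, user-facing generation is streamed; if the model cannot stream, the whole reply is sent in one message. All names here (`llm_supports_streaming`, `stream_final_answer`, `complete_final_answer`) are hypothetical stand-ins, not existing Cat functions:

```python
from typing import AsyncIterator


def llm_supports_streaming() -> bool:
    return True  # stub: in practice, inspect the configured LLM / its API


async def stream_final_answer(prompt: str) -> AsyncIterator[str]:
    for token in ["Hello", ",", " world", "!"]:  # stub token source
        yield token


async def complete_final_answer(prompt: str) -> str:
    return "Hello, world!"  # stub blocking completion


async def send_final_answer(websocket, prompt: str) -> None:
    if llm_supports_streaming():
        # Stream only the last generation, token by token.
        async for token in stream_final_answer(prompt):
            await websocket.send_json({"type": "chat_token", "content": token})
        await websocket.send_json({"type": "chat_end"})
    else:
        # Fallback: no streaming support, the user gets the whole block at once.
        reply = await complete_final_answer(prompt)
        await websocket.send_json({"type": "chat", "content": reply})
```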

@KevinZhang19870314

Take a look at this project; it is based on LangChain & FastAPI and might help you get some ideas.

streaming_mode – The streaming mode to use for the route.
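
For reference, a generic FastAPI pattern for streaming model output with server-sent events looks roughly like this. This is a hedged sketch, not the linked project's API, and `fake_llm_tokens` stands in for a real streaming LLM call:

```python
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def fake_llm_tokens(prompt: str):
    # Stand-in generator for a real streaming LLM call.
    for token in ["Streaming", " ", "works", "!"]:
        await asyncio.sleep(0.1)
        yield f"data: {token}\n\n"


@app.get("/chat")
async def chat(prompt: str):
    # text/event-stream lets the client render tokens as they arrive.
    return StreamingResponse(fake_llm_tokens(prompt), media_type="text/event-stream")
```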

@pieroit
Member

pieroit commented Aug 22, 2023

@AlessandroSpallina @KevinZhang19870314 I'm reading some code. Let's see how much of a refactor this would be.

@nicola-corbellini
Member

This was solved with #492.
