[Feature] Add support for streaming LLM output #424
Comments
Problems:
1 - let's only stream to the frontend the last output, i.e. the one the user will read.
Take a look at this project; it's based on LangChain & FastAPI and might help you get some ideas.
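For reference, a rough sketch of what that approach could look like, assuming a FastAPI WebSocket endpoint and a LangChain async callback handler; the class and endpoint names here are hypothetical and not the Cat's actual API. Attaching the handler only to the LLM call that produces the final answer is what keeps intermediate chain output from being streamed to the frontend.

```python
from fastapi import FastAPI, WebSocket
from langchain.callbacks.base import AsyncCallbackHandler
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

app = FastAPI()


class WebSocketTokenHandler(AsyncCallbackHandler):
    """Push each newly generated token over the open WebSocket."""

    def __init__(self, websocket: WebSocket):
        self.websocket = websocket

    async def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Called by LangChain for every token when streaming=True
        await self.websocket.send_json({"type": "token", "content": token})


@app.websocket("/ws")
async def chat(websocket: WebSocket):
    await websocket.accept()
    user_message = await websocket.receive_text()

    # Attach the streaming handler only to the LLM that produces the
    # final answer, so intermediate chain output never reaches the user.
    llm = ChatOpenAI(streaming=True, callbacks=[WebSocketTokenHandler(websocket)])
    result = await llm.agenerate([[HumanMessage(content=user_message)]])

    # Send the complete message at the end so the client can finalize the chat bubble.
    await websocket.send_json(
        {"type": "final", "content": result.generations[0][0].text}
    )
```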
@AlessandroSpallina @KevinZhang19870314 I'm reading some code. Let's see how much of a refactor this could be.
This was solved with #492.
Is your feature request related to a problem? Please describe.
As a user who chats with the Cat using self-hosted LLMs, I would like a smoother experience than the one I have now. Self-hosted LLMs are generally slower than those hosted behind a premium subscription, and waiting tens of seconds before the first word appears kills the user experience and the user's patience. According to my inference timing tests, a single LLM inference on my hardware takes 8-19 seconds with llama 2 13b (4- to 8-bit quantization). As someone chatting with a bot, I would rather start reading the output as the LLM produces it than wait 19 seconds and then get the full block of text.
Describe the solution you'd like
Enable support for streaming output; LangChain already supports this option.
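As a minimal illustration that streaming is already exposed by LangChain (the model class and callback below are just examples, not a proposed implementation for the Cat):

```python
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

# streaming=True makes the model emit tokens as they are generated,
# and the stdout handler prints each one immediately.
llm = ChatOpenAI(
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()],
    temperature=0,
)

# The reply appears word by word instead of as one block after ~20 seconds.
llm([HumanMessage(content="Explain token streaming in one sentence.")])
```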
Describe alternatives you've considered
An alternative would be to use lighter models, but this would affect the LLM's performance, and I don't want that.