Add prefix caching #581
Conversation
@tgaddair Hi, do you have any documentation for this feature?
@prd-tuong-nguyen Not yet, but it's coming! For now, I would say this is an experimental feature that can be helpful when you want to ask multiple questions over the same large input (like a long document). Here's a blog post going over the technical details of how it works: https://flashinfer.ai/2024/02/02/cascade-inference.html
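To make that use case concrete, here is a minimal client-side sketch (not part of this PR), assuming a LoRAX server with prefix caching enabled is reachable at http://127.0.0.1:8080; the file name and prompt format are illustrative only.

```python
# Minimal sketch: ask several questions over the same large document.
# With prefix caching, the KV cache for the shared document prefix can be
# reused across requests instead of being recomputed each time.
import requests

LONG_DOCUMENT = open("report.txt").read()  # illustrative: the same large input for every question

questions = [
    "Summarize the key findings.",
    "List any action items mentioned.",
    "Who are the stakeholders referenced?",
]

for question in questions:
    resp = requests.post(
        "http://127.0.0.1:8080/generate",
        json={
            # Every request shares the same long prefix; only the short
            # question suffix changes between requests.
            "inputs": f"{LONG_DOCUMENT}\n\nQuestion: {question}\nAnswer:",
            "parameters": {"max_new_tokens": 128},
        },
    )
    resp.raise_for_status()
    print(resp.json()["generated_text"])
```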
@tgaddair Yeah, thank you. It's an awesome feature bro. I think it will save a significant amount of inference time.
Hi @tgaddair
Thanks for reporting @prd-tuong-nguyen! The second issue is known and being tracked by the FlashInfer team: flashinfer-ai/flashinfer#455. It looks like there's a workaround I can look to add, however. The first issue is new. Can you file an issue and I'll take a look?
Hey predibase team, nice PR, cool to see that you're using some of TGI's latest features! We appreciate the acknowledgements in the README, but we were wondering if you could add attribution somewhere/link to the original PR when you adapt these features (or mention in the README that it's based on the latest TGI, not 0.9.4!). Thanks a lot for your work on lorax!
Hey @OlivierDehaene, apologies for that. More than happy to reference specific PRs when pulling upstream changes going forward!
Usage:
Note that this relies on FlashInfer, which is not yet baked into the prebuilt Docker image (see the launch sketch below).
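As a rough illustration only: the snippet below sketches how one might launch the server with the feature turned on. The `--prefix-caching` flag name and the model ID are assumptions, not confirmed by this PR; check `lorax-launcher --help` for the actual option, and make sure FlashInfer is installed in the environment.

```python
# Hedged launch sketch: the flag name below is an assumption, not taken from
# this PR. Verify the real option with `lorax-launcher --help`. FlashInfer
# must be installed separately since it is not in the prebuilt image.
import subprocess

subprocess.run(
    [
        "lorax-launcher",
        "--model-id", "mistralai/Mistral-7B-Instruct-v0.1",  # any supported model
        "--prefix-caching",  # assumed flag for enabling this feature
    ],
    check=True,
)
```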
Adapted from huggingface/text-generation-inference#2402