Caching previous prompts for later reuse, part 1 #3073

Draft
cebtenzzre wants to merge 6 commits into main

Conversation

cebtenzzre (Member)

This PR implements the most basic form of reusing the result of previous prompt processing:

If n_past is smaller in a subsequent call to prompt() but the new input shares a common prefix with the token cache starting at n_past, we can reuse the tokens that are already in the KV cache. Comparing the new input with the previous one is much faster than redoing prompt processing from scratch.
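
For illustration, here is a minimal sketch of that prefix comparison. The helper name, the `Token` alias, and the vector-based token cache are assumptions made for this example, not the actual llmodel API:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Token = int32_t;

// Hypothetical helper: returns the new n_past after reusing whatever prefix
// of `input` already matches `cachedTokens` at positions >= n_past.
size_t reuseCachedPrefix(const std::vector<Token> &cachedTokens,
                         const std::vector<Token> &input,
                         size_t n_past)
{
    size_t i = 0;
    while (n_past + i < cachedTokens.size()
           && i < input.size()
           && cachedTokens[n_past + i] == input[i]) {
        ++i;
    }
    // Tokens in [n_past, n_past + i) are a cache hit; only the remainder of
    // `input` still needs prompt processing. Any KV cache entries beyond
    // n_past + i would have to be discarded before decoding continues.
    return n_past + i;
}
```

In this sketch, the caller would only feed the tokens of `input` past the matched prefix to the decoder, which is where the savings described above come from.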

An example of a code path in GPT4All that should already benefit from this is the regenerate button, which is currently inefficient: it deletes the response and the prompt, then feeds the prompt to the model again. With this change, the prompt will be considered a cache hit, and assuming that context shifting hasn't interfered, llmodel will know to reuse the KV cache from last time, skipping the entire prompt processing step.

This also means that if we start templating the entire conversation and using n_past=0 every time, we will not have to do prompt processing from the beginning, assuming the conversation length stays under the context limit.

This does not help if we want to prompt the model with something completely different, and then switch back to the previous conversation, since the KV cache will be truncated in between and not saved. I plan to experiment with a more advanced cache in a follow-up PR.

Still needs testing; consider this WIP.

Signed-off-by: Jared Van Bortel <[email protected]>
The value of n_past is no longer used here, so remove the parameter and
the logic that tries to maintain it.

Signed-off-by: Jared Van Bortel <[email protected]>
These were taking up too many lines, were too repetitive, and weren't
marked [[noreturn]] even though they all throw unconditionally.

Signed-off-by: Jared Van Bortel <[email protected]>
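
As a generic illustration of the pattern the commit message above describes (the helper name and error text here are invented, not taken from the PR diff), a throwing helper marked [[noreturn]] might look like this:

```cpp
#include <stdexcept>
#include <string>

// A helper that always throws can be marked [[noreturn]]; callers then avoid
// repetitive throw statements and unreachable return paths, and the compiler
// knows control never continues past the call.
[[noreturn]] static void throwInvalidArg(const std::string &what)
{
    throw std::invalid_argument(what);
}
```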
Writing to these directly behind the implementation's back is
ill-defined. Trying to write internal save/restore for `tokens` hurts my
brain when the chat UI is trying to manage this directly.

Signed-off-by: Jared Van Bortel <[email protected]>
If n_past is smaller in a successive call to prompt() but the input
shares a common prefix with the token cache after n_past, we can reuse
the tokens that are already in the KV cache.

Signed-off-by: Jared Van Bortel <[email protected]>