Caching previous prompts for later reuse, part 1 #3073
This PR implements the most basic form of reusing the result of previous prompt processing:
If n_past is smaller in a subsequent call to prompt() but the new input shares a common prefix with the token cache after n_past, we can reuse the tokens that are already in the KV cache. Comparing the new input against the previous one is much faster than redoing the prompt processing from scratch.
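
The core of that check could look something like the following minimal C++ sketch. The function name countPromptCacheHit and the surrounding types are hypothetical, not llmodel's actual interface:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using Token = int32_t;

// Count how many of the incoming tokens are already present in the KV cache
// at positions nPast, nPast+1, ... and can therefore be skipped. The caller
// advances nPast by the returned amount and only decodes the remainder.
// (Hypothetical sketch; names do not match llmodel's real API.)
size_t countPromptCacheHit(const std::vector<Token> &tokenCache, size_t nPast,
                           const std::vector<Token> &input)
{
    // Tokens before nPast are kept by the caller anyway; only the cached
    // tokens *after* nPast are candidates for reuse.
    size_t avail = tokenCache.size() > nPast ? tokenCache.size() - nPast : 0;
    size_t limit = std::min(input.size(), avail);
    size_t hit = 0;
    while (hit < limit && input[hit] == tokenCache[nPast + hit])
        ++hit;
    return hit;
}
```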
An example of a code path in GPT4All that should already benefit from this is the regenerate button. It is currently implemented inefficiently: it deletes the response and the prompt, then feeds the prompt to the model again. With this change the prompt will be considered a cache hit, and assuming context shifting hasn't interfered, llmodel will know to reuse the KV cache from last time, skipping the entire prompt processing step.
This also means that if we start templating the entire conversation and passing n_past=0 every time, we will not have to redo prompt processing from the beginning, as long as the conversation stays under the context limit.
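
As a hypothetical illustration of that n_past=0 case, using the sketch above with toy token IDs, only the genuinely new tokens would need prompt processing:

```cpp
#include <cstdio>

int main()
{
    // The previous turn as it sits in the token cache (toy token IDs).
    std::vector<Token> cache = {1, 5, 9, 9, 7, 3, 3, 8};
    // The full conversation, re-templated from scratch, plus a new message.
    std::vector<Token> input = {1, 5, 9, 9, 7, 3, 3, 8, 2, 6};

    size_t hit = countPromptCacheHit(cache, /*nPast=*/0, input);
    std::printf("reuse %zu cached tokens, process %zu new ones\n",
                hit, input.size() - hit); // reuse 8, process 2
}
```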
This does not help if we want to prompt the model with something completely different, and then switch back to the previous conversation, since the KV cache will be truncated in between and not saved. I plan to experiment with a more advanced cache in a follow-up PR.
Still needs testing; consider this a WIP.