Caching previous prompts for later reuse, part 1 #3073
This PR implements the most basic form of reusing the result of previous prompt processing:
If n_past is smaller in a subsequent call to prompt() but the new input shares a common prefix with the token cache after n_past, we can reuse the tokens that are already in the KV cache. Comparing the new input against the previous one is much faster than redoing the prompt processing from scratch.
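
The core of that check could look something like the following minimal C++ sketch. The function name countPromptCacheHit and the surrounding types are hypothetical, not llmodel's actual interface:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using Token = int32_t;

// Count how many of the incoming tokens are already present in the KV cache
// at positions nPast, nPast+1, ... and can therefore be skipped. The caller
// advances nPast by the returned amount and only decodes the remainder.
// (Hypothetical sketch; names do not match llmodel's real API.)
size_t countPromptCacheHit(const std::vector<Token> &tokenCache, size_t nPast,
                           const std::vector<Token> &input)
{
    // Tokens before nPast are kept by the caller anyway; only the cached
    // tokens *after* nPast are candidates for reuse.
    size_t avail = tokenCache.size() > nPast ? tokenCache.size() - nPast : 0;
    size_t limit = std::min(input.size(), avail);
    size_t hit = 0;
    while (hit < limit && input[hit] == tokenCache[nPast + hit])
        ++hit;
    return hit;
}
```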
An example of a code path in GPT4All that should already benefit from this is the regenerate button. It is currently implemented inefficiently: it deletes the response and the prompt, then feeds the prompt to the model again. With this change the prompt will be considered a cache hit, and assuming context shifting hasn't interfered, llmodel will know to reuse the KV cache from last time, skipping the entire prompt processing step.
This also means that if we start templating the entire conversation and passing n_past=0 every time, we will not have to redo prompt processing from the beginning, as long as the conversation stays under the context limit.
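
As a hypothetical illustration of that n_past=0 case, using the sketch above with toy token IDs, only the genuinely new tokens would need prompt processing:

```cpp
#include <cstdio>

int main()
{
    // The previous turn as it sits in the token cache (toy token IDs).
    std::vector<Token> cache = {1, 5, 9, 9, 7, 3, 3, 8};
    // The full conversation, re-templated from scratch, plus a new message.
    std::vector<Token> input = {1, 5, 9, 9, 7, 3, 3, 8, 2, 6};

    size_t hit = countPromptCacheHit(cache, /*nPast=*/0, input);
    std::printf("reuse %zu cached tokens, process %zu new ones\n",
                hit, input.size() - hit); // reuse 8, process 2
}
```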
This does not help if we want to prompt the model with something completely different, and then switch back to the previous conversation, since the KV cache will be truncated in between and not saved. I plan to experiment with a more advanced cache in a follow-up PR.
Still needs testing; consider this a WIP.