[BUG]: Tokenization in 0.14.0 adds spaces #856
Comments
If you're seeing the wrong behaviour in llama-tokenize.exe, this looks like it's probably an upstream bug?
Yes, indeed!
The behavior where the tokenizer adds a space to the first non-special token can be customized via a key in the GGUF metadata.
An acceptable workaround: changing the KV metadata in the GGUF file via a Python script works wonders (using a modified script). However, trying a KV override via LLamaSharp did not work. This is an upstream bug, as it is reproducible with llama.cpp's own tokenizer.
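For reference, the GGUF KV section is simple enough to patch without external dependencies. Below is a minimal sketch of such a Python workaround; it assumes the relevant boolean key is `tokenizer.ggml.add_space_prefix` (an assumption based on llama.cpp's metadata naming, not confirmed in this thread) and follows the public GGUF layout (magic, version, tensor count, KV count, then KV entries):

```python
import struct

# GGUF scalar value types and their on-disk sizes (per the GGUF spec)
_SCALAR_SIZE = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
GGUF_TYPE_BOOL = 7
GGUF_TYPE_STRING = 8
GGUF_TYPE_ARRAY = 9

def _skip_value(buf: bytearray, pos: int, vtype: int) -> int:
    """Return the offset just past a KV value of the given type."""
    if vtype in _SCALAR_SIZE:
        return pos + _SCALAR_SIZE[vtype]
    if vtype == GGUF_TYPE_STRING:
        (n,) = struct.unpack_from('<Q', buf, pos)   # u64 length + bytes
        return pos + 8 + n
    if vtype == GGUF_TYPE_ARRAY:
        etype, count = struct.unpack_from('<IQ', buf, pos)  # elem type + count
        pos += 12
        for _ in range(count):
            pos = _skip_value(buf, pos, etype)
        return pos
    raise ValueError(f'unknown GGUF value type {vtype}')

def set_bool_kv(buf: bytearray, key: str, value: bool) -> bool:
    """Flip a boolean KV entry in-place; returns True if the key was found."""
    magic, version, n_tensors, n_kv = struct.unpack_from('<IIQQ', buf, 0)
    assert magic == 0x46554747, 'not a GGUF file'   # b'GGUF' little-endian
    pos = 24                                        # header is 24 bytes
    for _ in range(n_kv):
        (klen,) = struct.unpack_from('<Q', buf, pos)
        kname = bytes(buf[pos + 8:pos + 8 + klen]).decode('utf-8')
        pos += 8 + klen
        (vtype,) = struct.unpack_from('<I', buf, pos)
        pos += 4
        if kname == key and vtype == GGUF_TYPE_BOOL:
            buf[pos] = 1 if value else 0            # bool is stored as one byte
            return True
        pos = _skip_value(buf, pos, vtype)
    return False
```

To apply it to a real model file, read it into a `bytearray`, call `set_bool_kv(data, 'tokenizer.ggml.add_space_prefix', False)`, and write the buffer back (the value is fixed-size, so the file layout is unchanged).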
It is being fixed upstream: ggerganov/llama.cpp#8614
I tried again with LLamaSharp 0.15.0, and although it has been fixed upstream (ggerganov/llama.cpp#8614), the KV override in LLamaSharp still does not work.
Description
When tokenizing a text and decoding these tokens, one can see that tokenization now (as of version 0.14.0) adds one additional starting space to `text` for every call of `Context.Tokenize(text, addBos, special)`. This is especially bad if a text is tokenized with more than one call. Version 0.13.0 did not exhibit such behavior; or at least, it did not add spaces at the start of words, changing their token ids.
This seems fine for most models (I saw this when using trollek/NinjaMouse-2.4B-32L-danube), but when I use gemma-1.1-2b-it-Q6_K.gguf (from bartowski/gemma-1.1-2b-it-GGUF), it's not working anymore. The prompt was:

Validating with `tokenize` from llama.cpp b2985 (used in LLamaSharp version 0.13.0):

Interestingly, the token at position 2 with id 2425 (`' user'`) adds a starting space to `'user'` (id 1645). But even the latest llama.cpp b3412 does not work correctly; look at the token at position 2 with id 968 (`' <'`):

Is there a way to completely prevent extra spaces from being added by tokenization anywhere? I will tokenize them by hand if necessary. 😉
Reproduction Steps
Write the prompt (see above) to `prompts.txt` and run, for llama.cpp b2985:

or, for llama.cpp b3412:
Environment & Configuration
Known Workarounds
I would love to know!