Tokenizers tokenizer #1261
base: main
Conversation
Force-pushed from f2cba4c to 3554c3e
@Jack-Khuu This PR is now the tip of the chain. I've opened it up to review, but I suspect this one will need a lot more discussion than the others. As an FYI, I'm working on a c++ implementation that would support `tokenizers`.
Moving conversation on the various open questions here. I think I've just discovered part of why converting from `tokenizers` to `tokenizer.model` is lossy. One of the main differences between the two formats is that `tokenizers` serializes an explicit, ordered `merges` list, while a rank-based format infers merge priority from the token IDs, so the merges list has no direct home in `tokenizer.model`.

UPDATE: Further digging shows this might still be ok for standard cases. For Granite Code at least, the ordering of the tokens in the vocab matches the ordering of the `merges` list, so the merge priorities survive the conversion.
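To make that ordering check concrete, here is a hedged sketch (mine, not code from this PR) against the standard `tokenizer.json` BPE schema: if the merged tokens' IDs appear in the same order as the merges themselves, rank order encodes merge priority and the explicit list can be dropped.

```python
import json

def merges_match_rank_order(tokenizer_json_path: str) -> bool:
    """Check whether merge priority is recoverable from token-id rank alone."""
    with open(tokenizer_json_path) as f:
        model = json.load(f)["model"]
    vocab = model["vocab"]  # token string -> token id
    merged_ids = []
    for merge in model["merges"]:  # ordered list of BPE merges
        # Older files store "left right" strings; newer ones store pairs.
        left, right = merge.split(" ") if isinstance(merge, str) else merge
        merged_ids.append(vocab[left + right])  # id of the merged token
    # Strictly increasing ids => rank order reproduces merge priority.
    return all(a < b for a, b in zip(merged_ids, merged_ids[1:]))
```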
Commits:
- …support (Branch: GraniteCodeSupport; Signed-off-by: Gabe Goodhart <[email protected]>)
- …tokenizers: This allows for all HF tokenizers to be supported in the python layer. It will need significant work to offer similar compatibility at the c++ layer. (Signed-off-by: Gabe Goodhart <[email protected]>)
- (Branch: GraniteCodeSupport; Signed-off-by: Gabe Goodhart <[email protected]>)
- …kenizer (Branch: GraniteCodeSupport; Signed-off-by: Gabe Goodhart <[email protected]>)
- (Branch: GraniteCodeSupport; Signed-off-by: Gabe Goodhart <[email protected]>)
Force-pushed from 3554c3e to c66ac78
Pardon the delay: I've been OOO (still am). Thanks again!!
Not a problem at all, I've been distracted on other threads too. I have some partial work towards a native c++ `tokenizers` implementation.
Dependencies
This PR is part of a sequence in support of adding Granite Code. It depends on merging the following PRs:
Issues
Closes #1251
Description

This PR adds partial support for models that use the `tokenizers` library (as opposed to `tiktoken` or `sentencepiece`) for tokenization. This PR only addresses support in the `python` runner, and it does so by creating a new class in the `tokenizer` module that simply wraps `tokenizers`.
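As a rough illustration of the shape of that approach (the class name and interface here are placeholders, not necessarily what this PR adds), the wrapper amounts to delegating encode/decode to the HF `tokenizers` API:

```python
from tokenizers import Tokenizer as HFTokenizer

class TokenizersTokenizer:
    """Thin adapter exposing encode/decode over a HF `tokenizers` tokenizer."""

    def __init__(self, tokenizer_json_path: str):
        self._t = HFTokenizer.from_file(tokenizer_json_path)

    def encode(self, text: str) -> list[int]:
        return self._t.encode(text).ids

    def decode(self, ids: list[int]) -> str:
        return self._t.decode(ids)
```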
Discussion

I'm not sure this is the correct direction to go for solving this, since the `tokenizers` library is not (to the best of my knowledge) portable to the various export formats (yet). There are two main challenges to extending tokenizer support beyond simply wrapping `tokenizers`:

Pre-tokenizers

For many tokenizers, multiple regexes are used in sequence to split the raw string (illustrated below). Not being a regex expert myself, it's not immediately clear to me whether this kind of multi-pass splitting can be merged into a single regex. For other tokenizers, a single regex is used, but it is a different expression than any of those currently implemented in `tiktoken`.
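A toy example of the multi-pass behavior (the patterns here are made up for illustration, not any real model's pre-tokenizer): each regex further splits the pieces produced by the previous one, which is not obviously expressible as a single pass.

```python
import re

# Toy pre-tokenizer pipeline: each stage re-splits the output of the
# previous stage. The patterns are illustrative only.
STAGES = [
    re.compile(r"\s+"),        # 1) split on whitespace runs
    re.compile(r"(?=[A-Z])"),  # 2) then split before capital letters
]

def pre_tokenize(text: str) -> list[str]:
    pieces = [text]
    for pattern in STAGES:
        pieces = [p for piece in pieces for p in pattern.split(piece) if p]
    return pieces

print(pre_tokenize("helloWorld fooBar"))  # ['hello', 'World', 'foo', 'Bar']
```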
From my investigation, I think there are a few candidate paths forward:
1. A `c++` implementation of the various tokenization routines from `tokenizers` in a separate implementation of the `Tokenizer` class.
2. Extending the `c++` `TikToken` class to support multiple regexes in the pre-tokenizer.
3. Encoding the pre-tokenizer configuration in the `tokenizer.model` artifact, or somehow making these tokenizer arguments an argument at instantiation time.

NOTE: The corresponding tokenization in `llama.cpp` lives here. This code is a full implementation of a unified tokenizer with configuration to dispatch between known patterns and optimized implementations. The config that indicates which tokenizer a model uses is stored directly in the model's `GGUF` file, so at load time the correct tokenizer is selected based on that value.
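In the same spirit, a minimal sketch of config-keyed dispatch (this is my illustration, not llama.cpp's actual code; the registry keys and loader names are hypothetical):

```python
from typing import Callable

# Registry mapping a tokenizer-type string (as it might be stored in the
# model artifact) to a loader for that tokenizer implementation.
TOKENIZER_REGISTRY: dict[str, Callable[[str], object]] = {}

def register(name: str):
    def wrap(fn):
        TOKENIZER_REGISTRY[name] = fn
        return fn
    return wrap

@register("sentencepiece")
def load_sentencepiece(path: str):
    ...  # construct a sentencepiece-backed tokenizer

@register("tiktoken")
def load_tiktoken(path: str):
    ...  # construct a tiktoken-backed tokenizer

def load_tokenizer(model_config: dict, path: str):
    # The model artifact carries the tokenizer type, so load time picks
    # the right implementation instead of the caller hard-coding it.
    return TOKENIZER_REGISTRY[model_config["tokenizer_type"]](path)
```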
Special Tokens

Even for models that use a single regex (and even the `llama` regex), models may use different special tokens for special functionality (chat template, FIM, tool calling, other custom prompting). Since the `tokenizer.model` stores only the vocab, there is not currently any way to note the special tokens in serialization (similar to the need for pre-tokenizer configuration), as sketched below.
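For illustration, the special tokens are present in the `tokenizers` config even though they have no home in `tokenizer.model`. A hedged sketch of pulling them out of a standard `tokenizer.json` (this helper is mine, not part of the PR; it assumes the usual schema where special tokens live in the top-level `added_tokens` list with a `special: true` flag):

```python
import json

def read_special_tokens(tokenizer_json_path: str) -> dict[str, int]:
    """Collect special tokens (content -> id) from a HF `tokenizer.json`."""
    with open(tokenizer_json_path) as f:
        data = json.load(f)
    return {
        entry["content"]: entry["id"]
        for entry in data.get("added_tokens", [])
        if entry.get("special")
    }
```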