
Tokenizers tokenizer #1261

Open
gabe-l-hart wants to merge 5 commits into main from TokenizersTokenizer-1251
Conversation

@gabe-l-hart (Contributor) commented Oct 3, 2024

Dependencies

This PR is part of a sequence in support of adding Granite Code. It depends on merging the following PRs:

Issues

Closes #1251

Description

This PR adds partial support for models that use the tokenizers library (as opposed to tiktoken or sentencepiece) for tokenization. This PR only addresses support in the python runner, and it does so by adding a new class to the tokenizer module that simply wraps tokenizers.
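For context, the wrapping approach amounts to little more than the following sketch (the class and method names here are illustrative, not necessarily the exact ones in this PR):

```python
# Minimal sketch of wrapping the Hugging Face `tokenizers` library for the
# python runner. Class and method names are illustrative only.
from typing import List

from tokenizers import Tokenizer as HFTokenizer


class TokenizersTokenizer:
    """Thin wrapper that delegates encode/decode to a `tokenizers` tokenizer."""

    def __init__(self, tokenizer_json_path: str):
        # tokenizer_json_path points at a standard HF tokenizer.json artifact
        self._tokenizer = HFTokenizer.from_file(tokenizer_json_path)

    def encode(self, text: str) -> List[int]:
        # `encode` returns an Encoding object; only the token ids are needed here
        return self._tokenizer.encode(text).ids

    def decode(self, ids: List[int]) -> str:
        return self._tokenizer.decode(ids)
```

All of the pre-tokenization and special-token handling stays inside tokenizers, which is exactly why the portability questions below matter.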

Discussion

I'm not sure this is the correct direction to go for solving this, since the tokenizers library is not (to the best of my knowledge) portable to the various export formats (yet). There are two main challenges to extending tokenizer support beyond simply wrapping tokenizers:

Pre-tokenizers

For many tokenizers, multiple regexes are used in sequence to split the raw string. Not being a regex expert myself, it's not immediately clear to me whether it's possible to merge this kind of multi-pass splitting into a single regex. For other tokenizers, a single regex is used, but it is a different expression than any of those currently implemented in tiktoken.
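To make the multi-pass concern concrete, here is a hypothetical sketch of sequential pre-tokenizer splitting; the patterns are placeholders rather than any model's actual regexes:

```python
# Hypothetical sketch of multi-pass pre-tokenizer splitting. The patterns
# below are placeholders, not the actual regexes of any real tokenizer.
import re
from typing import List

PRETOKENIZER_PATTERNS = [
    r"\s+|\S+",  # first pass: separate whitespace runs from non-whitespace runs
    r"\d+|\D+",  # second pass: split digit runs away from everything else
]


def pre_tokenize(text: str) -> List[str]:
    """Apply each pattern in order, re-splitting the pieces from the previous pass."""
    pieces = [text]
    for pattern in PRETOKENIZER_PATTERNS:
        next_pieces: List[str] = []
        for piece in pieces:
            next_pieces.extend(m.group(0) for m in re.finditer(pattern, piece))
        pieces = next_pieces
    return pieces


print(pre_tokenize("var1 = 42"))  # ['var', '1', ' ', '=', ' ', '42']
```

A single-pattern interface like the existing tiktoken path would need either an equivalent combined regex or support for a list of patterns.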

From my investigation, I think there are a few candidate paths forward:

  1. Provide a c++ implementation of the various tokenization routines from tokenizers in a separate implementation of the Tokenizer class.
  2. Extend the existing c++ TikToken class to support multiple regexes in the pre-tokenizer
    • This would also require making the set of patterns configurable, either by serializing them into the tokenizer.model artifact or by passing them as arguments at instantiation time.

NOTE: The corresponding tokenization in llama.cpp lives here. This code is a full implementation of a unified tokenizer with configuration to dispatch between known patterns and optimized implementations. The config for the model that indicates which tokenizer to use is stored in the model's GGUF file directly, so at load time, the correct tokenizer is found based on that value.
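To illustrate the config-driven dispatch idea in that design (this is not the actual llama.cpp or torchchat code; all names here are invented):

```python
# Hypothetical sketch of config-driven tokenizer dispatch, in the spirit of
# the llama.cpp approach described above. All names are invented for
# illustration and do not correspond to real torchchat or llama.cpp APIs.
from typing import Any, Callable, Dict

# Registry mapping a serialized "tokenizer type" string to a constructor
TOKENIZER_REGISTRY: Dict[str, Callable[[str], Any]] = {}


def register_tokenizer(name: str) -> Callable:
    """Decorator that registers a tokenizer class under a config name."""
    def decorator(cls):
        TOKENIZER_REGISTRY[name] = cls
        return cls
    return decorator


def build_tokenizer(model_config: Dict[str, Any], artifact_path: str) -> Any:
    """Pick the tokenizer implementation named in the model's own config."""
    tokenizer_type = model_config["tokenizer_type"]  # e.g. read from the model file
    if tokenizer_type not in TOKENIZER_REGISTRY:
        raise ValueError(f"Unknown tokenizer type: {tokenizer_type}")
    return TOKENIZER_REGISTRY[tokenizer_type](artifact_path)
```

The key point is that the choice of tokenizer is stored with the model artifact itself, so the correct implementation can be selected at load time.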

Special Tokens

Even for models that use a single regex (and even the llama regex), different models may use different special tokens for special functionality (chat templates, FIM, tool calling, other custom prompting). Since only the vocab is stored in tokenizer.model, there is currently no way to record the special tokens in the serialized artifact (similar to the need for configuration of pre-tokenizers).
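For what it's worth, the tokenizers side does carry this information: a standard tokenizer.json has an added_tokens section whose entries are flagged as special. A rough sketch of pulling those out (the field names follow the published tokenizer.json layout; the surrounding function is illustrative only):

```python
# Rough sketch of reading special tokens from a standard HF tokenizer.json.
# The "added_tokens" field names follow the published tokenizer.json layout;
# the surrounding function is illustrative only.
import json
from typing import Dict


def load_special_tokens(tokenizer_json_path: str) -> Dict[str, int]:
    """Return {token_string: token_id} for added tokens flagged as special."""
    with open(tokenizer_json_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return {
        entry["content"]: entry["id"]
        for entry in data.get("added_tokens", [])
        if entry.get("special", False)
    }
```

Nothing equivalent exists in the bare-vocab tokenizer.model, which is the gap described above.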

pytorch-bot (bot) commented Oct 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1261

✅ No Failures

As of commit c66ac78 with merge base 11dcbeb:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Oct 3, 2024
@gabe-l-hart force-pushed the TokenizersTokenizer-1251 branch 7 times, most recently from f2cba4c to 3554c3e on October 9, 2024
@gabe-l-hart marked this pull request as ready for review on October 10, 2024
@gabe-l-hart (Contributor, Author) commented:

@Jack-Khuu This PR is now the tip of the chain. I've opened it up for review, but I suspect this one will need a lot more discussion than the others. As an FYI, I'm working on a c++ implementation that would support tokenizers tokenizers (branch), but it's slow going with other competing priorities.

@gabe-l-hart (Contributor, Author) commented Oct 10, 2024

Moving the conversation on the various open questions here.

I think I've just discovered part of why converting from tokenizers to tiktoken format (e.g. with my script) is not straightforward.

One of the main differences between the tokenizer.model format and tokenizer.json, besides the presence of a bunch of metadata, is that the vocab and merges are held separately in tokenizer.json, whereas the merge ranks are explicitly expected to match the token IDs in tokenizer.model. This comment seems to indicate that this is one way the vocab can be constructed, but that it is not a required part of the BPE algorithm. This would indicate that tiktoken -> tokenizers should work fine, but tokenizers -> tiktoken will be much harder, because there's no guarantee that this assumption about ranks holds for an arbitrary vocab/merges pair in a tokenizers model.

UPDATE: Further digging shows this might still be ok for standard cases. For Granite Code at least, the ordering of the tokens in the merges strictly matches the "correct" rank, always with a value offset of 261. I've also convinced myself that the numeric value of the rank is not critical, since it is only used to order a priority queue when performing merges. As such, having the ordering match should produce the same results.
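Under that assumption (only the ordering of the ranks matters, since ranks are just priorities for the merge queue), a conversion sketch would assign ranks from the merge order plus an offset. The offset of 261 reflects the Granite Code observation above and is otherwise an assumption, and the keys are left as raw merge strings rather than mapping the byte-level BPE alphabet back to raw bytes:

```python
# Sketch of deriving tiktoken-style mergeable ranks from a tokenizers
# merges list, assuming only the *ordering* of the ranks matters. The
# offset of 261 is the value observed for Granite Code, not a general rule,
# and keys are left as raw merge strings (a real conversion would also map
# the byte-level BPE alphabet back to raw bytes).
from typing import Dict, List, Tuple

GRANITE_RANK_OFFSET = 261  # observed for Granite Code; an assumption otherwise


def merges_to_ranks(merges: List[Tuple[str, str]]) -> Dict[str, int]:
    """Assign each merged token a rank that preserves the merge ordering."""
    return {
        left + right: GRANITE_RANK_OFFSET + index
        for index, (left, right) in enumerate(merges)
    }
```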

…support

Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
…tokenizers

This allows for all HF tokenizers to be supported in the python layer. It
will need significant work to offer similar compatibility at the c++ layer.

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
…kenizer

Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
@Jack-Khuu (Contributor) commented:

Pardon the delay: I've been OOO (still am). Will take a look when I get back to the office.

Thanks again!!

@gabe-l-hart (Contributor, Author) commented:

Not a problem at all; I've been distracted on other threads too. I have some partial work towards a native c++ implementation that supports multiple pre-tokenizer regexes and custom special tokens. At the same time, one of those distracting threads has had me looking more closely at sentencepiece, and it's possible we could go the route of converting from tokenizers -> sentencepiece and avoid the need for a full c++ implementation. I'll update as I get more clarity.
