-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support tokenization of special tokens for sentencepiece tokenizer #1445
Conversation
I have changed the XLMRoberta tokenizer test spm model to be similar to the original -> https://github.com/huggingface/transformers/blob/345b9b1a6a308a1fa6559251eb33ead2211240ac/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py#L157 because it was most likely a model for another tokenizer may be |
See the high level comment here #1397 (review), but I think sentencepiece might deserve some special consideration. See this doc on special tokens in sentencepiece https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md By default, we should leave the sentencepiece proto setting unaltered. Control symbols cannot go in string, and user defined symbols can. This setting we are adding adding would be to allow control symbols to also be treated as user defined symbols. We should check if it is possible to do this by modifying the model proto here https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto#L295-L312 (we want to avoid all these complex tokenization override you have). If not, we might just want to skip sentencepiece for this option, for now. |
it's actually possible after I have found this notebook while trying to add user defined tokens to phi3 tokenizer proto. |
This PR enables tokenization of special tokens for
SentencePieceTokenizer
as was suggested in #1395 and it's a follow up of this PR #1397 .