Support tokenization of special tokens for sentencepiece tokenizer #1445

Closed
wants to merge 18 commits

Conversation

abuelnasr0
Contributor

This PR enables tokenization of special tokens for SentencePieceTokenizer, as suggested in #1395. It is a follow-up to PR #1397.

@abuelnasr0 abuelnasr0 marked this pull request as draft February 19, 2024 20:09
@abuelnasr0
Contributor Author

I have changed the XLM-RoBERTa tokenizer test spm model to match the original one -> https://github.com/huggingface/transformers/blob/345b9b1a6a308a1fa6559251eb33ead2211240ac/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py#L157, because the previous test model was most likely built for a different tokenizer, possibly ALBERT.

@abuelnasr0 abuelnasr0 marked this pull request as ready for review February 20, 2024 00:25
@mattdangerw
Member

See the high-level comment here #1397 (review), but I think sentencepiece might deserve some special consideration. See this doc on special tokens in sentencepiece: https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md

By default, we should leave the sentencepiece proto settings unaltered. Control symbols cannot appear in input strings, while user-defined symbols can. The setting we are adding would allow control symbols to also be treated as user-defined symbols. We should check whether it is possible to do this by modifying the model proto here https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto#L295-L312 (we want to avoid all of these complex tokenization overrides you have). If not, we might just want to skip sentencepiece for this option for now.
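
For illustration, here is a minimal sketch (not part of this PR; the corpus, symbols, and vocab size are made up) of the difference described above: a user-defined symbol is matched directly in the input string, while a control symbol is never produced from raw text.

```python
import io

import sentencepiece as spm

# Train a tiny throwaway model with one control symbol and one
# user-defined symbol.
model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(["abc abc abc", "cab cab cab", "bca bca bca"]),
    model_writer=model,
    vocab_size=12,
    control_symbols=["<ctrl>"],       # reserved id, never produced from text
    user_defined_symbols=["<user>"],  # matched directly in the input string
)
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())

# The user-defined symbol comes back as a single piece...
print(sp.encode("<user> abc", out_type=str))
# ...while the control symbol is split into sub-pieces / unknowns like
# ordinary text instead of being mapped to its reserved id.
print(sp.encode("<ctrl> abc", out_type=str))
```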

@abuelnasr0
Contributor Author

abuelnasr0 commented Apr 27, 2024

@mattdangerw

We should check if it is possible to do this by modifying the model proto here

It's actually possible. I found a notebook showing this while trying to add user-defined tokens to the Phi-3 tokenizer proto.
It turns out you can access any special token and change its type from 3 (control) to 4 (user defined), as described in the link you provided.
I tried a small example offline and it worked very well. I will close this PR and open a new one.
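
For reference, a rough sketch of that proto edit (assuming a sentencepiece release that ships the sentencepiece_model_pb2 bindings; "tokenizer.model" below is a placeholder path, not a file from this PR): every CONTROL piece (type 3) is flipped to USER_DEFINED (type 4), after which the special tokens are tokenized directly from the input string.

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load the serialized SentencePiece model proto.
proto = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    proto.ParseFromString(f.read())

# Flip control symbols (type 3) to user-defined symbols (type 4) so they
# can be matched directly in raw input text.
for piece in proto.pieces:
    if piece.type == sp_pb2.ModelProto.SentencePiece.CONTROL:
        piece.type = sp_pb2.ModelProto.SentencePiece.USER_DEFINED

# Rebuild a processor from the modified proto.
sp = spm.SentencePieceProcessor(model_proto=proto.SerializeToString())
print(sp.encode("<s> hello </s>", out_type=str))
```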

@abuelnasr0 abuelnasr0 closed this Apr 27, 2024
@abuelnasr0 abuelnasr0 deleted the sp_special_tokens branch April 27, 2024 11:51
@abuelnasr0 abuelnasr0 mentioned this pull request May 7, 2024