Support tokenization of special tokens for sentencepiece tokenizer #1445

Closed
wants to merge 18 commits

Conversation

abuelnasr0
Contributor

This PR enables tokenization of special tokens for SentencePieceTokenizer, as suggested in #1395. It is a follow-up to PR #1397.

@abuelnasr0 abuelnasr0 marked this pull request as draft February 19, 2024 20:09
@abuelnasr0
Contributor Author

I have changed the XLM-RoBERTa tokenizer test spm model to match the original one -> https://github.com/huggingface/transformers/blob/345b9b1a6a308a1fa6559251eb33ead2211240ac/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py#L157, because the previous test model was most likely built for a different tokenizer, possibly ALBERT.

@abuelnasr0 abuelnasr0 marked this pull request as ready for review February 20, 2024 00:25
@mattdangerw
Member

See the high-level comment here #1397 (review), but I think sentencepiece might deserve some special consideration. See this doc on special tokens in sentencepiece: https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md

By default, we should leave the sentencepiece proto settings unaltered. Control symbols cannot appear in input strings, while user-defined symbols can. The setting we are adding would allow control symbols to also be treated as user-defined symbols. We should check whether it is possible to do this by modifying the model proto here https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto#L295-L312 (we want to avoid all of these complex tokenization overrides you have). If not, we might just want to skip sentencepiece for this option for now.
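
For illustration, here is a minimal sketch (not part of this PR; the corpus, symbols, and vocab size are made up) of the difference described above: a user-defined symbol is matched directly in the input string, while a control symbol is never produced from raw text.

```python
import io

import sentencepiece as spm

# Train a tiny throwaway model with one control symbol and one
# user-defined symbol.
model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(["abc abc abc", "cab cab cab", "bca bca bca"]),
    model_writer=model,
    vocab_size=12,
    control_symbols=["<ctrl>"],       # reserved id, never produced from text
    user_defined_symbols=["<user>"],  # matched directly in the input string
)
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())

# The user-defined symbol comes back as a single piece...
print(sp.encode("<user> abc", out_type=str))
# ...while the control symbol is split into sub-pieces / unknowns like
# ordinary text instead of being mapped to its reserved id.
print(sp.encode("<ctrl> abc", out_type=str))
```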

@abuelnasr0
Contributor Author

abuelnasr0 commented Apr 27, 2024

@mattdangerw

We should check if it is possible to do this by modifying the model proto here

It's actually possible. I found a notebook showing this while trying to add user-defined tokens to the Phi-3 tokenizer proto.
It turns out you can access any special token and change its type from 3 (control) to 4 (user defined), as described in the link you provided.
I tried a small example offline and it worked very well. I will close this PR and open a new one.
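
For reference, a rough sketch of that proto edit (assuming a sentencepiece release that ships the sentencepiece_model_pb2 bindings; "tokenizer.model" below is a placeholder path, not a file from this PR): every CONTROL piece (type 3) is flipped to USER_DEFINED (type 4), after which the special tokens are tokenized directly from the input string.

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load the serialized SentencePiece model proto.
proto = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    proto.ParseFromString(f.read())

# Flip control symbols (type 3) to user-defined symbols (type 4) so they
# can be matched directly in raw input text.
for piece in proto.pieces:
    if piece.type == sp_pb2.ModelProto.SentencePiece.CONTROL:
        piece.type = sp_pb2.ModelProto.SentencePiece.USER_DEFINED

# Rebuild a processor from the modified proto.
sp = spm.SentencePieceProcessor(model_proto=proto.SerializeToString())
print(sp.encode("<s> hello </s>", out_type=str))
```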

@abuelnasr0 abuelnasr0 closed this Apr 27, 2024
@abuelnasr0 abuelnasr0 deleted the sp_special_tokens branch April 27, 2024 11:51
@abuelnasr0 abuelnasr0 mentioned this pull request May 7, 2024