Support tokenization of special tokens for word_piece_tokenizer #1397
Conversation
Thanks! This is a great feature. I will think more on this next week. I like the overall approach of doing this on top of tf-text instead of inside the tf-text ops. But we may want to think about this more generally. Something like this as an end state...
Main question here for me... can we guarantee, for all tokenizer types, that this will work without messing with the "black box" tf-text op? How? @abuelnasr0 let me know what you think! Once we have an overall plan, landing this incrementally (e.g. starting with word piece, as you are here) sounds good!
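To make the "on top of tf-text" approach concrete, here is a minimal sketch of the idea being discussed: split the input on the special tokens first, pass only the ordinary segments to the black-box tokenizer op, and emit the special tokens untouched. All names here (`tokenize_with_special_tokens`, `tokenize_fn`) are illustrative, not the PR's actual API; a plain whitespace split stands in for the real tf-text op.

```python
import re

def tokenize_with_special_tokens(text, special_tokens, tokenize_fn):
    """Split `text` around special tokens, tokenizing only plain segments.

    `tokenize_fn` stands in for the black-box tokenizer op. This is an
    illustrative sketch, not the KerasNLP implementation.
    """
    # A capturing group makes re.split keep the special tokens in the output.
    pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    tokens = []
    for segment in re.split(pattern, text):
        if segment in special_tokens:
            tokens.append(segment)  # pass special tokens through whole
        elif segment:
            tokens.extend(tokenize_fn(segment))
    return tokens

# Toy "black box" tokenizer: plain whitespace split.
print(tokenize_with_special_tokens(
    "Bert uses [SEP] to mark boundaries", ["[SEP]"], str.split))
# → ['Bert', 'uses', '[SEP]', 'to', 'mark', 'boundaries']
```

Because the special tokens never reach the inner op, no guarantee about the op's internals is needed beyond it tokenizing plain text.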
For byte pair and wordpiece (by this PR), I think that is guaranteed.
@mattdangerw check out these two PRs: #1445 (support for sentencepiece) and #1447 (which fixes a small nit in the tokenizer, but we can use the same PR to make the behaviour similar for all). For Sentencepiece, I named the argument
Regarding
OK, dropping some comments here for the whole set of PRs.
Re `special_tokens` or `unsplittable_tokens`: I think what we are missing here is configurability. The goal is to be able to turn this behavior on or off, not at the base tokenizer class but at the model-level implementations. Basically, we want this...
```python
tokenizer = BertTokenizer.from_preset("bert_base_en")
tokenizer(["Bert uses [SEP] to mark sequence boundaries"])
# Tokenizes "[SEP]" as the tokens "[", "SE", "#P", "]".

tokenizer = BertTokenizer.from_preset("bert_base_en", special_tokens_in_strings=True)
tokenizer(["Bert uses [SEP] to mark sequence boundaries"])
# Tokenizes "[SEP]" as the single token "[SEP]", id 102.
```
We want to give users a knob that works in a cross-model way to either disallow special tokens in strings, or allow them (without even needing to know all the special tokens themselves).
- If you are building a production system, you usually want this setting off. You don't want random strings that happen to contain special tokens to mess with the number of segments you have.
- If you are writing a guide, or building a prototype, you will often want to allow this. Gaining the ability to add special tokens in string space is very readable.

Does that make sense?
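The on/off semantics proposed above can be mocked up in a few lines. This `ToyTokenizer` is purely illustrative (it is not the KerasNLP class): a regex stands in for WordPiece subword splitting, so with the knob off, `"[SEP]"` fragments into pieces, and with it on, the token survives whole.

```python
import re

class ToyTokenizer:
    """Mock illustrating the proposed `special_tokens_in_strings` knob."""

    def __init__(self, special_tokens, special_tokens_in_strings=False):
        self.special_tokens = special_tokens
        self.special_tokens_in_strings = special_tokens_in_strings

    def _subword(self, word):
        # Stand-in for WordPiece splitting: punctuation is separated out,
        # so "[SEP]" fragments into "[", "SEP", "]".
        return re.findall(r"\w+|[^\w\s]", word)

    def __call__(self, text):
        if self.special_tokens_in_strings:
            # Protect special tokens by splitting around them first.
            pattern = "(" + "|".join(re.escape(t) for t in self.special_tokens) + ")"
            segments = [s for s in re.split(pattern, text) if s.strip()]
        else:
            segments = [text]
        tokens = []
        for segment in segments:
            if segment in self.special_tokens and self.special_tokens_in_strings:
                tokens.append(segment)  # kept whole
            else:
                for word in segment.split():
                    tokens.extend(self._subword(word))
        return tokens

tok_off = ToyTokenizer(["[SEP]"])
tok_on = ToyTokenizer(["[SEP]"], special_tokens_in_strings=True)
print(tok_off("uses [SEP] here"))  # → ['uses', '[', 'SEP', ']', 'here']
print(tok_on("uses [SEP] here"))   # → ['uses', '[SEP]', 'here']
```

The important design point is that the flag is set per model tokenizer at construction time, so production code and prototyping code can make opposite choices without either needing the list of special tokens.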
@mattdangerw Indeed! Thanks for that; it sounds reasonable, and it's a point of view I was not aware of. I will address that for
Force-pushed from 0fe9419 to d927a86
@mattdangerw Check out the last changes. I have added
Thanks! Will take a look shortly!
Yeah, this is OK I think. We would just want to call it out clearly as a breaking change in our next release. Thankfully the fix is quite clear: anyone who wants the old behavior can pass the option.
Nice! This looks good to me. Just a couple of comments.
lgtm! will pull in when CI is done
Squashed commits (…s-team#1397):
* Support tokenization of special tokens for word_piece_tokenizer
* Add the feature to models tokenizers
* Format the code
* Fix format
* Small fixes
* Add tests for bert
* Add tests for distilbert
* Small fix for bert test
* Add tests for electra
* Fix code format
* Rename unsplittable to special
* Edit special_tokens arg
* Format the code
* Move special tokens checking into base class
* Add special_tokens_in_strings arg
* Shorten comments
* Shorten comments
* Shorten the logic of splitting and add comments
* Code format
This PR enables the user to tokenize special tokens from text input, as was suggested in #1395. The behaviour is the same as `BytePairTokenizer`, I think. Also, the models' tokenizers that use `WordPieceTokenizer` have the same behaviour as the models' tokenizers that use `BytePairTokenizer`.