It can happen that the tokenization results are unsatisfactory in some way, and the question is what the mechanism to customize/improve them should be. It could be either:

a) adding options for these optional improvements in the tokenizer. The issue is that some of these options might be relevant to multiple tokenizers.
b) adding a new step later in the pipeline. That's probably the best way to allow arbitrary customization. The issue is that some steps might be specific to the previous step, and adding them to the library might be confusing.

There is probably a balance to be found between the two.
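To make option (b) concrete, here is a minimal sketch of such a pipeline. All names here are assumptions for illustration, not part of the library's API: a stand-in base tokenizer followed by arbitrary post-processing steps.

```python
import re

# Hypothetical sketch of option (b): a base tokenizer followed by arbitrary
# post-processing steps. None of these names come from the library API.

def base_sentence_tokenizer(text):
    # Stand-in tokenizer: split after sentence-ending punctuation + whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def strip_tokens(tokens):
    # Example post-processing step: normalize surrounding whitespace.
    return [t.strip() for t in tokens]

def pipeline(text, steps):
    # Apply each post-processing step to the tokenizer output, in order.
    tokens = base_sentence_tokenizer(text)
    for step in steps:
        tokens = step(tokens)
    return tokens

print(pipeline("One sentence. Another one!", [strip_tokens]))
# → ['One sentence.', 'Another one!']
```

The appeal of this shape is that a post-processing step is just a function from token list to token list, so it composes with any base tokenizer; the drawback, as noted above, is that a given step may only make sense after one specific tokenizer.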
For instance, `PunctuationTokenizer`:
- currently doesn't take into account repeated punctuation
- will tokenize stray punctuation such as `,.` as separate sentences

Both could probably be addressed by adding an option to force sentences to be longer than some minimal length (and otherwise append them to the previous token).
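A sketch of that minimal-length option (hypothetical names and threshold, not the library's API): sentences shorter than the threshold get appended to the previous token.

```python
def merge_short_sentences(sentences, min_len=5):
    # Hypothetical option: force sentences to be at least `min_len` characters
    # long; otherwise append them to the previous sentence.
    merged = []
    for s in sentences:
        if merged and len(s) < min_len:
            merged[-1] += " " + s  # too short: attach to the previous token
        else:
            merged.append(s)
    return merged

# Repeated punctuation such as "!!" can produce spurious tiny "sentences"
# that this step folds back into the previous one.
print(merge_short_sentences(["Wait", "!!", "Then it continued."]))
# → ['Wait !!', 'Then it continued.']
```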
`UnicodeSentenceTokenizer`:
- will not tokenize sentences separated by punctuation without a following space

That's a very common occurrence in actual text, and I think a workaround should be found (e.g. an additional tokenization pass with a regex/punctuation tokenizer).
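One possible shape for such a workaround (a sketch, not the library's implementation) is a second regex pass that splits at sentence-ending punctuation immediately followed by an uppercase letter, i.e. a boundary with the space missing:

```python
import re

def split_missing_space(sentences):
    # Second pass over already-tokenized sentences: split at ".", "!" or "?"
    # directly followed by an uppercase letter (no intervening space).
    out = []
    for s in sentences:
        out.extend(re.split(r"(?<=[.!?])(?=[A-Z])", s))
    return out

print(split_missing_space(["First sentence.Second sentence."]))
# → ['First sentence.', 'Second sentence.']
```

The uppercase-letter lookahead is a crude heuristic (it would mis-split abbreviations like `U.S.A.`), which is exactly why such a rule fits better as an optional extra pass than as default tokenizer behavior.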
Generally it would be good to add some evaluation benchmarks for sentence tokenization to the `evaluation/` folder.
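A minimal sketch of what such a benchmark could compute (hypothetical helper names and toy data): score predicted sentence boundaries against gold boundaries, by character offset, with an F1 metric.

```python
def boundary_offsets(sentences):
    # Character offsets at which each sentence ends, assuming the sentences
    # concatenate back to the original text.
    offsets, pos = set(), 0
    for s in sentences:
        pos += len(s)
        offsets.add(pos)
    return offsets

def boundary_f1(gold, predicted):
    # F1 over end-of-sentence character offsets.
    g, p = boundary_offsets(gold), boundary_offsets(predicted)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["One. ", "Two. ", "Three."]
pred = ["One. Two. ", "Three."]  # one missed boundary
print(round(boundary_f1(gold, pred), 2))
# → 0.8
```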
`UnicodeTokenizer` is currently extended in `VTextTokenizer` (for lack of a better name) with a few additional rules. Maybe this could have been a separate token-processing step, particularly if one imagines that more rules could be added (or potentially even an ML model used).
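As a sketch of what that separate step could look like (an illustrative rule, not the actual `VTextTokenizer` behavior), extra rules could be layered on top of any base tokenizer's output:

```python
def apply_token_rules(tokens, rules):
    # Apply each rule (a function from token list to token list) in order,
    # mirroring how extra segmentation rules could be layered on a base
    # tokenizer instead of being baked into it.
    for rule in rules:
        tokens = rule(tokens)
    return tokens

def merge_hyphenated(tokens):
    # Illustrative rule: rejoin "foo", "-", "bar" into "foo-bar".
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "-" and out and i + 1 < len(tokens):
            out[-1] = out[-1] + "-" + tokens[i + 1]
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(apply_token_rules(["state", "-", "of", "-", "the", "-", "art"],
                        [merge_hyphenated]))
# → ['state-of-the-art']
```

An ML model would slot into the same interface: it is just another `rules` entry mapping a token list to a token list.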