That's intended; see the normalizeSpace option at https://nlp.stanford.edu/software/tokenizer.html. The tokenizer emits phone numbers (such as 0800 555 111) and numbers with fractions (such as 2 1/2) as a single token, with non-breaking spaces in between. Not sure why "at uefa.com" is joined as well, but I get the same result as you.
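If it helps to check which vocab entries are affected, here is a minimal Python sketch. It assumes the joined tokens contain U+00A0 (non-breaking space), as the normalizeSpace description suggests. The sample `vocab` list and the helper names are made up for illustration and are not part of any tool's API.

```python
from typing import List

NBSP = "\u00a0"  # non-breaking space, assumed to be the joining character


def has_joined_parts(token: str) -> bool:
    """Return True if the token contains a non-breaking space."""
    return NBSP in token


def split_joined(token: str) -> List[str]:
    """Split a joined token back into its space-separated parts."""
    return token.split(NBSP)


# Hypothetical vocab entries, for illustration only.
vocab = ["the", "0800\u00a0555\u00a0111", "2\u00a01/2", "uefa.com"]

# Entries that the tokenizer emitted as one token with internal spaces.
joined = [t for t in vocab if has_joined_parts(t)]

for token in joined:
    print(repr(token), "->", split_joined(token))
```

Whether you want to split such entries or keep them depends on how the vocab is used downstream; the sketch only makes them visible.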
Why is, for example, 0800 555 111 356 included in the generated vocab file? This example is at line 23163. Or is it just me who has this problem?