Better unicode support in tokenization rules #31

rth · 2019-04-13T09:13:37Z

Currently, the VTextTokenizer first computes Unicode segmentation (which should handle Unicode well by definition), then applies a few simple rules on top to produce a tokenization that is more standard in NLP (and possibly language dependent).

These rules might need to be generalized a bit to handle Unicode better. For instance, we currently merge tokens linked by "-", but only for the ASCII character (U+002D), not for other Unicode hyphen variants.
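
To make the idea concrete, here is a minimal sketch of a hyphen-merging rule that accepts a few Unicode hyphen variants as well as the ASCII one. It is not the actual VTextTokenizer code: the function names (`is_hyphen`, `merge_hyphenated`), the choice of code points, and the "word, hyphen, word" merge condition are all assumptions made for illustration.

```rust
/// Returns true for characters treated as a hyphen when joining tokens.
/// The list below is an illustrative assumption, not vtext's rule set.
fn is_hyphen(c: char) -> bool {
    matches!(
        c,
        '\u{002D}'   // HYPHEN-MINUS (the ASCII one)
        | '\u{2010}' // HYPHEN
        | '\u{2011}' // NON-BREAKING HYPHEN
        | '\u{FE63}' // SMALL HYPHEN-MINUS
        | '\u{FF0D}' // FULLWIDTH HYPHEN-MINUS
    )
}

/// True if the token consists of exactly one hyphen character.
fn is_hyphen_token(tok: &str) -> bool {
    let mut chars = tok.chars();
    match (chars.next(), chars.next()) {
        (Some(c), None) => is_hyphen(c),
        _ => false,
    }
}

/// True if the token is non-empty and fully alphanumeric (Unicode-aware).
fn is_word(tok: &str) -> bool {
    !tok.is_empty() && tok.chars().all(char::is_alphanumeric)
}

/// Merges `word, hyphen, word` runs (as produced by a segmentation step)
/// into single tokens; every other token is passed through unchanged.
fn merge_hyphenated(tokens: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        let tok = tokens[i];
        // The previous output token must end with a letter or digit.
        let prev_is_word = out
            .last()
            .and_then(|s| s.chars().last())
            .map_or(false, char::is_alphanumeric);
        // The next input token must be a plain word.
        let next_is_word = tokens.get(i + 1).map_or(false, |t| is_word(t));

        if is_hyphen_token(tok) && prev_is_word && next_is_word {
            let last = out.last_mut().expect("prev_is_word implies a previous token");
            last.push_str(tok);
            last.push_str(tokens[i + 1]);
            i += 2; // consumed the hyphen and the following word
        } else {
            out.push(tok.to_string());
            i += 1;
        }
    }
    out
}
```

With this sketch, the token sequence "well", U+2010, "known" is merged into one token in the same way as "well", "-", "known", while a leading or trailing hyphen is left as a separate token.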

rth · 2019-04-13T10:57:35Z

Using https://github.com/BurntSushi/utf8-ranges would probably be quite useful without sacrificing speed too much.

rth added the tokenization label Apr 13, 2019