nlp_tools.md

feature

  • sentencepiece, an unsupervised text tokenizer and detokenizer

  • subword-nmt, preprocessing scripts to segment text into subword units

  • fastBPE, a C++ implementation of "Neural Machine Translation of Rare Words with Subword Units", with a Python API

  • Chinese text normalization

  • python-pinyin, converts Chinese characters to pinyin

  • zhconv, converts between Simplified and Traditional Chinese

  • jieba, Python Chinese word segmentation module

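  • Macropodus, common NLP functions including Chinese word segmentation, part-of-speech tagging, named entity recognition, keyword extraction, text summarization, new word discovery, text similarity, a calculator, number conversion, pinyin conversion, and Traditional/Simplified Chinese conversion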
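
Several of the tools above (subword-nmt, fastBPE) segment text into subword units with byte pair encoding (BPE). The following is a minimal pure-Python sketch of that idea, included only to illustrate what such tools compute. It is not the code or API of any listed library: the function names `learn_bpe`, `merge_pair`, and `segment`, the `</w>` end-of-word marker, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties go to the lexicographically larger pair
        # (an arbitrary choice made here only to keep the result deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the new merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split `word` into subword units by applying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(segment("lowly", merges))
```

With this toy corpus, the two learned merges build the unit `low`, so an unseen word such as `lowly` is split into a known prefix plus single characters. Real implementations differ in details such as tie-breaking, marker conventions, and vocabulary thresholds, so consult each tool's own documentation for its actual behavior.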