feature

sentencepiece, SentencePiece is an unsupervised text tokenizer and detokenizer
subword-nmt, preprocessing scripts to segment text into subword units
fastBPE, C++ implementation of Neural Machine Translation of Rare Words with Subword Units, with Python API
chinese text normalization
python-pinyin, converts Chinese characters to pinyin
zhconv, conversion between Simplified and Traditional Chinese
jieba, Python Chinese word segmentation module
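Macropodus, common Chinese NLP functions: word segmentation, part-of-speech tagging, named entity recognition, keyword extraction, text summarization, new word discovery, text similarity, calculator, number conversion, pinyin conversion, and Traditional/Simplified conversion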
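
The subword tools listed above (subword-nmt, fastBPE) are built around byte-pair encoding (BPE) as described in "Neural Machine Translation of Rare Words with Subword Units". As a rough illustration of the idea only, here is a minimal standard-library sketch of how BPE merges can be learned from word frequencies. The function names (`get_pair_counts`, `merge_pair`, `learn_bpe`) and the toy word counts are hypothetical and are not the API of either library; the real tools handle tie-breaking, large corpora, and file formats in their own ways.

```python
import re
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    bigram = re.escape(" ".join(pair))
    # Match the pair only on whole-symbol boundaries.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    """Return the list of merges learned from a {word: count} mapping."""
    # Represent each word as space-separated characters plus an end-of-word mark.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Ties are broken by first occurrence; real tools may differ.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges


# Toy word counts, chosen only for illustration.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(toy_counts, 3)
```

Each learned merge is a pair of symbols to be joined; applying the merges in order to new text segments rare words into more frequent subword units.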