Dataset and code accompanying the paper Open Vocabulary Learning for Neural Chinese Pinyin IME.
This repository provides two processed corpora for IME evaluation: the People’s Daily corpus (PD) and the TouchPal corpus (TP).
| Corpus | Statistic | Chinese | Pinyin |
|---|---|---|---|
| PD | MIUs | 5.04M | |
| | Word | 24.7M | 24.7M |
| | Vocab | 54.3K | 41.1K |
| | Target Vocab (train) | 2309 | - |
| | Target Vocab (dec) | 2168 | - |
| TP | MIUs | 689.6K | |
| | Word | 4.1M | 4.1M |
| | Vocab | 27.2K | 20.2K |
| | Target Vocab (train) | 2020 | - |
| | Target Vocab (dec) | 2009 | - |
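
As a rough illustration of how counts like those in the table could be recomputed, here is a minimal sketch. It assumes one MIU per line with whitespace-separated tokens, that "Word" is the total token count, and that "Vocab" is the number of distinct tokens; none of this is confirmed by the README, and `corpus_stats` is a hypothetical helper, not part of the released code.

```python
from collections import Counter
from typing import Iterable, Tuple


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (segments, tokens, vocabulary size) for whitespace-tokenized lines.

    Assumption: each non-blank line is one segment (MIU) and tokens are
    separated by whitespace.
    """
    counts: Counter = Counter()
    n_segments = 0
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # ignore blank lines
        n_segments += 1
        counts.update(tokens)
    return n_segments, sum(counts.values()), len(counts)
```

Passing an open file handle (for example, one opened with `encoding="utf-8"`) as `lines` streams the file, so a large corpus does not need to be loaded into memory at once.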
The file suffixes have the following meanings:
.ali: target
.py: source
.adddict: training set
.test2k: test set
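
The source/target pairing above can be read with a short loader such as the following sketch. It assumes the two files are line-aligned, UTF-8 encoded, and whitespace-tokenized, which the README does not state; `read_parallel` is a hypothetical helper and any file names passed to it must be taken from the downloaded corpus.

```python
from itertools import zip_longest
from typing import Iterator, List, Tuple


def read_parallel(src_path: str, tgt_path: str) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (source_tokens, target_tokens) pairs from two line-aligned files.

    Assumption: line i of the source file corresponds to line i of the
    target file, and tokens are separated by whitespace.
    """
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip_longest(src, tgt):
            if src_line is None or tgt_line is None:
                # One file ended before the other, so the alignment is broken.
                raise ValueError("source and target files have different line counts")
            yield src_line.split(), tgt_line.split()
```

Raising on a line-count mismatch, rather than silently truncating, makes a misaligned download or a wrong file pairing visible immediately.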
The full corpus and pre-trained vectors can be downloaded from https://drive.google.com/drive/folders/1v6QW7ULu-iYxU5uruiuSgYGmoXOcHAeX?usp=sharing
We also release our source code, which is modified from OpenNMT and has similar usage, to help others reproduce our results.
If you use this repo, please cite our paper:
@inproceedings{zhang2019acl-ime,
title = "{Open Vocabulary Learning for Neural Chinese Pinyin IME}",
author = "Zhang, Zhuosheng and Huang, Yafang and Zhao, Hai",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL)",
year = "2019",
}