GitHub - cooelf/OpenIME: Open Vocabulary Learning for Neural Chinese Pinyin IME (ACL 2019)

Dataset and code accompanying the paper "Open Vocabulary Learning for Neural Chinese Pinyin IME".

Dataset

We provide two processed corpora for IME evaluation: the People's Daily corpus (PD) and the TouchPal corpus (TP).

Corpus  Statistic             Chinese  Pinyin
PD      MIUs                  5.04M
        Word                  24.7M    24.7M
        Vocab                 54.3K    41.1K
        Target Vocab (train)  2309     -
        Target Vocab (dec)    2168     -
TP      MIUs                  689.6K
        Word                  4.1M     4.1M
        Vocab                 27.2K    20.2K
        Target Vocab (train)  2020     -
        Target Vocab (dec)    2009     -

The file name suffixes indicate the role of each file:

.ali: target side (Chinese)

.py: source side (pinyin)

.adddict: training set

.test2k: test set
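
As an illustration of how a source/target pair of files could be read together, here is a minimal sketch. It assumes, without confirmation from this README, that the two files are line-aligned (one unit per line) and that tokens are separated by whitespace. The function name `read_parallel` and the file names in the usage comment are hypothetical.

```python
from itertools import zip_longest
from typing import Iterator, List, Tuple


def read_parallel(src_path: str, tgt_path: str) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (source_tokens, target_tokens) pairs from two line-aligned files.

    Assumption: line i of the source file corresponds to line i of the
    target file, and tokens are whitespace-separated.
    """
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, so a length mismatch is detected.
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.split(), t.split()


# Hypothetical usage (the actual file names in the download may differ):
# for pinyin, chinese in read_parallel("pd.test2k.py", "pd.test2k.ali"):
#     print(pinyin, chinese)
```

The files are streamed line by line rather than loaded whole, since the PD corpus contains millions of units.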

The full corpus and pre-trained vectors can be downloaded from https://drive.google.com/drive/folders/1v6QW7ULu-iYxU5uruiuSgYGmoXOcHAeX?usp=sharing

Source Code

We also release our source code to help others reproduce our results. It is modified from OpenNMT and is used in a similar way.
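
Since the code follows OpenNMT usage, the usual three-step workflow of the Lua/Torch version of OpenNMT (preprocess, train, translate) presumably applies. The sketch below only assembles and prints the corresponding command lines. The flags are those of the upstream OpenNMT quickstart; whether this modified version keeps them unchanged is an assumption, and every path in `HYPOTHETICAL_PATHS` is a placeholder rather than a file shipped with the repository.

```python
import shlex
from typing import Dict, List

# Placeholder paths for illustration only; replace them with real files.
HYPOTHETICAL_PATHS = {
    "train_src": "data/train.py",
    "train_tgt": "data/train.ali",
    "valid_src": "data/valid.py",
    "valid_tgt": "data/valid.ali",
    "data_prefix": "data/ime",
    "model_prefix": "ime-model",
    "model_file": "ime-model_final.t7",
    "test_src": "data/test.py",
    "pred_file": "pred.txt",
}


def onmt_commands(paths: Dict[str, str]) -> List[List[str]]:
    """Build upstream-OpenNMT-style commands (flags assumed unchanged in this fork)."""
    return [
        ["th", "preprocess.lua",
         "-train_src", paths["train_src"], "-train_tgt", paths["train_tgt"],
         "-valid_src", paths["valid_src"], "-valid_tgt", paths["valid_tgt"],
         "-save_data", paths["data_prefix"]],
        # Upstream preprocess.lua writes <save_data>-train.t7.
        ["th", "train.lua",
         "-data", paths["data_prefix"] + "-train.t7",
         "-save_model", paths["model_prefix"]],
        ["th", "translate.lua",
         "-model", paths["model_file"],
         "-src", paths["test_src"],
         "-output", paths["pred_file"]],
    ]


for cmd in onmt_commands(HYPOTHETICAL_PATHS):
    # Quote each argument so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Running the script prints one command per step. Check each flag against the options accepted by `preprocess.lua`, `train.lua` and `translate.lua` in this repository before use.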

Reference

If you use this repo, please cite our paper:

@inproceedings{zhang2019acl-ime,
	title = "{Open Vocabulary Learning for Neural Chinese Pinyin IME}",
	author = "Zhang, Zhuosheng and Huang, Yafang and Zhao, Hai",
	booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL)",
	year = "2019",
}

Folders and files

benchmark
onmt
rocks
test
tools
README.md
lm.lua
preprocess.lua
score.py
score2.py
score_top5.py
tag.lua
train.lua
translate.lua
