The code for generating the distant supervision is in corpus.ipynb. The next step is to train a standard sequence labeling model (Bi-LSTM, RoBERTa, ChemBERTa, ...) on the distantly supervised data.
The data is in the /data folder. The training data is too large to upload to the repository and can be found here: CHEM_train.json. The human-annotated test data is in /data/CHEM_test_annotations.jsonl.
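As a starting point for inspecting the test annotations, here is a minimal sketch that reads a JSON Lines file into Python dictionaries. It assumes the file follows the usual `.jsonl` convention of one JSON object per line, and that the path is relative to the repository root. The helper name `read_jsonl` is hypothetical and not part of this repository. The record schema is not described here, so the example only prints the keys of the first record rather than assuming any field names.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, Union


def read_jsonl(path: Union[str, Path]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a JSON Lines file.

    Hypothetical helper: assumes each line is a standalone JSON object.
    """
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                # Skip blank lines, which some exporters leave at the end.
                continue
            yield json.loads(line)


if __name__ == "__main__":
    # Assumed location relative to the repository root.
    test_path = Path("data") / "CHEM_test_annotations.jsonl"
    records = list(read_jsonl(test_path))
    print(f"Loaded {len(records)} records")
    if records:
        # Inspect the schema instead of assuming field names.
        print("Fields in first record:", sorted(records[0].keys()))
```

Reading the file lazily with a generator keeps memory use low, which may also help if the same approach is tried on the much larger training file, provided that file is reformatted or parsed accordingly.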
@inproceedings{wang2021chemner,
title={ChemNER: Fine-grained chemistry named entity recognition with ontology-guided distant supervision},
author={Wang, Xuan and Hu, Vivian and Song, Xiangchen and Garg, Shweta and Xiao, Jinfeng and Han, Jiawei},
booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
year={2021}
}