PyTorch implementation of Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence, CHIL 2022.
This model (CIM) corrects misspellings with a character-based language model and a corruption model (edit distance). The model is pre-trained and evaluated on a clinical corpus and clinical spelling-correction datasets. Please see the paper for a more detailed explanation.
- Python 3.8 and the packages in `requirements.txt`
- The MIMIC-III dataset (v1.4): PhysioNet link
- BlueBERT: GitHub link
- The SPECIALIST Lexicon of UMLS: LSG website
- English dictionary (DWYL): GitHub link
$ git clone --recursive https://github.com/dalgu90/cim-misspelling.git
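After cloning, the Python dependencies can be installed from `requirements.txt`, for example (a minimal sketch; using a virtual environment is optional):

```bash
$ cd cim-misspelling
$ pip install -r requirements.txt
```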
- Download the MIMIC-III dataset from PhysioNet, especially `NOTEEVENTS.csv`, and put it under `data/mimic3`.
- Download the `LRWD` and `prevariants` files of the SPECIALIST Lexicon from the LSG website (2018AB version) and put them under `data/umls`.
- Download the English dictionary `english.txt` from here (commit 7cb484d) and put it under `data/english_words`.
- Run `scripts/build_vocab_corpus.ipynb` to build the dictionary and split the MIMIC-III notes into files.
- Run the Jupyter notebook for the dataset that you want to download/pre-process:
- Download the BlueBERT model from here and put it under `bert/ncbi_bert_{base|large}`.
  - For CIM-Base, please download "BlueBERT-Base, Uncased, PubMed+MIMIC-III".
  - For CIM-Large, please download "BlueBERT-Large, Uncased, PubMed+MIMIC-III".
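The directory layout that the steps above produce can be sketched as follows (assuming the default paths used above; the exact BlueBERT file names depend on the downloaded archive):

```bash
# Create the expected directories (a sketch; adjust if you use different locations)
$ mkdir -p data/mimic3 data/umls data/english_words
$ mkdir -p bert/ncbi_bert_base bert/ncbi_bert_large
# Then place:
#   data/mimic3/NOTEEVENTS.csv
#   data/umls/LRWD and data/umls/prevariants
#   data/english_words/english.txt
#   the unpacked BlueBERT-Base / BlueBERT-Large files under bert/ncbi_bert_base and bert/ncbi_bert_large
```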
Please run `pretrain_cim_base.sh` (CIM-Base) or `pretrain_cim_large.sh` (CIM-Large) to pretrain the character language model of CIM.
The pre-training will evaluate the LM periodically by correcting synthetic misspellings generated from the MIMIC-III data.
You may need 2-4 GPUs (XXGB+ of GPU memory for CIM-Base and YYGB+ for CIM-Large) to pre-train with a batch size of 256.
There are several options you may want to configure:
- `num_gpus`: number of GPUs
- `batch_size`: batch size
- `training_step`: total number of steps to train
- `init_ckpt` / `init_step`: the checkpoint file / step to resume pre-training from
- `num_beams`: beam search width for evaluation
- `mimic_csv_dir`: directory of the MIMIC-III csv splits
- `bert_dir`: directory of the BlueBERT files
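For example, a pre-training run could be configured like this (a hypothetical sketch: the option names above come from the scripts, but whether they are passed as command-line flags or edited inside `pretrain_cim_base.sh` should be checked in the script itself):

```bash
# Hypothetical invocation; verify how pretrain_cim_base.sh actually reads these options.
$ bash pretrain_cim_base.sh \
    --num_gpus 2 \
    --batch_size 256 \
    --mimic_csv_dir data/mimic3 \
    --bert_dir bert/ncbi_bert_base
```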
You can also download the pre-trained LMs and put them under `model/` (e.g. the CIM-Base checkpoint is placed as `model/cim_base/ckpt-475000.pkl`):
Please specify the dataset directory and the file to evaluate in the evaluation script (`eval_cim_base.sh` or `eval_cim_large.sh`), and run the script.
You may want to set `init_step` to specify the checkpoint you want to load.
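For example, evaluating with the released CIM-Base checkpoint above could look like the following (again a hypothetical sketch of how the option might be supplied; confirm the mechanism in `eval_cim_base.sh`):

```bash
# Hypothetical: evaluate with the checkpoint at step 475000 (model/cim_base/ckpt-475000.pkl).
$ bash eval_cim_base.sh --init_step 475000
```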
@InProceedings{juyong2022context,
title = {Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence},
author = {Kim, Juyong and Weiss, Jeremy C and Ravikumar, Pradeep},
booktitle = {Proceedings of the Conference on Health, Inference, and Learning},
pages = {234--247},
year = {2022},
volume = {174},
series = {Proceedings of Machine Learning Research},
month = {07--08 Apr},
publisher = {PMLR}
}