An unofficial multi-label classifier based on the PLM-ICD paper.
This is basically a personal side project. The goal is to deeply understand the paper and, in the end, provide a more concise and clear implementation that makes customization and extension easier.
Although the model comes from the paper, I tried my best to make this a general program for text multi-label classification tasks.
python -m venv ./_venv --copies
source ./_venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
# deactivate
python -m pytest ./test --cov=./src/plm_icd_multi_label_classifier --durations=0 -v
The ETL contains the following steps:
- Original JSON line dataset preparation.
- Transforming the JSON line file into a limited JSON line file, which means all `list` and `dict` values will be transformed to `string`.
- Data dictionary generation.
Note, the final data folder should contain 4 files: train.jsonl, dev.jsonl, test.jsonl, dict.json.
The data should be in JSON line format; here is a MIMIC-III data ETL program:
python ./bin/etl/etl_mimic3_processing.py ${YOUR_MIMIC3_DATA_DIRECTORY} ${YOUR_TARGET_OUTPUT_DIRECTORY}
When you need to use this program to do text multi-label classification on your own customized dataset, you can just transform it into a JSON line file and use the training config file to specify which field is the text and which is the label.
NOTE, since you are dealing with a multi-label classification task, the label field should be a CSV string, for example:
{"text": "this is a fake text.", "label": "label1,label2,label3,label4"}
So you can also use your own specific dataset, as shown in the sketch below.
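For illustration, here is a minimal sketch of how a custom dataset could be written in this format. The field names `text` and `label` are just examples and must match whatever you set as `text_col` and `label_col` in the training config:

```python
import json

# Hypothetical raw records with label lists; adapt to your own data source.
records = [
    {"text": "this is a fake text.", "labels": ["label1", "label2"]},
    {"text": "another fake text.", "labels": ["label3"]},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        # Labels are joined into a single CSV string, as the program expects.
        row = {"text": rec["text"], "label": ",".join(rec["labels"])}
        f.write(json.dumps(row) + "\n")
```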
Although the dataset uses the JSON line format, `list` and `dict` values are not allowed in the JSON. I believe "flat" JSON makes things clearer, so a tool is provided that can help convert `list` and `dict` values contained in the JSON to `string`:
python ./bin/etl/etl_jsonl2limited_jsonl.py ${ORIGINAL_JSON_LINE_DATASET} ${TRANSFORMED_JSON_LINE_DATASET}
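Conceptually, the conversion does something like the following sketch (a simplified illustration, not the exact logic of `etl_jsonl2limited_jsonl.py`): any `list` or `dict` value is serialized into a string so that every field stays flat.

```python
import json

def flatten_value(value):
    # Lists and dicts are serialized to a string; scalars pass through unchanged.
    if isinstance(value, (list, dict)):
        return json.dumps(value, ensure_ascii=False)
    return value

def to_limited_jsonl(src_path, dst_path):
    with open(src_path, encoding="utf-8") as src, open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = json.loads(line)
            flat = {key: flatten_value(value) for key, value in record.items()}
            dst.write(json.dumps(flat, ensure_ascii=False) + "\n")
```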
NOTE, although you can put the dataset in any directory you like, you HAVE TO name your datasets train.jsonl, dev.jsonl and test.jsonl.
Generate (some) data dictionaries by scanning train, dev and test data. Run:
python ./bin/etl/etl_generate_data_dict.py ${TRAIN_CONFIG_JSON_FILE_PATH}
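The exact contents of dict.json are defined by the script, but conceptually the dictionary is built by scanning the datasets and collecting the label vocabulary. A rough, assumed sketch of that idea:

```python
import json

def build_label_dict(data_dir, label_col="label"):
    # Assumed sketch: collect every label that appears in train/dev/test
    # and assign each one an integer ID.
    labels = set()
    for split in ("train", "dev", "test"):
        with open(f"{data_dir}/{split}.jsonl", encoding="utf-8") as f:
            for line in f:
                labels.update(json.loads(line)[label_col].split(","))
    return {label: idx for idx, label in enumerate(sorted(labels))}
```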
CUDA_VISIBLE_DEVICES=0,1,2,3 python ./train.py ${TRAIN_CONFIG_JSON_FILE_PATH}
The format should be JSON; most of the parameters are easy to understand if you are an MLE or researcher:
- `chunk_size`: Number of token IDs in each chunk.
- `chunk_num`: The number of chunks each text/document should have; short texts are padded first.
- `hf_lm`: HuggingFace language model name/path. Each `hf_lm` may have a different `lm_hidden_dim`. I personally tried 2 LMs:
  - "distilbert-base-uncased" with `lm_hidden_dim` as 768
  - "medicalai/ClinicalBERT" with `lm_hidden_dim` as 768
- `lm_hidden_dim`: The language model's hidden output layer's dimension.
- `data_dir`: Data directory, which should contain at least two of the files generated by `etl_mimic3_processing.py`:
  - train.jsonl
  - dev.jsonl
  - (test.jsonl)
- `training_engine`: Training engine, can be "torch" or "ray". The torch mode is mainly used for debugging and does not support distributed training.
- `single_worker_batch_size`: Each worker's batch size. Note that when training with the "torch" engine, there is only one worker.
- `lr`: Initial learning rate.
- `epochs`: Number of training epochs.
- `gpu`: Whether to use GPU for training.
- `workers`: Number of workers in distributed training. This is only effective when using "ray" as the training engine.
- `single_worker_eval_size`: Each worker's maximum evaluation sample size. Again, when using "torch" as the training engine, there is only one worker.
- `random_seed`: Random seed, which makes sure you can 100% reproduce the training.
- `text_col`: Text column name in the train/dev/test JSON line dataset.
- `label_col`: Label column name in the train/dev/test JSON line dataset.
- `ckpt_dir`: Checkpoint directory name.
- `log_period`: How many batches pass between evaluation log printings.
- `dump_period`: How many steps pass between checkpoint dumps.
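As an illustration only (the values below are placeholders, not tuned recommendations; refer to `train_mimic3_icd.json` for a real config), a training config covering the fields above could be generated like this:

```python
import json

# Placeholder values for illustration only.
config = {
    "chunk_size": 512,
    "chunk_num": 4,
    "hf_lm": "distilbert-base-uncased",
    "lm_hidden_dim": 768,
    "data_dir": "./_data/etl/mimic3/",
    "training_engine": "ray",
    "single_worker_batch_size": 4,
    "lr": 5e-5,
    "epochs": 3,
    "gpu": True,
    "workers": 4,
    "single_worker_eval_size": 1024,
    "random_seed": 42,
    "text_col": "text",
    "label_col": "label",
    "ckpt_dir": "./_ckpt/",
    "log_period": 50,
    "dump_period": 500,
}

with open("my_train_config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
```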
Suppose you put the original MIMIC-III data under `./_data/raw/mimic3/`, like:
./_data/raw/mimic3/
├── DIAGNOSES_ICD.csv
├── NOTEEVENTS.csv
└── PROCEDURES_ICD.csv
0 directories, 3 files
This step joins the necessary tables' data together and builds the training dataset. Suppose we are going to put the training data under `./_data/etl/mimic3/`; by this program's rules, the directory should contain 3 files, train.jsonl, dev.jsonl and test.jsonl, like:
./_data/etl/mimic3/
├── dev.jsonl
├── dict.json
├── dim_processed_base_data.jsonl
├── test.jsonl
└── train.jsonl
0 directories, 5 files
You can run:
python ./bin/etl/etl_mimic3_processing.py ./_data/raw/mimic3/ ./_data/etl/mimic3/
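Under the hood, this amounts to joining the note events with the diagnosis and procedure codes per admission. A rough, assumed sketch of that join (not the script's actual code):

```python
import pandas as pd

# Assumed simplification: join notes with ICD codes per admission (HADM_ID).
notes = pd.read_csv("./_data/raw/mimic3/NOTEEVENTS.csv", usecols=["HADM_ID", "TEXT"])
diag = pd.read_csv("./_data/raw/mimic3/DIAGNOSES_ICD.csv", usecols=["HADM_ID", "ICD9_CODE"])
proc = pd.read_csv("./_data/raw/mimic3/PROCEDURES_ICD.csv", usecols=["HADM_ID", "ICD9_CODE"])

# Collapse each admission's ICD codes into one CSV-string label field.
labels = (
    pd.concat([diag, proc])
    .dropna()
    .groupby("HADM_ID")["ICD9_CODE"]
    .agg(lambda s: ",".join(sorted(set(map(str, s)))))
    .rename("label")
    .reset_index()
)
dataset = notes.dropna(subset=["HADM_ID"]).merge(labels, on="HADM_ID")
```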
The `data_dir` in this config will be needed by the next ETL step; you can just refer to `train_mimic3_icd.json`.
Note, this step is unnecessary here, since the outputs of `./bin/etl/etl_mimic3_processing.py` are already limited JSON line files, so even if you run the following program, you will get exactly the same files:
python ./bin/etl/etl_jsonl2limited_jsonl.py ./_data/raw/mimic3/${INPUT_JSONL_FILE} ./_data/raw/mimic3/${OUTPUT_JSONL_FILE}
CUDA_VISIBLE_DEVICES=0,1,2,3 python ./train.py ./train_mimic3_icd.json
- After `chunk_size` and `chunk_num` are defined, each text's token ID length is fixed to `chunk_size * chunk_num`. If a text is not long enough, it is automatically padded first (see the sketch below).
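A minimal sketch of that chunking behavior, assuming padding with a pad token ID at the front for short texts and truncation for long ones (the exact padding side is an implementation detail of the real code):

```python
def chunk_token_ids(token_ids, chunk_size, chunk_num, pad_id=0):
    # Fix the total length to chunk_size * chunk_num, then split into chunks.
    total = chunk_size * chunk_num
    if len(token_ids) < total:
        token_ids = [pad_id] * (total - len(token_ids)) + token_ids  # assumed front padding
    else:
        token_ids = token_ids[:total]
    return [token_ids[i * chunk_size:(i + 1) * chunk_size] for i in range(chunk_num)]
```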