Learning from Negative Samples in Generative Biomedical Entity Linking Framework
ANGEL is a novel framework designed to enhance generative biomedical entity linking (BioEL) by incorporating both positive and negative samples during training. Traditional generative models primarily focus on positive samples during training, which can limit their ability to distinguish between similar entities. We address this limitation by using direct preference optimization with negative samples, significantly enhancing model accuracy.
Key features of ANGEL include:
- Negative-aware Learning: Enhances the model's ability to differentiate between similar entities by using both correct and incorrect examples during training.
- Memory Efficiency: Retains the low memory footprint characteristic of generative models by avoiding the need for large pre-computed entity embeddings.
For a detailed description of our method, please refer to our paper:
ANGEL requires two separate virtual environments: one for positive-only training and another for negative-aware training. Please ensure that CUDA version 11.1 is installed for optimal performance.
To set up the environments and install the required dependencies, run the following script:
bash script/environment/set_environment.sh
Most datasets (i.e., NCBI-disease, BC5CDR, COMETA, and AskAPatient) were used as provided by GenBioEL, and MedMentions was processed in a similar manner using the GenBioEL code. If you need the pre-processing code, you can find it in the GenBioEL repository. To download these datasets and set up the experimental environment, execute the following steps:
bash script/dataset/process_dataset.sh
For training, prepare the data in the following format:
- source: Input text including marked mentions with START and END.
- ex) ['Ocular manifestations of START juvenile rheumatoid arthritis END.']
- target: Pair of the prefix ['mention is'] and the ['target entity'].
- ex) ['juvenile rheumatoid arthritis is ', 'juvenile rheumatoid arthritis']
A Prefix Tree (or Trie) is a type of tree data structure used to efficiently store and manage a set of strings. Each node in the Trie represents a single token, and entire strings are formed by tracing a path from the root to a specific node. In this context, we tokenize target entities and build a Trie structure to restrict the output space to the target knowledge base. To construct the Trie, you need to create a target_kb.json file formatted as a dictionary like below.
{
"C565588": ["epidermolysis bullosa with diaphragmatic hernia"],
"C567755": ["tooth agenesis selective 6", "sthag6"],
"C565584": ["epithelial squamous dysplasia keratinizing desquamative of urinary tract"],
...
}
To experiment with your own dataset, preprocess it in the format described above.
We conducted positive-only pre-training using the code from GenBioEL. If you wish to replicate this, follow the instructions provided in the GenBioEL repository.
Negative-aware pre-training code will be open-sourced shortly. The pre-trained model utilized in this project is uploaded on Hugging Face. If you want to use negative-aware pre-trained model, you can use the model with the following script:
from transformers import AutoModel
# Load the model
model = AutoModel.from_pretrained("dmis-lab/ANGEL_pretrained")
To fine-tune downstream dataset using positive-only training, run:
DATASET=ncbi # bc5cdr, cometa, aap, mm
LEARNING_RATE=3e-7 # 1e-5, 2e-5
STEPS=20000 # 30000, 40000
bash script/train/train_positive.sh $DATASET $LEARNING_RATE $STEPS
The script for other datasets is in the train_positive.sh file.
For negative-aware fine-tuning on the downstream dataset, execute:
DATASET=ncbi # bc5cdr, cometa, aap, mm
LEARNING_RATE=2e-5 # 1e-5, 2e-5
bash script/train/train_negative.sh $DATASET $LEARNING_RATE
The script for other datasets is in the train_negative.sh file.
To perform inference with our best model hosted on Huggingface, use the following script:
DATASET=ncbi # bc5cdr, cometa, aap, mm
bash script/inference/inference.sh $DATASET
The results file in your model folder contains the final scores:
{
"count_top1": 92.812,
"count_top2": 94.062,
"count_top3": 95.208,
"count_top4": 95.625,
"count_top5": 95.729,
...
}
Additionally, the file lists candidates for each mention, indicating correctness:
{
"correctness": "correct",
"given_mention": "non inherited breast carcinomas",
"result": [
" breast carcinomas",
" breast cancer",
...
],
"cui_label": [
"D001943"
],
"cui_result": [
["D001943"],
["114480","D001943"],
...
]
}
We utilized five popular BioEL benchmark datasets: NCBI-disease (NCBI), BC5CDR, COMETA, AskAPatient (AAP), and MedMentions ST21pv (MM-ST21pv).
Model | Dataset | ||||
---|---|---|---|---|---|
NCBI | BC5CDR | COMETA | AAP | MM-ST21pv | |
BioSYN (Sung et al., 2020) | 91.1 | - | 71.3 | 82.6 | - |
SapBERT (Liu et al., 2021) | 92.3 | - | 75.1 | 89.0 | - |
GenBioEL (Yuan et al., 2022b) | 91.0 | 93.1 | 80.9 | 89.3 | 70.7 |
ANGEL (Ours) | 92.8 | 94.5 | 82.8 | 90.2 | 73.3 |
- The scores of GenBioEL were reproduced.
- We excluded the performance of BioSYN and SapBERT on BC5CDR, as they were evaluated separately on the chemical and disease subsets, differing from our settings.
Pre-trained | NCBI | BC5CDR | COMETA | MM-ST21pv |
---|---|---|---|---|
ANGEL_pretrained | ANGEL_ncbi | ANGEL_bc5cdr | ANGEL_cometa | ANGEL_mm |
- The AskAPatient dataset does not have a predefined split; therefore, we utilized a 10-fold cross-validation method to evaluate our model. As a result, there are 10 model checkpoints corresponding to the AskAPatient dataset. Due to this, we have not open-sourced the checkpoints for this dataset.
To run the model without any need for a preprocessed dataset, you can use the run_sample.sh script.
bash script/inference/run_sample.sh ncbi
If you want to modify the sample or input, you can change the script in run_sample.py.
- input_sentence: The sentence that includes the entity you want to normalize. Make sure the entity is enclosed within START and END markers.
- prefix_sentence: A sentence that introduces the context. This should follow the format "entity is".
- candidates: A list of candidate entities that the model will use to attempt normalization.
if __name__ == '__main__':
config = get_config()
# Define the input sentence, marking the entity of interest with START and END
input_sentence = "The r496h mutation of arylsulfatase a does not cause START metachromatic leukodystrophy END"
# Define the prefix sentence to provide context
prefix_sentence = "Metachromatic leukodystrophy is"
# List your candidate entities for normalization
candidates = ["adrenoleukodystrophy", "thrombosis", "anemia", "huntington disease", "leukodystrophy metachromatic"]
run_sample(config, input_sentence, prefix_sentence, candidates)
# Expected Output
Input : The r496h mutation of arylsulfatase a does not cause START metachromatic leukodystrophy END.
Output : Metachromatic leukodystrophy is leukodystrophy metachromatic
By modifying input_sentence, prefix_sentence, and candidates, you can tailor the examples used by the model to fit your specific needs.
If you are interested in our work, please cite:
@article{kim2024learning,
title={Learning from Negative Samples in Generative Biomedical Entity Linking},
author={Kim, Chanhwi and Kim, Hyunjae and Park, Sihyeon and Lee, Jiwoo and Sung, Mujeen and Kang, Jaewoo},
journal={arXiv preprint arXiv:2408.16493},
year={2024}
}
This code includes modifications based on the code of GenBioEL. We are grateful to the authors for providing their code/models as open-source software.