DiseaseTagger is a deep-learning-based tool that identifies disease ontology terms in raw text. It was built by retraining the BioBERT large v1.1 base model.
In terms of NLP task types, DiseaseTagger performs named-entity recognition (NER).
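To make the NER task concrete, here is a minimal sketch of how per-token tags can be grouped into disease mentions. It assumes the BIO scheme with the label names O, B-Disease, and I-Disease (the names used by the NCBI disease corpus); the function name bio_to_spans and the example sentence are illustrative and not part of this repository.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, first_index, last_index) spans."""
    spans = []
    current = []   # tokens of the entity currently being built
    start = None   # index of that entity's first token
    for i, (token, tag) in enumerate(zip(tokens, tags)):
        if tag == "B-Disease":
            # A new entity begins; close the previous one if it is still open.
            if current:
                spans.append((" ".join(current), start, i - 1))
            current, start = [token], i
        elif tag == "I-Disease" and current:
            # Continuation of the open entity.
            current.append(token)
        else:
            # "O", or a stray "I-" tag with no open entity: close any open span.
            if current:
                spans.append((" ".join(current), start, i - 1))
            current, start = [], None
    if current:
        spans.append((" ".join(current), start, len(tokens) - 1))
    return spans


# Hand-written example; these tags are not real model output.
tokens = ["Mutations", "in", "BRCA1", "cause", "breast", "cancer",
          "and", "ovarian", "cancer", "."]
tags = ["O", "O", "O", "O", "B-Disease", "I-Disease",
        "O", "B-Disease", "I-Disease", "O"]
print(bio_to_spans(tokens, tags))
# [('breast cancer', 4, 5), ('ovarian cancer', 7, 8)]
```

Returning token indices alongside the text keeps it possible to map each mention back to its position in the input.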
Please note that DiseaseTagger is a personal project that I use to practice the retraining process. Because DiseaseTagger was retrained on a small dataset (only 5430 data points), it may not be useful for real work. For my daily workflow, I therefore use PhenoTagger, a hybrid deep-learning-based and dictionary-based method, to perform NER. I have extended PhenoTagger to cover species, compound names, and disease ontology. A related repository will be released later.
To use DiseaseTagger, download the whole repository directly from GitHub. If you want to prepare the data and models yourself, see the sections below.
This tool was built and tested on a Mac M2 platform. It needs a special version of TensorFlow; please see the requirements file.
To install all packages:
pip install -r requirements.txt
To check the key versions and whether GPU is available:
cd Python
python Check_sys_gpu.py
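The contents of Check_sys_gpu.py are not shown here. As a rough idea of what such a check can look like, the sketch below reports the Python version, the machine architecture, selected package versions, and the GPUs that TensorFlow can see. The function name describe_environment and the default package list are assumptions for illustration; adjust the names to whatever requirements.txt actually installs.

```python
import importlib.metadata
import platform


def describe_environment(packages=("tensorflow", "transformers")):
    """Collect Python, platform, package-version, and GPU information."""
    info = {
        "python": platform.python_version(),
        "machine": platform.machine(),  # e.g. "arm64" on Apple silicon
        "packages": {},
        "gpus": [],
    }
    for name in packages:
        try:
            info["packages"][name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            info["packages"][name] = None  # distribution not installed
    try:
        import tensorflow as tf
        # list_physical_devices returns the devices TensorFlow can see.
        info["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    except ImportError:
        pass  # TensorFlow is missing: leave the GPU list empty
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

An empty GPU list means TensorFlow will run on the CPU, which makes retraining noticeably slower.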
We use BioBERT large v1.1 as our base model (aka pretrained model, foundation model).
To download the model:
cd Python
python Download_model.py dmis-lab/biobert-v1.1 --save_directory ../models/biobert-v1.1-20240414 --show_parameter False --show_layer0 False
The dataset is the NCBI disease corpus.
To download the dataset:
cd Python
python Download_dataset.py ncbi_disease --save_directory ../datasets
To retrain the BioBERT model:
cd Python
python Retrain_model.py --model_dir ../models/biobert-v1.1-20240414 --dataset_dir ../datasets/ncbi_disease --output_dir ../models/biobert-v1.1-20240415 --num_epochs 3
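Retraining a BERT-style model for NER usually involves one non-obvious step: the dataset is labeled per word, but the tokenizer splits words into subword pieces, so the labels have to be aligned to the tokens. Whether Retrain_model.py does exactly this is not stated here. The sketch below shows the common approach, assuming label IDs 0 = O, 1 = B-Disease, 2 = I-Disease and a word_ids list like the one a Hugging Face fast tokenizer returns. The function name align_labels and the subword split in the example are illustrative.

```python
IGNORE_INDEX = -100  # label value conventionally skipped when computing the loss


def align_labels(word_ids, word_labels, label_all_subwords=False):
    """Expand word-level labels to one label per subword token.

    word_ids gives, for each token, the index of the word it came from,
    or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: never contributes to the loss.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First subword of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later subwords are either ignored or given the same label.
            aligned.append(word_labels[word_id] if label_all_subwords else IGNORE_INDEX)
        previous = word_id
    return aligned


# Illustrative split of "cystic fibrosis patients" (not real tokenizer output):
# [CLS] cy ##stic fi ##bro ##sis patients [SEP]
word_ids = [None, 0, 0, 1, 1, 1, 2, None]
word_labels = [1, 2, 0]  # B-Disease, I-Disease, O
print(align_labels(word_ids, word_labels))
# [-100, 1, -100, 2, -100, -100, 0, -100]
```

Labeling only the first subword keeps each word counted once in the loss; passing label_all_subwords=True is the alternative when every token should contribute.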
To identify disease ontology terms in raw text:
cd Python
python Tag_text.py --model_dir "../models/biobert-v1.1-20240414" --text "your_text" --output_file "../results/ner_result.csv"
Much of the code is adapted from perkdrew's GitHub repository and modified for this project.