A BERT-based named-entity recognition (NER) tool for tagging diseases in raw text

pocession/DiseaseTagger

DiseaseTagger

DiseaseTagger is a deep-learning-based tool that identifies disease terms in raw text. It is built by retraining (fine-tuning) the BioBERT v1.1 base model (dmis-lab/biobert-v1.1).

In terms of NLP task types, DiseaseTagger performs named-entity recognition (NER).

Please note that DiseaseTagger is a personal project that I use to practice the retraining process. Because it is retrained on a small dataset (only 5430 data points), it may not be accurate enough for real work. For my daily workflow I instead use PhenoTagger, a hybrid deep-learning and dictionary-based method, to perform NER. I have extended PhenoTagger to cover species, compound names, and disease ontology. A repository for that work will be released later.

Usage

To use DiseaseTagger, download the whole repository from GitHub. If you want to prepare the data and models yourself, see the sections below.

Packages

This tool was built and tested on a Mac with an M2 chip. It needs a specific version of TensorFlow; please see the requirements file (requirements.txt).

To install all packages:

pip install -r requirements.txt

Check key packages and GPU

To check the versions of the key packages and whether a GPU is available:

cd Python
python Check_sys_gpu.py
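
The kind of check described above can be sketched in a few lines. This is only an illustration, not the contents of Check_sys_gpu.py: the function names and the package list are assumptions, and the GPU query assumes TensorFlow is the framework in use, as the requirements suggest.

```python
from importlib import metadata


def package_versions(names):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def tensorflow_gpus():
    """Return the names of GPUs visible to TensorFlow, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf  # imported lazily so the script still runs without it
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


# The package list below is hypothetical; adjust it to match requirements.txt.
for package, version in package_versions(["tensorflow", "transformers", "datasets"]).items():
    print(f"{package}: {version or 'not installed'}")
print("GPUs:", tensorflow_gpus())
```

An empty list from tensorflow_gpus() means TensorFlow is installed but sees no GPU, in which case retraining will run on the CPU.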

Model

We use BioBERT v1.1 (dmis-lab/biobert-v1.1) as our base model (also called the pretrained model or foundation model).

To download the model:

cd Python
python Download_model.py dmis-lab/biobert-v1.1 --save_directory ../models/biobert-v1.1-20240414 --show_parameter False --show_layer0 False
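
If you prefer to fetch the model files without the script, one possible approach is the huggingface_hub library. The sketch below is an assumption about how such a download could be done, not a description of what Download_model.py does internally; the function name download_model is hypothetical.

```python
from pathlib import Path


def download_model(repo_id, save_directory):
    """Download every file of a Hugging Face Hub model repository into save_directory."""
    # Imported lazily so that defining this function does not require the library.
    from huggingface_hub import snapshot_download

    target = Path(save_directory)
    target.mkdir(parents=True, exist_ok=True)
    # snapshot_download returns the local folder that now holds the files.
    return snapshot_download(repo_id=repo_id, local_dir=str(target))


# Example call (requires network access), mirroring the command above:
# download_model("dmis-lab/biobert-v1.1", "../models/biobert-v1.1-20240414")
```

This downloads the raw weight and tokenizer files only; it does not print parameters or layers the way the --show_parameter and --show_layer0 options suggest the script can.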

Dataset

The dataset is the NCBI disease corpus (ncbi_disease).

To download the dataset:

cd Python
python Download_dataset.py ncbi_disease --save_directory ../datasets
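
To give a feel for the data, the sketch below shows how one record is laid out in the ncbi_disease dataset as published on the Hugging Face Hub: a list of tokens plus an integer tag per token, using the labels O, B-Disease and I-Disease. The sentence itself is a made-up illustration rather than a row from the corpus, and the on-disk format written by Download_dataset.py may differ.

```python
# Label names in the order used by the ncbi_disease "ner_tags" field.
LABEL_NAMES = ["O", "B-Disease", "I-Disease"]

# Hypothetical record for illustration only (not copied from the dataset).
example = {
    "tokens": ["Mutations", "in", "BRCA1", "cause", "breast", "cancer", "."],
    "ner_tags": [0, 0, 0, 0, 1, 2, 0],
}


def decode_tags(record, label_names=LABEL_NAMES):
    """Pair each token with the string name of its integer tag."""
    return [(token, label_names[tag]) for token, tag in zip(record["tokens"], record["ner_tags"])]


for token, label in decode_tags(example):
    print(f"{token}\t{label}")
```

B-Disease marks the first token of a disease mention and I-Disease marks the tokens that continue it, so "breast cancer" is tagged as one two-token mention.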

Retrain the model

To retrain the BioBERT model:

cd Python
python Retrain_model.py --model_dir ../models/biobert-v1.1-20240414 --dataset_dir ../datasets/ncbi_disease --output_dir ../models/biobert-v1.1-20240415 --num_epochs 3
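
One step that retraining a BERT-style model for NER usually needs is aligning word-level tags with subword tokens, because the tokenizer may split one word into several pieces. The sketch below shows a common convention (label the first piece, mask the rest with -100 so the loss ignores them). It is an assumption about a typical pipeline, not a description of Retrain_model.py; align_labels and the sample inputs are hypothetical.

```python
IGNORE_INDEX = -100  # conventional "ignore this position" label in token-classification training


def align_labels(word_ids, word_labels):
    """Map word-level labels onto subword tokens.

    word_ids holds, for each subword token, the index of the word it came from,
    or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous_word:
            aligned.append(word_labels[word_id])  # first piece of a word keeps the label
        else:
            aligned.append(IGNORE_INDEX)          # later pieces are masked out
        previous_word = word_id
    return aligned


# Three words, the second split into two pieces, wrapped in two special tokens.
print(align_labels([None, 0, 1, 1, 2, None], [0, 1, 2]))
# → [-100, 0, 1, -100, 2, -100]
```

With a fast Hugging Face tokenizer, the word_ids list can be obtained from the word_ids() method of the encoding returned for pre-split input.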

Perform NER task

To identify disease terms in raw text:

cd Python
python Tag_text.py --model_dir "../models/biobert-v1.1-20240415" --text "your_text" --output_file "../results/ner_result.csv"
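
After the model has tagged each token, the B-/I- tags still need to be grouped into whole disease mentions before they can be saved. The sketch below shows one way to do that and to write the result as CSV. It is a minimal illustration under stated assumptions: the function names, the single "entity" column, and the sample sentence are hypothetical, and the actual columns written by Tag_text.py may differ.

```python
import csv
import io


def extract_entities(tokens, tags):
    """Group B-/I- tagged tokens into space-joined entity strings."""
    entities = []
    current = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                      # close the previous mention
                entities.append(" ".join(current))
            current = [token]                # start a new mention
        elif tag.startswith("I-") and current:
            current.append(token)            # continue the open mention
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:                              # mention that runs to the end of the text
        entities.append(" ".join(current))
    return entities


def write_entities_csv(entities, file_obj):
    """Write one entity per row under a single 'entity' header (hypothetical layout)."""
    writer = csv.writer(file_obj)
    writer.writerow(["entity"])
    for entity in entities:
        writer.writerow([entity])


tokens = ["Patients", "with", "cystic", "fibrosis", "and", "asthma", "."]
tags = ["O", "O", "B-Disease", "I-Disease", "O", "B-Disease", "O"]
found = extract_entities(tokens, tags)
print(found)  # → ['cystic fibrosis', 'asthma']

buffer = io.StringIO()
write_entities_csv(found, buffer)
```

To write to a real file instead of an in-memory buffer, open it with open(path, "w", newline="") and pass the file object to write_entities_csv.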

Credit

Much of the code is adapted from perkdrew's GitHub repository and modified for this project.
