NER

Named-entity recognition (NER) is a subtask of information extraction that seeks to locate named entities mentioned in unstructured text and classify them into pre-defined categories. In biomedical literature, NER is usually used to identify gene names, diseases, organisms, cell types, etc. Natural language processing (NLP) technology has reshaped how NER is done today.

This repository collects various tools and demonstrates how to use them to identify named entities in the biological field. Most tools in this repository depend on one or several large language models, which are accessed through API calls.

PubTator3

PubTator3 uses a high-performance entity search engine that normalizes different forms of the same entity to a unique standardized name and returns all matching articles. It is developed and maintained by the NIH.

Example 1: OECD in vitro testing guidelines

Here I use the PubTator3 API to demonstrate how to extract entities from the raw text of the OECD in vitro test guidelines, a set of internationally recognized protocols and standards designed to evaluate the safety and efficacy of chemicals, pharmaceuticals, and other substances using non-animal testing methods. The goal is to identify which cell lines and genes are used in each testing guideline.

Data preparation

Web scraping

Text is extracted from the OECD iLibrary search results for the keyword "in vitro". The description of each in vitro test is extracted from the HTML and saved as a txt file. For more details, please check the scraper notebook.
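The HTML-to-text step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code from the scraper notebook; the class and function names (`TextExtractor`, `html_to_text`) are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML fragment as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    # Placeholder HTML, not real OECD iLibrary markup.
    page = "<div><h1>Guideline title</h1><p>Guideline description.</p></div>"
    print(html_to_text(page))
```

A dedicated HTML library may handle malformed pages more gracefully; the standard-library parser is used here only to keep the sketch dependency-free.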

Formatting

The PubTator3 API requires txt files as input, so the content of each testing guideline is saved as a separate txt file. Please check the PubTator3 API data input process notebook.
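Writing one txt file per guideline might look like the sketch below. The function name `save_guidelines` and the file-naming rule are assumptions for illustration; the notebook may organize its input differently.

```python
from pathlib import Path


def save_guidelines(guidelines, out_dir):
    """Write each guideline text to <out_dir>/<name>.txt and return the paths.

    `guidelines` maps a guideline name to its plain-text content.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, text in guidelines.items():
        # Replace anything that is not alphanumeric, '-' or '_' so the
        # name is safe to use as a file name.
        safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in name)
        target = out_path / f"{safe}.txt"
        target.write_text(text.strip() + "\n", encoding="utf-8")
        written.append(target)
    return written
```

For example, `save_guidelines({"guideline 1": "some text"}, "../OECD/Pubtator_Input")` would create `guideline_1.txt` in the input folder used by the commands in the next section.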

NER by PubTator3 API

The process consists of two steps. The first is to submit your raw text and get a session number. The second is to use this session number to retrieve the annotated data.

Submit request:

cd ./PubTator3/
python SubmitText_request.py ../OECD/Pubtator_Input All SessionNumberFile.txt

After a few minutes (usually about 5), run the following to retrieve the data:

cd ./PubTator3/
python SubmitText_retrieve.py ../OECD/Pubtator_Input SessionNumberFile.txt ../OECD/Pubtator_Output/

Summarization

Check the summary notebook for more details. The analysis covers 22 OECD in vitro testing guidelines, which together involve 11 gene targets and 6 human cell lines.
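Counting genes and cell lines can be sketched as below. This assumes the retrieved output is in the tab-separated PubTator format (document ID, start, end, mention, type, identifier), which this README does not state; the type labels `Gene` and `CellLine` and the function names are likewise assumptions, and the sample data is made up.

```python
from collections import defaultdict


def parse_annotations(pubtator_text):
    """Yield (doc_id, mention, entity_type, identifier) for annotation lines."""
    for line in pubtator_text.splitlines():
        fields = line.split("\t")
        # Title/abstract lines use '|' separators and have no tabs; skip
        # them and any other line with fewer than five tab-separated fields.
        if len(fields) < 5:
            continue
        doc_id, _start, _end, mention, entity_type = fields[:5]
        identifier = fields[5] if len(fields) > 5 else ""
        yield doc_id, mention, entity_type, identifier


def summarize(pubtator_text, wanted_types=("Gene", "CellLine")):
    """Return the sorted unique identifiers found for each wanted entity type."""
    summary = defaultdict(set)
    for _doc, mention, entity_type, identifier in parse_annotations(pubtator_text):
        if entity_type in wanted_types:
            # Fall back to the raw mention when no normalized ID is given.
            summary[entity_type].add(identifier or mention)
    return {etype: sorted(ids) for etype, ids in summary.items()}


if __name__ == "__main__":
    # Made-up sample with placeholder mentions and identifiers.
    sample = (
        "doc1|t|Placeholder title\n"
        "doc1\t0\t5\tgeneA\tGene\tID1\n"
        "doc1\t10\t15\tcellX\tCellLine\tID2\n"
    )
    print(summarize(sample))
```

Grouping by normalized identifier rather than by mention avoids counting the same gene twice when it appears under different names.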

Example 2: Aging-related genes

Researchers in fundamental science often build phenotype- or disease-associated gene lists as a prescreening step. Here I demonstrate how to build an aging-related gene list using the PubTator3 API. I first extract the relations (associate) between a disease (Aging_premature) and genes from the PubMed literature. After a preliminary filtering step, the list is ready for further use.

Relation extraction by PubTator3 API

The process consists of two steps. The first is to get a list of disease ontology terms related to aging. The second is to use the aging-related disease ID to get the associated genes. Please check the aging-associated gene identification notebook for more details.
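The two requests can be sketched as URL builders. The base URL, the endpoint paths (`entity/autocomplete/` and `relations`), and the query parameter names are assumptions based on my reading of the public PubTator3 API documentation and should be confirmed there before use; the function names are hypothetical and no request is sent.

```python
from urllib.parse import urlencode

# Assumed base URL; confirm against the current PubTator3 API documentation.
BASE_URL = "https://www.ncbi.nlm.nih.gov/research/pubtator3-api"


def autocomplete_url(query):
    """Step 1: URL to look up entity IDs matching a free-text query."""
    return f"{BASE_URL}/entity/autocomplete/?{urlencode({'query': query})}"


def relations_url(entity_id, relation_type="associate", target_type="gene"):
    """Step 2: URL to list entities of `target_type` related to `entity_id`."""
    params = {"e1": entity_id, "type": relation_type, "e2": target_type}
    return f"{BASE_URL}/relations?{urlencode(params)}"


if __name__ == "__main__":
    print(autocomplete_url("aging"))
    print(relations_url("@DISEASE_AGING_PREMATURE"))
```

`urlencode` percent-encodes the leading `@` of the entity ID as `%40`, which is the standard encoding for that character in a query string.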

Summarization

Check the aging-associated gene identification notebook for more details. The disease ontology term @DISEASE_AGING_PREMATURE has 391 associated genes.
