Script to extract the text from different sections and other metadata of the available Full text XML files from EuropePMC for trimAl

nicodr97/parseEuropePMC-trimAl

 
 


parseEuropePMC-trimAl

Script to extract the text of different sections, along with other metadata, from the full-text XML files available from EuropePMC.

It extracts the content of the following sections as plain text: Introduction, Methods, Results and Discussion. You can modify the code to extract these sections in XML format instead. It also retrieves the Supplementary data in XML format.
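As a rough illustration of this kind of extraction, the sketch below pulls plain text out of `<sec>` elements whose title mentions one of the four section names. It is not the script's own code: the function name `extract_sections`, the title-matching rule and the sample XML are assumptions made for this example, and real EuropePMC files may label sections differently.

```python
import xml.etree.ElementTree as ET

# Section names come from the README; matching them against <title> text
# is an assumption of this sketch.
WANTED = ("introduction", "methods", "results", "discussion")


def extract_sections(xml_text):
    """Return {section name: plain text} for top-level <sec> elements in <body>."""
    root = ET.fromstring(xml_text)
    sections = {}
    for sec in root.findall("./body/sec"):
        title = sec.findtext("title", default="").strip().lower()
        for name in WANTED:
            if name in title:
                # itertext() flattens inline markup such as <italic> into text;
                # split()/join() normalises whitespace inside each paragraph.
                paragraphs = [
                    " ".join("".join(p.itertext()).split()) for p in sec.iter("p")
                ]
                sections[name] = " ".join(paragraphs)
                break
    return sections


# Hypothetical, heavily simplified article used only to exercise the function.
SAMPLE = """<article><body>
<sec><title>Introduction</title><p>Alignments were <italic>trimmed</italic>.</p></sec>
<sec><title>Materials and Methods</title><p>We used trimAl.</p><p>Second paragraph.</p></sec>
</body></article>"""

sections = extract_sections(SAMPLE)
```

Keeping the `<sec>` element itself (for example with `ET.tostring(sec)`) instead of flattening it would be one way to obtain the XML form mentioned above.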

The metadata output is the following: version, parameters, keywords and year.
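A minimal sketch of how two of these fields, keywords and year, might be read from an article is shown below. It assumes the XML uses `<kwd>` and `<pub-date><year>` elements; the function name and sample document are hypothetical. The version and parameters fields are not covered, since the README does not say how they are identified.

```python
import xml.etree.ElementTree as ET


def extract_keywords_and_year(xml_text):
    """Return (list of keywords, publication year or None)."""
    root = ET.fromstring(xml_text)
    # Collect every <kwd> element, flattening any inline markup.
    keywords = ["".join(k.itertext()).strip() for k in root.iter("kwd")]
    # Take the first <year> found under a <pub-date>, if there is one.
    year_text = (root.findtext(".//pub-date/year") or "").strip()
    year = int(year_text) if year_text.isdigit() else None
    return keywords, year


# Hypothetical front matter used only for illustration.
SAMPLE_FRONT = """<article><front><article-meta>
<pub-date pub-type="epub"><year>2019</year></pub-date>
<kwd-group><kwd>alignment trimming</kwd><kwd>phylogenetics</kwd></kwd-group>
</article-meta></front></article>"""

keywords, year = extract_keywords_and_year(SAMPLE_FRONT)
```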

All the data is then stored in a SQLite database.
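The storage step could look roughly like the sketch below, which uses the standard `sqlite3` module. The table name, column names and the placeholder identifier `PMC0000000` are assumptions for illustration; the schema the script actually creates is not described here.

```python
import sqlite3


def store_sections(conn, pmcid, sections):
    """Insert one row per extracted section for the given article."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sections ("
        "pmcid TEXT NOT NULL, name TEXT NOT NULL, content TEXT, "
        "PRIMARY KEY (pmcid, name))"
    )
    # Parameterised queries avoid quoting problems in article text.
    conn.executemany(
        "INSERT INTO sections (pmcid, name, content) VALUES (?, ?, ?)",
        [(pmcid, name, text) for name, text in sections.items()],
    )
    conn.commit()


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
store_sections(conn, "PMC0000000", {"introduction": "Some text.", "methods": "More text."})
rows = conn.execute("SELECT name, content FROM sections ORDER BY name").fetchall()
```

Passing a file path such as `articles.db` (a hypothetical name) to `sqlite3.connect` would write the same table to disk.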

The only input needed is a list of PMCIDs, although PMIDs can also be used. In that case, the file used to convert from one identifier to the other must be downloaded first. The conversion itself is done automatically when the --pmc parameter is given.
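The identifier conversion can be pictured as a simple lookup, as in the sketch below. It assumes the mapping file is a CSV with `PMID` and `PMCID` columns; that layout, the function names and the sample values are assumptions for this example, not a description of what the script does internally.

```python
import csv
import io


def load_pmid_to_pmcid(handle):
    """Build a {PMID: PMCID} dictionary from a CSV file-like object."""
    mapping = {}
    for row in csv.DictReader(handle):
        pmid = (row.get("PMID") or "").strip()
        pmcid = (row.get("PMCID") or "").strip()
        # Skip rows where either identifier is missing.
        if pmid and pmcid:
            mapping[pmid] = pmcid
    return mapping


def convert(pmids, mapping):
    """Split PMIDs into converted PMCIDs and PMIDs with no known PMCID."""
    found = [mapping[p] for p in pmids if p in mapping]
    missing = [p for p in pmids if p not in mapping]
    return found, missing


# Made-up identifiers; a real mapping file would be opened with open(...).
sample_csv = io.StringIO("PMID,PMCID\n111,PMC1\n222,\n333,PMC3\n")
mapping = load_pmid_to_pmcid(sample_csv)
found, missing = convert(["111", "222", "333"], mapping)
```

Reporting the PMIDs that could not be converted, rather than silently dropping them, makes it easier to see which articles have no full text in PMC.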

The script requires Python 3.5 or later. It depends on the standard library plus the packages declared in requirements.txt.

  • To install the dependencies you need the pip and venv Python modules.

    • pip is available in many Linux distributions (Ubuntu package python-pip, CentOS EPEL package python-pip), and also as pip Python package.
    • venv is also available in many Linux distributions (Ubuntu package python3-venv). In some of these distributions venv is integrated into the Python 3.5 (or later) installation.
  • To create a virtual environment and install the dependencies in it, run:

python3 -m venv .pyDBenv
source .pyDBenv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

The script is run from the command line and accepts the following options:


usage: parseXMLBioHackaton.py [-h] -d DATABASE -i INPUT

This program extracts sections and metadata of XML article from EuropePMC

options:
 -h, --help            show this help message and exit
 -d DATABASE, --database DATABASE
                       Required. Database Name where the data will be stored. Not possible to
                       update the database. It should have '.db' sufix
 -i INPUT, --input INPUT
                       Required. File with all the PMCID that will be inputed to the API. If you
                       write 'all', this script will parse all OpenAccess XML files
 --pmc                 Pass PMIDs instead of PMCIDs
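For readers who want to see how such an interface is typically declared, here is a sketch of an `argparse` parser that mirrors the options listed above. It is a reconstruction from the help text, not the script's actual source, and the file names `articles.db` and `pmcids.txt` are hypothetical.

```python
import argparse


def build_parser():
    """Declare the three options shown in the help text above."""
    parser = argparse.ArgumentParser(
        prog="parseXMLBioHackaton.py",
        description="Extract sections and metadata of XML articles from EuropePMC",
    )
    parser.add_argument(
        "-d", "--database", required=True,
        help="Name of the SQLite database file, ending in '.db'",
    )
    parser.add_argument(
        "-i", "--input", required=True,
        help="File listing PMCIDs, or 'all' to parse every Open Access XML file",
    )
    parser.add_argument(
        "--pmc", action="store_true",
        help="Treat the identifiers in the input file as PMIDs",
    )
    return parser


# Equivalent to: python3 parseXMLBioHackaton.py -d articles.db -i pmcids.txt
args = build_parser().parse_args(["-d", "articles.db", "-i", "pmcids.txt"])
```

With this declaration, omitting either `-d` or `-i` makes `argparse` exit with an error, which matches the "Required" wording in the help text.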
