Script to extract the text from different sections and other metadata of the available Full text XML files from EuropePMC

SergiAguilo/parseEuropePMC-FullXML

parseEuropePMC-FullXML

Script to extract the text from different sections and other metadata of the available Full text XML files from EuropePMC.

It extracts the content of the following sections as plain text: Introduction, Methods, Results and Discussion. The code can be modified to extract these sections in XML format instead. It also retrieves the Supplementary data in XML format.
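As an illustration of the section extraction described above, here is a minimal sketch using only the standard library. It is not the script's actual code: the sample XML, the `SECTION_KEYWORDS` mapping, the `extract_sections` function and its `as_xml` switch are all hypothetical, and it assumes JATS-style `<sec>` elements whose `<title>` names the section.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified JATS-style article (real files are much larger).
SAMPLE_XML = """<article>
  <body>
    <sec><title>Introduction</title><p>Background of the <italic>study</italic>.</p></sec>
    <sec><title>Materials and Methods</title><p>Samples were collected.</p></sec>
    <sec><title>Results</title><p>Effect observed.</p><p>Second paragraph.</p></sec>
    <sec><title>Discussion</title><p>Interpretation.</p></sec>
    <sec><title>Acknowledgements</title><p>Thanks.</p></sec>
  </body>
</article>"""

# Assumption: a section is recognised by a keyword in its lower-cased title.
SECTION_KEYWORDS = {
    "introduction": "Introduction",
    "method": "Methods",
    "result": "Results",
    "discussion": "Discussion",
}


def extract_sections(xml_text, as_xml=False):
    """Return {section name: content} for the four target sections."""
    root = ET.fromstring(xml_text)
    sections = {}
    for sec in root.findall("./body/sec"):
        title = (sec.findtext("title") or "").strip().lower()
        for keyword, name in SECTION_KEYWORDS.items():
            if keyword in title:
                if as_xml:
                    # Keep the markup of the whole <sec> element.
                    content = ET.tostring(sec, encoding="unicode").strip()
                else:
                    # Flatten every paragraph to text and normalise whitespace.
                    content = " ".join(
                        " ".join("".join(p.itertext()).split())
                        for p in sec.iter("p")
                    )
                sections[name] = content
                break
    return sections


sections = extract_sections(SAMPLE_XML)
print(sections["Results"])
```

Calling `extract_sections(SAMPLE_XML, as_xml=True)` returns the same sections with their XML markup kept, which mirrors the text/XML choice mentioned above.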

The following metadata is output: ISSN PPUB (print), ISSN EPUB (electronic), Journal Title and Publisher Name.
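The four metadata fields map naturally onto JATS elements inside `<journal-meta>`. The sketch below shows one way they might be read. The sample XML, the `extract_metadata` function and the dictionary keys are hypothetical, and the script itself may locate these fields differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical front matter with invented values, for illustration only.
SAMPLE_FRONT = """<article>
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Example Journal</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1234-5678</issn>
      <issn pub-type="epub">8765-4321</issn>
      <publisher><publisher-name>Example Publisher</publisher-name></publisher>
    </journal-meta>
  </front>
</article>"""


def extract_metadata(xml_text):
    """Return the four journal-level fields; a missing field is None."""
    root = ET.fromstring(xml_text)
    meta = root.find("./front/journal-meta")
    if meta is None:
        return dict.fromkeys(
            ("issn_ppub", "issn_epub", "journal_title", "publisher_name")
        )
    return {
        # The pub-type attribute distinguishes print from electronic ISSN.
        "issn_ppub": meta.findtext("issn[@pub-type='ppub']"),
        "issn_epub": meta.findtext("issn[@pub-type='epub']"),
        "journal_title": meta.findtext(".//journal-title"),
        "publisher_name": meta.findtext(".//publisher-name"),
    }


metadata = extract_metadata(SAMPLE_FRONT)
print(metadata)
```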

All the data is then stored in a SQLite database.
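To make the storage step concrete, the following sketch writes one article record to SQLite with the standard `sqlite3` module. The table name, column names and the `store_article` helper are assumptions made for illustration, and the PMCID and values are invented. The schema actually created by the script is not documented here and may differ.

```python
import sqlite3

# Hypothetical schema: one row per article, one column per field or section.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    pmcid          TEXT PRIMARY KEY,
    issn_ppub      TEXT,
    issn_epub      TEXT,
    journal_title  TEXT,
    publisher_name TEXT,
    introduction   TEXT,
    methods        TEXT,
    results        TEXT,
    discussion     TEXT
)
"""

INSERT = """
INSERT INTO articles VALUES (
    :pmcid, :issn_ppub, :issn_epub, :journal_title, :publisher_name,
    :introduction, :methods, :results, :discussion
)
"""


def store_article(conn, record):
    """Create the table if needed and insert one article record."""
    conn.execute(SCHEMA)
    conn.execute(INSERT, record)  # named placeholders are filled from the dict
    conn.commit()


# An in-memory database keeps the example free of side effects.
conn = sqlite3.connect(":memory:")
store_article(conn, {
    "pmcid": "PMC0000000",  # invented identifier
    "issn_ppub": "1234-5678",
    "issn_epub": "8765-4321",
    "journal_title": "Example Journal",
    "publisher_name": "Example Publisher",
    "introduction": "Background of the study.",
    "methods": "Samples were collected.",
    "results": "Effect observed.",
    "discussion": "Interpretation.",
})
row = conn.execute(
    "SELECT journal_title, results FROM articles WHERE pmcid = ?",
    ("PMC0000000",),
).fetchone()
print(row)
```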

Python 3.5 or later is needed. The script depends on standard libraries, plus the ones declared in requirements.txt.

  • To install the dependencies you need the pip and venv Python modules.

    • pip is available in many Linux distributions (Ubuntu package python-pip, CentOS EPEL package python-pip), and also as the pip Python package.
    • venv is also available in many Linux distributions (Ubuntu package python3-venv). In some of these distributions venv is integrated into the Python 3.5 (or later) installation.
  • To create a virtual environment and install the dependencies in it, run:

python3 -m venv .pyDBenv
source .pyDBenv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Run the script with the following options:


usage: parseXMLBioHackaton.py [-h] -d DATABASE -i INPUT

This program extracts sections and metadata of XML article from EuropePMC

options:
 -h, --help            show this help message and exit
 -d DATABASE, --database DATABASE
                       Required. Database Name where the data will be stored. Not possible to
                       update the database. It should have '.db' sufix
 -i INPUT, --input INPUT
                       Required. File with all the PMCID that will be inputed to the API. If you
                       write 'all', this script will parse all OpenAccess XML files
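The interface above could be reproduced with `argparse` roughly as follows. This is a sketch rather than the script's source: the shortened help strings, the `parse_pmcids` helper and the one-PMCID-per-line input format are assumptions. The `full_text_url` helper builds the Europe PMC REST full-text address, which is assumed (not confirmed) to be the API call the script makes.

```python
import argparse


def build_parser():
    """Recreate the two required options shown in the usage text."""
    parser = argparse.ArgumentParser(
        prog="parseXMLBioHackaton.py",
        description="This program extracts sections and metadata "
                    "of XML article from EuropePMC",
    )
    parser.add_argument("-d", "--database", required=True,
                        help="Name of the SQLite database file ('.db')")
    parser.add_argument("-i", "--input", required=True,
                        help="File with PMCIDs, or 'all'")
    return parser


def parse_pmcids(text):
    """Assumed input format: one PMCID per line; blank lines are skipped."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def full_text_url(pmcid):
    """Europe PMC REST address of the full-text XML (assumed to be the call used)."""
    # str.format keeps the sketch compatible with Python 3.5.
    return "https://www.ebi.ac.uk/europepmc/webservices/rest/{}/fullTextXML".format(pmcid)


args = build_parser().parse_args(["-d", "articles.db", "-i", "all"])
print(args.database, args.input)
for pmcid in parse_pmcids("PMC0000001\n\nPMC0000002\n"):
    print(full_text_url(pmcid))
```

A typical invocation would then look like `python3 parseXMLBioHackaton.py -d articles.db -i pmcids.txt`, where both file names are hypothetical.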
