parseEuropePMC-trimAl

Script that extracts the text of selected sections, plus additional metadata, from the full-text XML files available in EuropePMC.

It extracts the content of the following sections as plain text: Introduction, Methods, Results and Discussion. The code can be changed to extract these sections in XML format instead. It also retrieves the supplementary data in XML format.
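
As an illustration of the section extraction described above, here is a minimal sketch using only the standard library. It is not the script's actual code: the function name `extract_sections`, the rule of matching section names as case-insensitive substrings of each `<title>`, and the sample XML are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Section names taken from the README; the matching rule (case-insensitive
# substring of the <title> text) is an assumption made for this sketch.
WANTED = ("introduction", "methods", "results", "discussion")


def extract_sections(xml_text):
    """Return {section name: plain text} for the top-level body sections."""
    root = ET.fromstring(xml_text)
    sections = {}
    for sec in root.findall("./body/sec"):
        title = (sec.findtext("title") or "").strip().lower()
        for name in WANTED:
            if name in title and name not in sections:
                # itertext() flattens nested markup (e.g. <italic>) into text.
                paragraphs = ["".join(p.itertext()).strip() for p in sec.iter("p")]
                sections[name] = "\n".join(paragraphs)
    return sections


# Hypothetical, heavily simplified article used only to exercise the sketch.
SAMPLE = """<article><body>
<sec><title>Introduction</title><p>Alignment <italic>trimming</italic> matters.</p></sec>
<sec><title>Materials and Methods</title><p>We ran trimAl.</p></sec>
</body></article>"""

sections = extract_sections(SAMPLE)
print(sections["introduction"])
print(sections["methods"])
```

Real full-text files may nest sections or title them differently, so a production version would likely need more tolerant matching.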

The metadata output is the following: version, parameters, keywords and year.

All the data is then stored in a SQLite database.
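
The storage step could look roughly like the sketch below, which uses the standard `sqlite3` module. The table name `sections`, its columns, and the function `store_article` are hypothetical and chosen only for illustration; the schema actually created by the script may differ.

```python
import sqlite3


def store_article(conn, pmcid, year, sections):
    """Store one row per extracted section (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sections ("
        "pmcid TEXT, year INTEGER, name TEXT, content TEXT, "
        "PRIMARY KEY (pmcid, name))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO sections VALUES (?, ?, ?, ?)",
        [(pmcid, year, name, text) for name, text in sections.items()],
    )
    conn.commit()


# An in-memory database keeps the example self-contained; a real run would
# pass a file name ending in '.db' instead.
conn = sqlite3.connect(":memory:")
store_article(
    conn,
    "PMC0000000",  # placeholder identifier, not a real article
    2020,
    {"introduction": "Intro text", "methods": "Methods text"},
)
rows = conn.execute("SELECT name, content FROM sections ORDER BY name").fetchall()
print(rows)
```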

The only information needed is a list of PMCIDs, although PMIDs can also be used. In that case, the file that maps one identifier to the other must first be downloaded. The conversion is done automatically when the --pmc parameter is passed.
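
To make the PMID to PMCID conversion concrete, here is a minimal sketch. It assumes the mapping file is a CSV with `PMID` and `PMCID` header columns; that layout, the function names, and the sample identifiers are assumptions for this example, not details confirmed by the script.

```python
import csv
import io


def load_pmid_to_pmcid(handle):
    """Build a PMID -> PMCID dictionary from a CSV with PMID and PMCID columns."""
    mapping = {}
    for row in csv.DictReader(handle):
        # Skip rows where either identifier is missing.
        if row["PMID"] and row["PMCID"]:
            mapping[row["PMID"]] = row["PMCID"]
    return mapping


def convert(pmids, mapping):
    """Return the PMCIDs that were found and the PMIDs with no known PMCID."""
    found = [mapping[p] for p in pmids if p in mapping]
    missing = [p for p in pmids if p not in mapping]
    return found, missing


# Made-up rows standing in for the downloaded mapping file.
SAMPLE = io.StringIO("PMID,PMCID,DOI\n111,PMC900001,10.1000/a\n222,,10.1000/b\n")
mapping = load_pmid_to_pmcid(SAMPLE)
found, missing = convert(["111", "222", "333"], mapping)
print(found, missing)
```

Reporting the unmatched PMIDs separately makes it easy to see which articles have no full text to fetch.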

Python 3.5 or later is needed. The script depends on standard libraries, plus the ones declared in requirements.txt.

To install the dependencies you need the pip and venv Python modules.
- pip is available in many Linux distributions (Ubuntu package python-pip, CentOS EPEL package python-pip), and also as pip Python package.
- venv is also available in many Linux distributions (Ubuntu package python3-venv). In some of these distributions venv is integrated into the Python 3.5 (or later) installation.
To create a virtual environment and install the dependencies in it, run:

python3 -m venv .pyDBenv
source .pyDBenv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

The script is run with the following command-line options:

usage: parseXMLBioHackaton.py [-h] -d DATABASE -i INPUT

This program extracts sections and metadata of XML article from EuropePMC

options:
 -h, --help            show this help message and exit
 -d DATABASE, --database DATABASE
                       Required. Name of the database where the data will be stored. Not possible
                       to update the database. It should have the '.db' suffix
 -i INPUT, --input INPUT
                       Required. File with all the PMCIDs that will be passed to the API. If you
                       write 'all', this script will parse all OpenAccess XML files
 --pmc                 Pass PMIDs instead of PMCIDs
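
For readers who want to see how an interface like this is typically declared, the following is a sketch using `argparse`. It mirrors the options listed above but is not copied from the script's source; the function name `build_parser`, the shortened help strings, and the file names `articles.db` and `pmcids.txt` are placeholders.

```python
import argparse


def build_parser():
    """Declare options matching the help text above (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="parseXMLBioHackaton.py",
        description="This program extracts sections and metadata of XML "
        "article from EuropePMC",
    )
    parser.add_argument("-d", "--database", required=True,
                        help="Database name; should have the '.db' suffix")
    parser.add_argument("-i", "--input", required=True,
                        help="File with the PMCIDs, or 'all'")
    # store_true makes --pmc a flag that takes no value.
    parser.add_argument("--pmc", action="store_true",
                        help="Pass PMIDs instead of PMCIDs")
    return parser


# Parse an example command line instead of sys.argv so the sketch is testable.
args = build_parser().parse_args(["-d", "articles.db", "-i", "pmcids.txt", "--pmc"])
print(args.database, args.input, args.pmc)
```

The equivalent invocation from a shell would be `python3 parseXMLBioHackaton.py -d articles.db -i pmcids.txt --pmc`, again with placeholder file names.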

Repository: nicodr97/parseEuropePMC-trimAl
