A phenotype-based tool to annotate and prioritize disease variants in WES and WGS data
This user guide has been tested on Ubuntu 16.04.
For details regarding model training and evaluation, please refer to the dev/ directory above.
- At least 32 GB RAM.
- At least 1 TB of free disk space to process and store the databases required for annotation.
- Any Unix-based operating system
- Java 8
- Python 2.7 (as the system default version). Install the dependencies (for Python 2.7) with:
pip install -r requirements.txt
- Run the script test.py (available above) with Python 2 to test the installation of the Python dependencies. If the script fails, try installing the required dependencies again (for example, use "pip2" instead of "pip" or check your permissions), or use the Docker image instead.
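Because the tool expects Python 2.7 to be the interpreter it calls, it can help to confirm which version a given executable actually reports before running test.py. The following is a minimal sketch, not part of the distribution; the helper names `interpreter_version` and `is_python27` are hypothetical.

```python
import re
import subprocess


def interpreter_version(executable):
    """Return the (major, minor) version reported by `executable --version`."""
    # Python 2 prints its version to stderr and Python 3 to stdout,
    # so stderr is merged into the captured output.
    output = subprocess.check_output(
        [executable, "--version"], stderr=subprocess.STDOUT
    ).decode("utf-8", "replace")
    match = re.search(r"Python (\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("unexpected version output: %r" % output)
    return int(match.group(1)), int(match.group(2))


def is_python27(executable):
    """True if the given executable reports itself as Python 2.7."""
    return interpreter_version(executable) == (2, 7)
```

For example, `is_python27("python")` checks the interpreter found on the PATH under the name `python`, and `is_python27("/usr/bin/python")` checks an explicit path.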
- Download the distribution file phenomenet-vp-2.1.zip
- Download the data files phenomenet-vp-2.1-data.zip
- Extract the distribution file phenomenet-vp-2.1.zip
- Extract the data files data.tar.gz inside the directory phenomenet-vp-2.1
- cd phenomenet-vp-2.1
- Run the command:
bin/phenomenet-vp
to display help and parameters.
- Download the CADD database file.
- Download and run the script generate.sh (requires tabix).
- Copy the generated files cadd.txt.gz and cadd.txt.gz.tbi to the directory phenomenet-vp-2.1/data/db.
- Download the DANN database file and its index file to the directory phenomenet-vp-2.1/data/db.
- Rename the DANN files to dann.txt.gz and dann.txt.gz.tbi respectively.
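A quick way to confirm the database steps above were completed is to check that the four expected files are present in the data/db directory. This is a minimal sketch using only the file names given in those steps; the function name `missing_db_files` is hypothetical, and the check only tests for existence, not for file integrity.

```python
import os

# File names as given in the database setup steps.
REQUIRED_DB_FILES = (
    "cadd.txt.gz",
    "cadd.txt.gz.tbi",
    "dann.txt.gz",
    "dann.txt.gz.tbi",
)


def missing_db_files(db_dir):
    """Return the required database files that are absent from db_dir."""
    return [
        name for name in REQUIRED_DB_FILES
        if not os.path.isfile(os.path.join(db_dir, name))
    ]
```

Calling `missing_db_files("phenomenet-vp-2.1/data/db")` returns an empty list when all four files are in place.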
- Install Docker
- Download the data files phenomenet-vp-2.1-data.zip and database requirements
- Build phenomenet-vp docker image:
docker build -t phenomenet-vp .
- Run phenomenet-vp:
docker run -v $(pwd)/data:/data phenomenet-vp -f data/Miller.vcf -o OMIM:263750
--file, -f
Path to VCF file
--outfile, -of
Path to results file
--inh, -i
Mode of inheritance
Default: unknown
--json, -j
Path to PhenoTips JSON file containing phenotypes
--omim, -o
OMIM ID
--phenotypes, -p
List of phenotype ids separated by commas
--human, -h
Propagate human disease phenotypes to genes only
Default: false
--sp, -s
Propagate mouse and fish disease phenotypes to genes only
Default: false
--digenic, -d
Rank digenic combinations
Default: false
--trigenic, -t
Rank trigenic combinations
Default: false
--combination, -c
Maximum Number of variant combinations to prioritize (for digenic and
trigenic cases only)
Default: 1000
--ngenes, -n
Number of genes in oligogenic combinations (more than three)
Default: 4
--oligogenic, -og
Rank oligogenic combinations
Default: false
--python, -y
Path to Python executable (ex. /usr/bin/python)
Default: python
To run the tool, the user needs to provide a VCF file along with either an OMIM ID of the disease or a list of phenotypes (HPO or MPO terms).
a) Prioritize disease-causing variants using an OMIM ID:
bin/phenomenet-vp -f data/Miller.vcf -o OMIM:263750
b) Prioritize digenic disease-causing variants using an OMIM ID and gene-to-phenotype data from human studies only:
bin/phenomenet-vp -f data/Miller.vcf -o OMIM:263750 --human --digenic
c) Prioritize disease-causing variants using a set of phenotypes and recessive inheritance mode:
bin/phenomenet-vp -f data/Miller.vcf -p HP:0000007,HP:0000028,HP:0000054,HP:0000077,HP:0000175 -i recessive
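When the tool is driven from a script, the examples above can be assembled programmatically. The sketch below builds the argument list from the options documented in this guide and enforces the stated requirement that a VCF file be accompanied by an OMIM ID or a phenotype list. The function `build_pvp_command` and its parameter names are hypothetical, and it covers only the flags used in examples a) to c).

```python
def build_pvp_command(vcf, omim=None, phenotypes=None, inheritance=None,
                      human=False, digenic=False,
                      executable="bin/phenomenet-vp"):
    """Build the argument list for a phenomenet-vp run."""
    # The guide requires an OMIM ID or a list of phenotype IDs.
    # How the tool behaves when both are given is not described here.
    if not omim and not phenotypes:
        raise ValueError("provide an OMIM ID or a list of phenotype IDs")
    cmd = [executable, "-f", vcf]
    if omim:
        cmd += ["-o", omim]
    if phenotypes:
        # Phenotype IDs are passed as one comma-separated argument.
        cmd += ["-p", ",".join(phenotypes)]
    if inheritance:
        cmd += ["-i", inheritance]
    if human:
        cmd.append("--human")
    if digenic:
        cmd.append("--digenic")
    return cmd
```

The returned list can be passed to `subprocess.call` from inside the phenomenet-vp-2.1 directory; for instance, `build_pvp_command("data/Miller.vcf", omim="OMIM:263750", human=True, digenic=True)` reproduces example b).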
The result file is written to the directory containing the input file. It has the same name as the input file with a .res extension. For digenic, trigenic, or oligogenic prioritization, the result file has a .digenic, .trigenic, or .oligogenic extension respectively.
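To collect the outputs of a run, one can look next to the input VCF for files carrying the result extensions listed above. This sketch assumes nothing about whether the extension replaces .vcf or is appended after it, so it matches any file in the same directory whose name starts with the input's base name and ends with one of the four extensions. The function name `find_result_files` is hypothetical.

```python
import os

# Result extensions as listed in this guide.
RESULT_EXTENSIONS = (".res", ".digenic", ".trigenic", ".oligogenic")


def find_result_files(vcf_path):
    """Return result files located in the same directory as vcf_path."""
    directory = os.path.dirname(os.path.abspath(vcf_path))
    # Base name without its final extension, e.g. "Miller" for "Miller.vcf".
    stem = os.path.splitext(os.path.basename(vcf_path))[0]
    found = []
    for name in sorted(os.listdir(directory)):
        if name.startswith(stem) and name.endswith(RESULT_EXTENSIONS):
            found.append(os.path.join(directory, name))
    return found
```

Because the match is by prefix, inputs whose names share a prefix (for example `Miller.vcf` and `Miller2.vcf`) will also pick up each other's results; keep such inputs in separate directories if that matters.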
To analyze rare variants effectively, it is strongly recommended to filter the input VCF file by minor allele frequency (MAF) before running phenomenet-vp on it. To do so, follow the instructions below:
a) Install VCFtools.
b) Run the following command using VCFtools on your input VCF file to filter out variants with MAF > 1%:
vcftools --vcf input_file.vcf --recode --max-maf 0.01 --out filtered
c) Run PVP on the output file filtered.recode.vcf generated from the command above.
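Steps b) and c) can be chained in a script; the point to get right is that VCFtools writes its recoded output to the `--out` prefix followed by `.recode.vcf`, and that file is what PVP should read. The sketch below only builds the two argument lists. The function name `maf_filter_commands` is hypothetical, and the use of an OMIM ID in the second command is just one of the input options described earlier.

```python
def maf_filter_commands(input_vcf, omim, max_maf=0.01, out_prefix="filtered"):
    """Return (vcftools_cmd, pvp_cmd) for filtering by MAF and then running PVP."""
    # Step b): keep only variants with MAF <= max_maf.
    filter_cmd = [
        "vcftools", "--vcf", input_vcf, "--recode",
        "--max-maf", str(max_maf), "--out", out_prefix,
    ]
    # VCFtools names the recoded file <out_prefix>.recode.vcf.
    filtered_vcf = out_prefix + ".recode.vcf"
    # Step c): run PVP on the filtered file.
    pvp_cmd = ["bin/phenomenet-vp", "-f", filtered_vcf, "-o", omim]
    return filter_cmd, pvp_cmd
```

Each list can be run in order with `subprocess.check_call`, which stops the pipeline if the filtering step fails.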
The original random-forest-based PVP tool is available to download here along with its required data files here. The prepared set of exomes and genomes used for the analysis and results are provided here.
The updated neural-network model, DeepPVP, is available to download here along with its required data files here. The prepared set of exomes used for the analysis and the comparative results are provided here. The comparison with PVP is based on PVP-1.1, available here along with its required data files here.
OligoPVP is provided as part of the DeepPVP tool, using the parameters --digenic, --trigenic, and --oligogenic to rank candidate disease-causing variant pairs and triples. Our prepared set of synthetic genomes with digenic combinations is available here, using data from the DIgenic diseases DAtabase (DIDA). The comparison results with other methods are also provided. Results were obtained using DeepPVP v2.0.
PVP is jointly developed by researchers at the University of Birmingham (Prof George Gkoutos and his team), University of Cambridge (Dr Paul Schofield and his team), and King Abdullah University of Science and Technology (Prof Vladimir Bajic, Robert Hoehndorf, and teams).
[1] Boudellioua I, Mahamad Razali RB, Kulmanov M, Hashish Y, Bajic VB, Goncalves-Serra E, Schoenmakers N, Gkoutos GV, Schofield PN, and Hoehndorf R. (2017) Semantic prioritization of novel causative genomic variants. PLOS Computational Biology. https://doi.org/10.1371/journal.pcbi.1005500
[2] Boudellioua I, Kulmanov M, Schofield PN, Gkoutos GV, and Hoehndorf R. (2018) OligoPVP: Phenotype-driven analysis of individual genomic information to prioritize oligogenic disease variants. Scientific Reports. https://doi.org/10.1038/s41598-018-32876-3
[3] Boudellioua I, Kulmanov M, Schofield PN, Gkoutos GV, and Hoehndorf R. (2019) DeepPVP: phenotype-based prioritization of causative variants using deep learning. BMC Bioinformatics. https://doi.org/10.1186/s12859-019-2633-8
Copyright (c) 2016-2018, King Abdullah University of Science and Technology
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. All advertising materials mentioning features or use of this software must display the following acknowledgment: This product includes software developed by the King Abdullah University of Science and Technology.
4. Neither the name of the King Abdullah University of Science and Technology nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY KING ABDULLAH UNIVERSITY OF SCIENCE AND TECHNOLOGY ''AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL KING ABDULLAH UNIVERSITY OF SCIENCE AND TECHNOLOGY BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.