To convert a VCF into a MAF, each variant must be mapped to exactly one of the gene transcripts/isoforms that it might affect. This selection of a single effect per variant is often subjective, so this project attempts to make the selection criteria smarter, reproducible, and more configurable, with defaults that lean towards best practices. Per the current defaults, a single affected transcript is selected per variant as follows:
- Sort effects first by transcript biotype priority, then by effect severity, and finally by decreasing transcript length
- Pick the gene at the top of the list (worst-affected), and choose its canonical transcript (VEP picks the longest CCDS isoform)
- If the gene has no canonical transcript tagged (if you used snpEff), choose its longest transcript instead
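The selection steps above can be sketched with standard Unix tools. This is only an illustration, not vcf2maf's own code: it assumes a hypothetical tab-separated table in which biotype priority and effect severity have already been turned into numeric ranks (lower rank sorts first), and the function name pick_transcript, the gene and transcript IDs, and the column layout are all made up for the example.

```shell
# Hypothetical effects table for one variant, tab-separated:
# gene  transcript  biotype_rank  severity_rank  length  canonical
effects=$(printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  GENE_A TX_A1 1 3 2100 NO \
  GENE_B TX_B1 1 2 3200 NO \
  GENE_B TX_B2 1 2 1500 YES \
  GENE_C TX_C1 2 1 4000 YES)

pick_transcript() {
  # Reads an effects table on stdin and prints the selected transcript ID.
  local tab sorted top_gene picked
  tab=$(printf '\t')
  # Sort by biotype rank, then severity rank, then decreasing length.
  sorted=$(sort -t "$tab" -k3,3n -k4,4n -k5,5nr)
  # The gene on the first row is treated as the worst-affected one.
  top_gene=$(printf '%s\n' "$sorted" | head -n 1 | cut -f1)
  # Prefer that gene's canonical transcript, if one is tagged.
  picked=$(printf '%s\n' "$sorted" |
    awk -F '\t' -v g="$top_gene" '$1 == g && $6 == "YES" { print $2; exit }')
  # Otherwise fall back to the gene's longest transcript.
  if [ -z "$picked" ]; then
    picked=$(printf '%s\n' "$sorted" |
      awk -F '\t' -v g="$top_gene" '$1 == g' |
      sort -t "$tab" -k5,5nr | head -n 1 | cut -f2)
  fi
  printf '%s\n' "$picked"
}

printf '%s\n' "$effects" | pick_transcript
```

In this made-up table, GENE_B sorts to the top, and its canonical transcript is chosen even though a longer non-canonical transcript exists for the same gene.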
Download the latest version of vcf2maf (the master branch), and view the detailed usage manual:
curl -LO https://github.com/ckandoth/vcf2maf/archive/master.zip; unzip master.zip; cd vcf2maf-master
perl vcf2maf.pl --man
To download properly versioned releases, see the list of releases on the project's GitHub page.
If you don't have VEP or snpEff installed, see the sections below. VEP is preferred for its CLIA-compliant HGVS formats, and is used by default. After installing VEP, you can test the script like so:
perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.maf
If you'd rather use snpEff, there's an option for that:
perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.snpeff.maf --use-snpeff
If you already have a VCF annotated with either VEP or snpEff, you can use it directly. You should have run VEP with at least these options: --everything --check_existing --total_length --allele_number --xref_refseq. For snpEff, use these options: -hgvs -sequenceOntology. In older versions of snpEff, -sequenceOntology was incorrectly spelled -sequenceOntolgy. Feed your VEP/snpEff annotated VCFs into vcf2maf as follows:
perl vcf2maf.pl --input-vep data/test.vep.vcf --output-maf data/test.maf
perl vcf2maf.pl --input-snpeff data/test.snpeff.vcf --output-maf data/test.maf
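Before choosing between --input-vep and --input-snpeff, it can help to check which annotator wrote a given VCF. The sketch below is a convenience check and not part of vcf2maf. It relies on VEP declaring a CSQ INFO field in the VCF header, and on snpEff declaring EFF (or ANN in newer versions). The function name detect_annotator is made up for the example.

```shell
detect_annotator() {
  # $1 = path to a VCF. Only meta-information lines ("##...") are scanned.
  if grep '^##' "$1" | grep -q 'INFO=<ID=CSQ,'; then
    echo vep
  elif grep '^##' "$1" | grep -Eq 'INFO=<ID=(EFF|ANN),'; then
    echo snpeff
  else
    echo none
  fi
}

# Example call, run only if the test file from this repository is present.
if [ -f data/test.vep.vcf ]; then
  detect_annotator data/test.vep.vcf
fi
```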
To fill columns 16 and 17 of the output MAF with tumor/normal sample IDs, and to parse out genotypes and allele counts from matched genotype columns in the VCF, use options --tumor-id and --normal-id. Skip option --normal-id if you don't have a matched normal:
perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.maf --tumor-id WD1309 --normal-id NB1308
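To confirm that the IDs landed where expected, the distinct values of columns 16 and 17 (Tumor_Sample_Barcode and Matched_Norm_Sample_Barcode in the MAF specification) can be listed. A minimal sketch, assuming the MAF is tab-separated with optional leading comment lines that start with # followed by one column-header line; the function name maf_sample_ids is made up for the example.

```shell
maf_sample_ids() {
  # $1 = path to a MAF. Drop comment lines and the column-header line,
  # then print the distinct tumor/normal ID pairs from columns 16 and 17.
  grep -v '^#' "$1" | tail -n +2 | cut -f16,17 | sort -u
}

# Example call, run only if the MAF from the command above exists.
if [ -f data/test.maf ]; then
  maf_sample_ids data/test.maf
fi
```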
VCFs from variant callers like VarScan use hardcoded sample IDs TUMOR/NORMAL in the genotype columns of the VCF. To have this script parse the correct genotype columns while still printing the proper IDs in the output MAF, also specify --vcf-tumor-id and --vcf-normal-id:
perl vcf2maf.pl --input-vcf data/test_varscan.vcf --output-maf data/test_varscan.maf --tumor-id WD1309 --normal-id NB1308 --vcf-tumor-id TUMOR --vcf-normal-id NORMAL
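To see which sample IDs a VCF actually uses, look at its #CHROM header line, where genotype columns start at column 10 after the nine fixed VCF columns. The sketch below looks up the column number for a given ID. It illustrates the idea only and is not how vcf2maf itself is implemented; the function name sample_column is made up for the example.

```shell
sample_column() {
  # $1 = path to a VCF, $2 = sample ID as written in the #CHROM header line.
  # Prints the 1-based column number, or returns non-zero if the ID is absent.
  awk -F '\t' -v id="$2" '
    /^#CHROM/ {
      for (i = 10; i <= NF; i++) if ($i == id) { print i; found = 1 }
      exit
    }
    END { if (!found) exit 1 }
  ' "$1"
}

# Example call, run only if the VarScan test file is present.
if [ -f data/test_varscan.vcf ]; then
  sample_column data/test_varscan.vcf TUMOR || echo "TUMOR not found in header"
fi
```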
If you have VEP in a different folder like /opt/vep, and its cache in /srv/vep, there are options to point the script there. Similar options are available for snpEff:
perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.maf --vep-path /opt/vep --vep-data /srv/vep
perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.maf --snpeff-path /opt/snpEff --snpeff-data /opt/snpEff/data --use-snpeff
Ensembl's VEP (Variant Effect Predictor) is popular for how it selects a single "canonical transcript" per gene, its CLIA-compliant HGVS variant format, and its Sequence Ontology nomenclature for variant effects.
To follow these instructions, we'll assume you have these packaged essentials installed:
sudo yum install -y curl rsync tar make perl perl-core
## OR ##
sudo apt-get install -y curl rsync tar make perl perl-base
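If you are unsure whether these tools are already present, a quick check like the following sketch lists any that are missing from your PATH. The function name missing_tools is made up for the example, and the same check can be reused with other tool names.

```shell
missing_tools() {
  # Print the name of every argument that is not found as a command.
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools curl rsync tar make perl)
if [ -n "$missing" ]; then
  # Left unquoted on purpose so the names print on one line.
  echo "Missing tools:" $missing
else
  echo "All prerequisites found"
fi
```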
Handle VEP's Perl dependencies using cpanminus to install them under ~/perl5:
curl -L http://cpanmin.us | perl - --notest Archive::Extract Archive::Tar Archive::Zip LWP::Simple CGI DBI Time::HiRes
Set PERL5LIB to find those libraries. Also add this command to the end of your .bashrc to make it persistent:
export PERL5LIB=~/perl5/lib/perl5:~/perl5/lib/perl5/site_perl
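To confirm that Perl can find the modules with this PERL5LIB setting, one option is to try loading each of them. A minimal sketch: the function name missing_perl_modules is made up for the example, and the module list is copied from the cpanminus command above.

```shell
missing_perl_modules() {
  # Print each module that perl fails to load in the current environment.
  for module in "$@"; do
    perl -M"$module" -e 1 >/dev/null 2>&1 || printf '%s\n' "$module"
  done
}

# Prints nothing if every module loads.
missing_perl_modules Archive::Extract Archive::Tar Archive::Zip LWP::Simple CGI DBI Time::HiRes
```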
Create temporary shell variables pointing to where we'll store VEP and its cache data (non-default paths can be used, but specify --vep-path and --vep-data when running vcf2maf):
export VEP_PATH=~/vep
export VEP_DATA=~/.vep
Download the v79 release of VEP:
mkdir $VEP_PATH; cd $VEP_PATH
curl -LO https://github.com/Ensembl/ensembl-tools/archive/release/79.tar.gz
tar -zxf 79.tar.gz --starting-file variant_effect_predictor --transform='s|.*/|./|g'
Download and unpack VEP's offline cache for GRCh37 and GRCh38:
rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-79/variation/VEP/homo_sapiens_vep_79_GRCh{37,38}.tar.gz $VEP_DATA
cat $VEP_DATA/*.tar.gz | tar -izxf - -C $VEP_DATA
Install the Ensembl v79 API and download the reference FASTAs for GRCh37 and GRCh38:
cd $VEP_PATH
perl INSTALL.pl --AUTO af --SPECIES homo_sapiens --ASSEMBLY GRCh37 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
perl INSTALL.pl --AUTO af --SPECIES homo_sapiens --ASSEMBLY GRCh38 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
Convert the offline cache for use with tabix, which significantly speeds up the lookup of known variants:
perl convert_cache.pl --species homo_sapiens --version 79_GRCh37 --dir $VEP_DATA
perl convert_cache.pl --species homo_sapiens --version 79_GRCh38 --dir $VEP_DATA
Test running VEP in offline mode, on the provided sample GRCh37 and GRCh38 VCFs:
perl variant_effect_predictor.pl --offline --gencode_basic --everything --total_length --allele_number --no_escape --check_existing --xref_refseq --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/79_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa --assembly GRCh37 --input_file example_GRCh37.vcf --output_file example_GRCh37.vep.txt
perl variant_effect_predictor.pl --offline --gencode_basic --everything --total_length --allele_number --no_escape --check_existing --xref_refseq --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/79_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa --assembly GRCh38 --input_file example_GRCh38.vcf --output_file example_GRCh38.vep.txt
snpEff (snpeff.sourceforge.net) is popular because of its portability and speed at mapping effects on all possible transcripts in a database like Ensembl or RefSeq. It is downloadable as a Java archive, so make sure you have Java installed.
To follow these instructions, we'll assume you have these bare essentials installed:
curl unzip java
Create temporary shell variables pointing to where we'll store snpEff and its cache data (non-default paths can be used, but specify --snpeff-path and --snpeff-data when running vcf2maf):
export SNPEFF_PATH=~/snpEff
export SNPEFF_DATA=~/snpEff/data
Download the latest release of snpEff into your home directory:
mkdir $SNPEFF_PATH; cd $SNPEFF_PATH/..
curl -LO http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip
unzip snpEff_latest_core.zip
Import the Ensembl v75 (Gencode v19) database for GRCh37, and Ensembl v78 (Gencode v21) for GRCh38 (writes to snpEff/data by default):
cd $SNPEFF_PATH
java -Xmx2g -jar snpEff.jar download -dataDir $SNPEFF_DATA GRCh37.75
java -Xmx2g -jar snpEff.jar download -dataDir $SNPEFF_DATA GRCh38.78
Test running snpEff on any available GRCh37 and GRCh38 VCFs:
java -Xmx4g -jar snpEff.jar eff -dataDir $SNPEFF_DATA GRCh37.75 ~/vep/example_GRCh37.vcf > example_GRCh37.snpeff.vcf
java -Xmx4g -jar snpEff.jar eff -dataDir $SNPEFF_DATA GRCh38.78 ~/vep/example_GRCh38.vcf > example_GRCh38.snpeff.vcf
Cyriac Kandoth ([email protected])
Apache-2.0 | Apache License, Version 2.0 | https://www.apache.org/licenses/LICENSE-2.0