Releases: EBIvariation/vep-mapping-pipeline
Tests for the repeat expansion pipeline
- Implemented the tests
- Added a test dataset which includes all NT expansion variants from ClinVar. Records of special interest are moved to the top and annotated. See more details in
repeat_expansion_variants/tests/README.md
- Implemented the integration test using
pytest
- Added a test dataset which includes all NT expansion variants from ClinVar. Records of special interest are moved to the top and annotated. See more details in
- Small adjustments to the pipeline
- Fixed a regexp
- Modified column order
- Technical changes
- Updated configuration files:
.gitignore
,.idea
- Simplified installation instructions for Travis
- Added
setup.py
and__init__.py
to make the repository a proper package (required for running the tests)
- Updated configuration files:
Support for repeat expansion variants
- Implemented the pipeline for handling repeat expansion variants based on ClinVar's
variant_summary
TSV dumps. It performs the following steps:- Load and preprocess variants.
- Parse variant identifiers and extract transcript ID, repeat unit, and coordinate span (where available).
- Match each record to Ensembl gene ID and name.
- Determine variant type based on available information.
- Output the dataframe and the consequences table.
- Implemented supporting modules:
- biomart — to map different columns in the Ensembl Gene database to each other, such as getting Ensembl gene name based on its ID. This uses Ensembl's BioMart API. It's generally useful to have and can be reused for other purposes in the future.
- clinvar_identifier_parsing — to parse the various types of ClinVar variant identifiers for repeat expansion variants.
- Wrote the documentation.
- Technical changes:
- Reorganised directories, files & READMEs to clearly distinguish submodules.
- Streamlined installation instructions, added
requirements.txt
. - Committed IDEA project configuration files to synchronise project-level dictionaries and run profiles.
Disabled distance calculation, improved error handling & integration tests
Adjusted behaviour for upstream/downstream gene variants
When a distance to the gene is reported for upstream/downstream gene variants, it can lead to multiple output records for a single variant, when a gene which it affects has multiple transcription start or end sites. Because this functionality is currently not intended to be used, it has been disabled by default and can be enabled via --report-distance
flag.
Improved error handling
- The
@retry
decorator which handles VEP errors will now catch all possible exception types, not justrequests.HTTPError
, because more things can go wrong (e. g. protocol errors, server responds OK but the data is malformed). - The wrapper script now includes multiple options (
set -euxo pipefail
) which will lead to the script's failure in case any of its components fails. - In case at least one VEP worker fails (having exhausted the retry attempts), the entire workflow will be stopped immediately, and failure will be raised.
Integration tests
The simple integration test includes running a toy dataset of 2,000 variants (originally randomly sampled from ClinVar) through VEP and comparing the annotation result with the expected one. This means that when VEP updates, the test will break; however, this is exactly the intention, as in this case we will be able to compare the results, see if the changes make sense, and update the test data.
Disabled distant querying by default
The second iteration of the VEP querying process, which attempts to find distant variant consequences up to 500,000 bases from each gene, has been disabled by default and can be enabled using the --enable-distant-querying flag. This is because the consequences from this second iteration are not compatible with the standard Open Targets approach, which uses a special consequence class “nearest_gene_five_prime_end”.
Because ClinVar data contains very little number of records which would require such distant querying, reimplementation of this functionality was deemed unnecessary for now, hence the step is disabled.
Initial release
This pipeline originated as a Python port of the original Open Targets "SNP to gene" pipeline. An effort has been taken to retain backwards compatibility where possible; however, many important changes have been introduced.
Breaking changes
- Original pipeline supported both RS IDs and full variant description for querying. Because RS IDs are in general not allele specific, their support has been dropped. New pipeline only accepts complete, VCF-compatible variant descriptions.
- Input file for the original pipeline was a TSV consisting of 10 columns, most of which were not used, at least in the EVA/ClinVar use case. New pipeline uses a simpler VCF-derived format, which can be used to query VEP directly and can be easily produced from VCF (wrapper script does this already).
- Output format for the new pipeline is a 6 column TSV, mostly the same as for the old pipeline. Changes:
- Column 1 (variant identifier) is using a different, VCF-compatible notation.
- Column 5 (functional consequence): special consequence type “nearest_gene_five_prime_end” has been dropped and replaced by conventional “upstream_gene_variant” / “downstream_gene_variant”.
- Column 6 (distance from variant to gene) is now always non-negative, for both upstream and downstream gene variants. It also always denotes a distance to the gene as reported by VEP, not to the nearest gene 5' end.
Changes in handling upstream and downstream gene variants
The original mapping process was similar, but not identical, to the new one. The second step did not attempt to queue VEP, but instead searched for the nearest gene 5' end and, if found, output that gene with the special, non-standard consequence type of “nearest_gene_five_prime_end”, along with the computed distance.
However, this consequence type was not used (at least in the EVA/ClinVar use case), and the variants in question are essentially downstream/upstream variants. Hence, the new pipeline handles all downstream/upstream variants in a similar manner, and outputs distance for all of them.
Changes in handling severity of transcript consequences
The original pipeline contained a serious bug in determining the most severe consequence for a given variant. It worked in the following way:
- Query VEP with default parameters for a given variant, obtain a list of results.
- Take note of the
most_severe_consequence
reported by VEP. - Filter the list of results based on biotype, leaving only protein coding and miRNA transcripts.
- Output all consequences in the list from step 3, where type matches the
most_severe_consequence
determined during step 2.
The problem with this approach is that sometimes the most_severe_consequence
calculated by VEP comes from transcripts of other biotypes (not transcript coding or miRNA), which are filtered out during step 3. This results in the pipeline not outputting any results for such variants.
The new approach does not use VEP's most_severe_consequence
field. Instead, it first filters the consequences based on the list of acceptable biotypes, and then scans the list of the remaining ones the most severe consequence, based on the severity list described on Ensembl website. (This is the same list used by VEP internally.)
Technical changes (should not affect the results)
- Ensembl REST API (https://rest.ensembl.org/) is used instead of Perl API.
- VEP is queried with multiple variants at once (
vep/:species/region
), rather than querying them one by one (vep/:species/id/:id
andvep/:species/region/:region/:allele/
), which greatly speeds up the pipeline and lowers the strain on VEP servers.