Releases · EBIvariation/vep-mapping-pipeline

This repository has been archived by the owner on Sep 15, 2020. It is now read-only.

21 Apr 17:24

tskir

v1.1.0

83f82ee

Tests for the repeat expansion pipeline Latest

Latest

Implemented the tests
- Added a test dataset which includes all NT expansion variants from ClinVar. Records of special interest are moved to the top and annotated. See more details in repeat_expansion_variants/tests/README.md
- Implemented the integration test using pytest
Small adjustments to the pipeline
- Fixed a regexp
- Modified column order
Technical changes
- Updated configuration files: .gitignore, .idea
- Simplified installation instructions for Travis
- Added setup.py and __init__.py to make the repository a proper package (required for running the tests)

Assets 2

21 Apr 10:27

tskir

v1.0.0

64a0b8a

Support for repeat expansion variants

Implemented the pipeline for handling repeat expansion variants based on ClinVar's variant_summary TSV dumps. It performs the following steps:
- Load and preprocess variants.
- Parse variant identifiers and extract transcript ID, repeat unit, and coordinate span (where available).
- Match each record to Ensembl gene ID and name.
- Determine variant type based on available information.
- Output the dataframe and the consequences table.
Implemented supporting modules:
- biomart — to map different columns in the Ensembl Gene database to each other, such as getting Ensembl gene name based on its ID. This uses Ensembl's BioMart API. It's generally useful to have and can be reused for other purposes in the future.
- clinvar_identifier_parsing — to parse the various types of ClinVar variant identifiers for repeat expansion variants.
Wrote the documentation.
Technical changes:
- Reorganised directories, files & READMEs to clearly distinguish submodules.
- Streamlined installation instructions, added requirements.txt.
- Committed IDEA project configuration files to synchronise project-level dictionaries and run profiles.

Assets 2

24 Feb 14:55

tskir

v0.3.0

65f8346

Disabled distance calculation, improved error handling & integration tests Pre-release

Pre-release

Adjusted behaviour for upstream/downstream gene variants

When a distance to the gene is reported for upstream/downstream gene variants, it can lead to multiple output records for a single variant, when a gene which it affects has multiple transcription start or end sites. Because this functionality is currently not intended to be used, it has been disabled by default and can be enabled via --report-distance flag.

Improved error handling

The @retry decorator which handles VEP errors will now catch all possible exception types, not just requests.HTTPError, because more things can go wrong (e. g. protocol errors, server responds OK but the data is malformed).
The wrapper script now includes multiple options (set -euxo pipefail) which will lead to the script's failure in case any of its components fails.
In case at least one VEP worker fails (having exhausted the retry attempts), the entire workflow will be stopped immediately, and failure will be raised.

Integration tests

The simple integration test includes running a toy dataset of 2,000 variants (originally randomly sampled from ClinVar) through VEP and comparing the annotation result with the expected one. This means that when VEP updates, the test will break; however, this is exactly the intention, as in this case we will be able to compare the results, see if the changes make sense, and update the test data.

Assets 2

18 Feb 19:26

tskir

v0.2.0

641bb11

Disabled distant querying by default Pre-release

Pre-release

The second iteration of the VEP querying process, which attempts to find distant variant consequences up to 500,000 bases from each gene, has been disabled by default and can be enabled using the --enable-distant-querying flag. This is because the consequences from this second iteration are not compatible with the standard Open Targets approach, which uses a special consequence class “nearest_gene_five_prime_end”.

Because ClinVar data contains very little number of records which would require such distant querying, reimplementation of this functionality was deemed unnecessary for now, hence the step is disabled.

Assets 2

17 Feb 12:11

tskir

v0.1.0

68245c8

Initial release Pre-release

Pre-release

This pipeline originated as a Python port of the original Open Targets "SNP to gene" pipeline. An effort has been taken to retain backwards compatibility where possible; however, many important changes have been introduced.

Breaking changes

Original pipeline supported both RS IDs and full variant description for querying. Because RS IDs are in general not allele specific, their support has been dropped. New pipeline only accepts complete, VCF-compatible variant descriptions.
Input file for the original pipeline was a TSV consisting of 10 columns, most of which were not used, at least in the EVA/ClinVar use case. New pipeline uses a simpler VCF-derived format, which can be used to query VEP directly and can be easily produced from VCF (wrapper script does this already).
Output format for the new pipeline is a 6 column TSV, mostly the same as for the old pipeline. Changes:
- Column 1 (variant identifier) is using a different, VCF-compatible notation.
- Column 5 (functional consequence): special consequence type “nearest_gene_five_prime_end” has been dropped and replaced by conventional “upstream_gene_variant” / “downstream_gene_variant”.
- Column 6 (distance from variant to gene) is now always non-negative, for both upstream and downstream gene variants. It also always denotes a distance to the gene as reported by VEP, not to the nearest gene 5' end.

Changes in handling upstream and downstream gene variants

The original mapping process was similar, but not identical, to the new one. The second step did not attempt to queue VEP, but instead searched for the nearest gene 5' end and, if found, output that gene with the special, non-standard consequence type of “nearest_gene_five_prime_end”, along with the computed distance.

However, this consequence type was not used (at least in the EVA/ClinVar use case), and the variants in question are essentially downstream/upstream variants. Hence, the new pipeline handles all downstream/upstream variants in a similar manner, and outputs distance for all of them.

Changes in handling severity of transcript consequences

The original pipeline contained a serious bug in determining the most severe consequence for a given variant. It worked in the following way:

Query VEP with default parameters for a given variant, obtain a list of results.
Take note of the most_severe_consequence reported by VEP.
Filter the list of results based on biotype, leaving only protein coding and miRNA transcripts.
Output all consequences in the list from step 3, where type matches the most_severe_consequence determined during step 2.

The problem with this approach is that sometimes the most_severe_consequence calculated by VEP comes from transcripts of other biotypes (not transcript coding or miRNA), which are filtered out during step 3. This results in the pipeline not outputting any results for such variants.

The new approach does not use VEP's most_severe_consequence field. Instead, it first filters the consequences based on the list of acceptable biotypes, and then scans the list of the remaining ones the most severe consequence, based on the severity list described on Ensembl website. (This is the same list used by VEP internally.)

Technical changes (should not affect the results)

Ensembl REST API (https://rest.ensembl.org/) is used instead of Perl API.
VEP is queried with multiple variants at once (vep/:species/region), rather than querying them one by one (vep/:species/id/:id and vep/:species/region/:region/:allele/), which greatly speeds up the pipeline and lowers the strain on VEP servers.

Assets 2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adjusted behaviour for upstream/downstream gene variants

Improved error handling

Integration tests

Breaking changes

Changes in handling upstream and downstream gene variants

Changes in handling severity of transcript consequences

Technical changes (should not affect the results)

Releases: EBIvariation/vep-mapping-pipeline

Tests for the repeat expansion pipeline

Support for repeat expansion variants

Disabled distance calculation, improved error handling & integration tests

Adjusted behaviour for upstream/downstream gene variants

Improved error handling

Integration tests

Disabled distant querying by default

Initial release

Breaking changes

Changes in handling upstream and downstream gene variants

Changes in handling severity of transcript consequences

Technical changes (should not affect the results)