Skip to content
This repository has been archived by the owner on Sep 15, 2020. It is now read-only.

Commit

Permalink
Merge pull request #1 from tskir/eva-1831-distal-variants
Browse files Browse the repository at this point in the history
EVA-1831 — Disable distant querying by default
  • Loading branch information
tskir authored Feb 18, 2020
2 parents 68245c8 + 742307d commit 641bb11
Show file tree
Hide file tree
Showing 2 changed files with 46 additions and 25 deletions.
12 changes: 7 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ The pipeline uses succinct, VCF-compatible variant identifiers, in the format of
### Output file
Output is a TSV file consisting of six columns:
1. Variant identifier: the same one as in the input files.
2. The second column is not used and is always set to 1. It is retained for compabitility purposes (see more about that below).
2. The second column is not used and is always set to 1. It is retained for compabitility purposes (see more about that in the release notes).
3. Ensembl gene ID.
4. Ensembl gene name.
5. Most severe functional consequence of that variant for that gene.
Expand All @@ -33,15 +33,15 @@ Example (note that tabs have been replaced with spaces here for readability):
```

### Running the script directly (on a small number of entries, for testing/debugging only)
The pipeline core module is [consequence_mapping.py](/bin/consequence_mapping/consequence_mapping.py). It reads data from STDIN and writes to STDOUT in the formats described above. It can be run as follows:
The pipeline core module is [consequence_mapping.py](/vep_mapping_pipeline/consequence_mapping.py). It reads data from STDIN and writes to STDOUT in the formats described above. It can be run as follows:
```bash
python3 consequence_mapping.py <input_variants.txt >output_mappings.tsv
```

This should only be done for testing purposes and only for a small number of variants, because when querying VEP API, the script submits of all of the variants it receives in a single query (this is more efficient than submitting them one by one).

### Running the pipeline using a wrapper script
In production environment the pipeline should be run using a wrapper script which would take care of preprocessing and parallelisation. There is a simple wrapper script available, [run_consequence_mapping.sh](/bin/consequence_mapping/run_consequence_mapping.sh). It can be run as follows:
In production environment the pipeline should be run using a wrapper script which would take care of preprocessing and parallelisation. There is a simple wrapper script available, [run_consequence_mapping.sh](/vep_mapping_pipeline/run_consequence_mapping.sh). It can be run as follows:
```bash
bash run_consequence_mapping.sh input_variants.vcf output_mappings.tsv
```
Expand All @@ -60,8 +60,10 @@ The wrapper script depends on `bcftools` and `parallel` (GNU Parallel).

## Mapping process
For each variant:
* Query VEP with default distance to the gene (5,000 bases either way). Consider only consequences which affect **either a protein coding transcript or a miRNA**. Find the most severe consequence for either of those transcripts, according to the list of priorities described [on Ensembl website](https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html). Output **all non-redundant** consequences of this (most severe) type.
* If we found nothing during the previous step, repeat VEP search for this variant, now with a distance up to 500,000 bases either way. If there are any consequences which affect a **protein coding** transcript, choose the most severe consequence type (usually this will be either an upstream or a downstream gene variant) and output **a single consequence with the smallest distance**.
1. Query VEP with default distance to the gene (5,000 bases either way). Consider only consequences which affect **either a protein coding transcript or a miRNA**. Find the most severe consequence for either of those transcripts, according to the list of priorities described [on Ensembl website](https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html). Output **all non-redundant** consequences of this (most severe) type.
2. *Optionally:* if we found nothing during the previous step, repeat VEP search for this variant, now with a distance up to 500,000 bases either way. If there are any consequences which affect a **protein coding** transcript, choose the most severe consequence type (usually this will be either an upstream or a downstream gene variant) and output **a single consequence with the smallest distance**.

Distant querying (Step 2), while disabled by default, can be enabled by appending `--enable-distant-querying` flag to the core pipeline.

## Note on porting & changes from the original pipeline
This pipeline originated as a Python port of the original Open Targets ["SNP to gene"](https://github.com/opentargets/snp_to_gene) pipeline. An effort has been taken to retain backwards compatibility where possible; however, many important changes have been introduced. Please see the release notes for description of those changes.
Expand Down
59 changes: 39 additions & 20 deletions vep_mapping_pipeline/consequence_mapping.py
Original file line number Diff line number Diff line change
@@ -1,10 +1,8 @@
#!/usr/bin/env python3
"""Pipeline for mapping variants to the genes they affect and their functional consequences, using Ensembl VEP API. For
documentation, refer to /README.md"""

"""
Pipeline for mapping variants to the genes they affect and their functional consequences, using Ensembl VEP API.
For documentation, refer to /README.md
"""

import argparse
import itertools
import json
import logging
Expand All @@ -14,6 +12,12 @@

from retry import retry

parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
'--enable-distant-querying', action='store_true',
help='Enables a second iteration of querying VEP for distant gene variants, which is disabled by default'
)

logging.basicConfig()
logger = logging.getLogger('consequence_mapping')
logger.setLevel(logging.INFO)
Expand Down Expand Up @@ -121,9 +125,14 @@ def get_variants_without_consequences(results_by_variant):
})


def process_variants(variants):
def process_variants(variants, enable_distant_querying=False):
"""Given a list of variant IDs, return a list of consequence types (each including Ensembl gene name & ID and a
functional consequence code) for a given variant."""
functional consequence code) for a given variant.
Args:
enable_distant_querying: If set to True, an additional VEP query will be performed for variants for which no
consequences were found during the first iteration, in an attempt to find distant variant consequences.
"""

# First, we query VEP with default parameters, looking for variants affecting protein coding and miRNA transcripts
# up to a standard distance (5000 nucleotides either way, which is default for VEP) from the variant.
Expand All @@ -134,21 +143,27 @@ def process_variants(variants):

# See if there are variants with no consequences up to the default distance
variants_without_consequences = get_variants_without_consequences(results_by_variant)
# If there are, we will now do a second round of querying, this time looking only at protein coding biotypes (vs.
# miRNA *and* protein coding during the first round) up to a distance of 500,000 bases each way.
if variants_without_consequences:
logger.info('Found {} variant(s) without standard consequences: {}. Querying distant regions'.format(
logger.info('Found {} variant(s) without standard consequences: {}'.format(
len(variants_without_consequences), '|'.join(variants_without_consequences)))
distant_vep_results = query_vep(variants=variants_without_consequences, search_distance=VEP_LONG_QUERY_DISTANCE)
extract_consequences(vep_results=distant_vep_results, acceptable_biotypes={'protein_coding'},
only_closest=True, results_by_variant=results_by_variant)

# See if there are still variants with no consequences, even up to a wide search window
variants_without_consequences = get_variants_without_consequences(results_by_variant)
if variants_without_consequences:
logger.info('After distant querying, still remaining {} variant(s) without consequences: {}'.format(
len(variants_without_consequences), '|'.join(variants_without_consequences)
))
if enable_distant_querying:
logger.info('Attempting to find distant consequences for the remaining variants')

# If there are, we will now do a second round of querying, this time looking only at protein coding biotypes
# (vs. miRNA *and* protein coding during the first round) up to a distance of 500,000 bases each way.
if variants_without_consequences:
distant_vep_results = query_vep(variants=variants_without_consequences,
search_distance=VEP_LONG_QUERY_DISTANCE)
extract_consequences(vep_results=distant_vep_results, acceptable_biotypes={'protein_coding'},
only_closest=True, results_by_variant=results_by_variant)

# See if there are still variants with no consequences, even up to a wide search window
variants_without_consequences = get_variants_without_consequences(results_by_variant)
if variants_without_consequences:
logger.info('After distant querying, still remaining {} variant(s) without consequences: {}'.format(
len(variants_without_consequences), '|'.join(variants_without_consequences)
))

# Yield all consequences for all variants. Note they are not grouped by variant, all consequences are yielded in a
# common sequence.
Expand All @@ -158,11 +173,15 @@ def process_variants(variants):


def main():
# Parse command line arguments
args = parser.parse_args()

# Load variants to query from STDIN
variants_to_query = [colon_based_id_to_vep_id(v) for v in sys.stdin.read().splitlines()]

# Query VEP with all variants at once (for the purpose of efficiency), print out the consequences to STDOUT.
for variant_id, gene_id, gene_symbol, consequence_term, distance in process_variants(variants_to_query):
consequences = process_variants(variants_to_query, enable_distant_querying=args.enable_distant_querying)
for variant_id, gene_id, gene_symbol, consequence_term, distance in consequences:
# The second column, set statically to 1, is not used, and is maintained for compatibility purposes
print('\t'.join([vep_id_to_colon_id(variant_id), '1', gene_id, gene_symbol, consequence_term, str(distance)]))

Expand Down

0 comments on commit 641bb11

Please sign in to comment.