Merge pull request #1 from tskir/eva-1831-distal-variants

EVA-1831 — Disable distant querying by default
EBIvariation · Feb 18, 2020 · 641bb11 · 641bb11
2 parents 68245c8 + 742307d
commit 641bb11
Show file tree

Hide file tree

Showing 2 changed files with 46 additions and 25 deletions.
diff --git a/README.md b/README.md
@@ -17,7 +17,7 @@ The pipeline uses succinct, VCF-compatible variant identifiers, in the format of
 ### Output file
 Output is a TSV file consisting of six columns:
 1. Variant identifier: the same one as in the input files.
-2. The second column is not used and is always set to 1. It is retained for compabitility purposes (see more about that below).
+2. The second column is not used and is always set to 1. It is retained for compabitility purposes (see more about that in the release notes).
 3. Ensembl gene ID.
 4. Ensembl gene name.
 5. Most severe functional consequence of that variant for that gene.
@@ -33,15 +33,15 @@ Example (note that tabs have been replaced with spaces here for readability):
 ```
 
 ### Running the script directly (on a small number of entries, for testing/debugging only)
-The pipeline core module is [consequence_mapping.py](/bin/consequence_mapping/consequence_mapping.py). It reads data from STDIN and writes to STDOUT in the formats described above. It can be run as follows:
+The pipeline core module is [consequence_mapping.py](/vep_mapping_pipeline/consequence_mapping.py). It reads data from STDIN and writes to STDOUT in the formats described above. It can be run as follows:
 ```bash
 python3 consequence_mapping.py <input_variants.txt >output_mappings.tsv
 ```
 
 This should only be done for testing purposes and only for a small number of variants, because when querying VEP API, the script submits of all of the variants it receives in a single query (this is more efficient than submitting them one by one).
 
 ### Running the pipeline using a wrapper script
-In production environment the pipeline should be run using a wrapper script which would take care of preprocessing and parallelisation. There is a simple wrapper script available, [run_consequence_mapping.sh](/bin/consequence_mapping/run_consequence_mapping.sh). It can be run as follows:
+In production environment the pipeline should be run using a wrapper script which would take care of preprocessing and parallelisation. There is a simple wrapper script available, [run_consequence_mapping.sh](/vep_mapping_pipeline/run_consequence_mapping.sh). It can be run as follows:
 ```bash
 bash run_consequence_mapping.sh input_variants.vcf output_mappings.tsv
 ``` 
@@ -60,8 +60,10 @@ The wrapper script depends on `bcftools` and `parallel` (GNU Parallel).
 
 ## Mapping process
 For each variant:
-* Query VEP with default distance to the gene (5,000 bases either way). Consider only consequences which affect **either a protein coding transcript or a miRNA**. Find the most severe consequence for either of those transcripts, according to the list of priorities described [on Ensembl website](https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html). Output **all non-redundant** consequences of this (most severe) type.
-* If we found nothing during the previous step, repeat VEP search for this variant, now with a distance up to 500,000 bases either way. If there are any consequences which affect a **protein coding** transcript, choose the most severe consequence type (usually this will be either an upstream or a downstream gene variant) and output **a single consequence with the smallest distance**.
+1. Query VEP with default distance to the gene (5,000 bases either way). Consider only consequences which affect **either a protein coding transcript or a miRNA**. Find the most severe consequence for either of those transcripts, according to the list of priorities described [on Ensembl website](https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html). Output **all non-redundant** consequences of this (most severe) type.
+2. *Optionally:* if we found nothing during the previous step, repeat VEP search for this variant, now with a distance up to 500,000 bases either way. If there are any consequences which affect a **protein coding** transcript, choose the most severe consequence type (usually this will be either an upstream or a downstream gene variant) and output **a single consequence with the smallest distance**.
+
+Distant querying (Step 2), while disabled by default, can be enabled by appending `--enable-distant-querying` flag to the core pipeline.
 
 ## Note on porting & changes from the original pipeline
 This pipeline originated as a Python port of the original Open Targets ["SNP to gene"](https://github.com/opentargets/snp_to_gene) pipeline. An effort has been taken to retain backwards compatibility where possible; however, many important changes have been introduced. Please see the release notes for description of those changes.

diff --git a/vep_mapping_pipeline/consequence_mapping.py b/vep_mapping_pipeline/consequence_mapping.py
@@ -1,10 +1,8 @@
 #!/usr/bin/env python3
+"""Pipeline for mapping variants to the genes they affect and their functional consequences, using Ensembl VEP API. For
+documentation, refer to /README.md"""
 
-"""
-Pipeline for mapping variants to the genes they affect and their functional consequences, using Ensembl VEP API.
-For documentation, refer to /README.md
-"""
-
+import argparse
 import itertools
 import json
 import logging
@@ -14,6 +12,12 @@
 
 from retry import retry
 
+parser = argparse.ArgumentParser(description=__doc__)
+parser.add_argument(
+    '--enable-distant-querying', action='store_true',
+    help='Enables a second iteration of querying VEP for distant gene variants, which is disabled by default'
+)
+
 logging.basicConfig()
 logger = logging.getLogger('consequence_mapping')
 logger.setLevel(logging.INFO)
@@ -121,9 +125,14 @@ def get_variants_without_consequences(results_by_variant):
     })
 
 
-def process_variants(variants):
+def process_variants(variants, enable_distant_querying=False):
     """Given a list of variant IDs, return a list of consequence types (each including Ensembl gene name & ID and a
-    functional consequence code) for a given variant."""
+    functional consequence code) for a given variant.
+
+    Args:
+        enable_distant_querying: If set to True, an additional VEP query will be performed for variants for which no
+            consequences were found during the first iteration, in an attempt to find distant variant consequences.
+    """
 
     # First, we query VEP with default parameters, looking for variants affecting protein coding and miRNA transcripts
     # up to a standard distance (5000 nucleotides either way, which is default for VEP) from the variant.
@@ -134,21 +143,27 @@ def process_variants(variants):
 
     # See if there are variants with no consequences up to the default distance
     variants_without_consequences = get_variants_without_consequences(results_by_variant)
-    # If there are, we will now do a second round of querying, this time looking only at protein coding biotypes (vs.
-    # miRNA *and* protein coding during the first round) up to a distance of 500,000 bases each way.
     if variants_without_consequences:
-        logger.info('Found {} variant(s) without standard consequences: {}. Querying distant regions'.format(
+        logger.info('Found {} variant(s) without standard consequences: {}'.format(
             len(variants_without_consequences), '|'.join(variants_without_consequences)))
-        distant_vep_results = query_vep(variants=variants_without_consequences, search_distance=VEP_LONG_QUERY_DISTANCE)
-        extract_consequences(vep_results=distant_vep_results, acceptable_biotypes={'protein_coding'},
-                             only_closest=True, results_by_variant=results_by_variant)
 
-    # See if there are still variants with no consequences, even up to a wide search window
-    variants_without_consequences = get_variants_without_consequences(results_by_variant)
-    if variants_without_consequences:
-        logger.info('After distant querying, still remaining {} variant(s) without consequences: {}'.format(
-            len(variants_without_consequences), '|'.join(variants_without_consequences)
-        ))
+        if enable_distant_querying:
+            logger.info('Attempting to find distant consequences for the remaining variants')
+
+            # If there are, we will now do a second round of querying, this time looking only at protein coding biotypes
+            # (vs. miRNA *and* protein coding during the first round) up to a distance of 500,000 bases each way.
+            if variants_without_consequences:
+                distant_vep_results = query_vep(variants=variants_without_consequences,
+                                                search_distance=VEP_LONG_QUERY_DISTANCE)
+                extract_consequences(vep_results=distant_vep_results, acceptable_biotypes={'protein_coding'},
+                                     only_closest=True, results_by_variant=results_by_variant)
+
+            # See if there are still variants with no consequences, even up to a wide search window
+            variants_without_consequences = get_variants_without_consequences(results_by_variant)
+            if variants_without_consequences:
+                logger.info('After distant querying, still remaining {} variant(s) without consequences: {}'.format(
+                    len(variants_without_consequences), '|'.join(variants_without_consequences)
+                ))
 
     # Yield all consequences for all variants. Note they are not grouped by variant, all consequences are yielded in a
     # common sequence.
@@ -158,11 +173,15 @@ def process_variants(variants):
 
 
 def main():
+    # Parse command line arguments
+    args = parser.parse_args()
+
     # Load variants to query from STDIN
     variants_to_query = [colon_based_id_to_vep_id(v) for v in sys.stdin.read().splitlines()]
 
     # Query VEP with all variants at once (for the purpose of efficiency), print out the consequences to STDOUT.
-    for variant_id, gene_id, gene_symbol, consequence_term, distance in process_variants(variants_to_query):
+    consequences = process_variants(variants_to_query, enable_distant_querying=args.enable_distant_querying)
+    for variant_id, gene_id, gene_symbol, consequence_term, distance in consequences:
         # The second column, set statically to 1, is not used, and is maintained for compatibility purposes
         print('\t'.join([vep_id_to_colon_id(variant_id), '1', gene_id, gene_symbol, consequence_term, str(distance)]))