Variant Annotation

javild edited this page Nov 17, 2016 · 25 revisions

Overview

CellBase uses its integrated data to provide a rich, high-performance variant annotator. The variant annotation tool is part of the CellBase code base and can be accessed in two ways:

  • Using remote RESTful web services: both GET and POST annotation web services are available (see http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/). Because no local installation of the knowledge base is needed, users do not have to store hundreds of gigabytes of data (about 900 GB in the current release, v4) and always query the most recently updated data. Annotation results from the web services are returned as JSON objects.
  • Using the Java command line: the current Java CLI can either connect to the remote web services or fetch annotation data directly from a custom installation of the database. Even when connecting to remote web services, the annotation CLI provides a lightweight, efficient, multi-threaded implementation which outperforms other local variant annotators (see Benchmark results below).
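
As an illustration of the web service route, the sketch below builds a GET request for a short list of variants and decodes the JSON reply. The base URL comes from this page. The path layout (`/rest/<version>/<species>/genomic/variant/<variants>/annotation`), the `chromosome:position:ref:alt` notation and the function names are assumptions made for this example and should be checked against the web services documentation linked above.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base URL taken from this page; "/rest" and the path below are assumptions.
BASE_URL = "http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest"


def annotation_url(variants, version="v4", species="hsapiens"):
    """Build a GET URL for a short list of variants.

    Each variant is assumed to be written as 'chromosome:position:ref:alt'.
    """
    # Keep ':' unescaped inside each variant and join the variants with commas.
    ids = ",".join(quote(v, safe=":") for v in variants)
    return f"{BASE_URL}/{version}/{species}/genomic/variant/{ids}/annotation"


def fetch_annotation(variants, timeout=30):
    """Send the GET request and decode the JSON body (needs network access)."""
    with urlopen(annotation_url(variants), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL; call fetch_annotation(...) to actually query the server.
    print(annotation_url(["19:45411941:T:C"]))
```

For more than a handful of variants, the POST service or the Java CLI is likely the better fit, since a GET URL can only hold a limited number of variants.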

The typical input for the CellBase variant annotator is a VCF file, although the CLI can also take a short list of variants directly as an argument for fast annotation. The annotator can currently generate two output formats: a .json file with a list of VariantAnnotation objects (see the Variant and VariantAnnotation models at https://github.com/opencb/biodata/tree/develop/biodata-models/src/main/resources/avro), or a tab-separated values file in the VEP output format.

Data sources

Data provided by the variant annotator is the result of integrating most of the annotations available in the CellBase knowledge base: Ensembl's core transcript annotation, such as location, id, strand, biotype, etc.; protein annotation provided by UniProt, InterPro, SIFT and PolyPhen; population frequencies provided by the European Variation Archive for The 1000 Genomes Project Phase 3, The Exome Variant Server (EVS), The Exome Aggregation Consortium v3 (ExAC) and The Genome of the Netherlands (GoNL); sequence conservation from PhastCons and PhyloP; gene expression values from The Gene Expression Atlas and The Genotype-Tissue Expression project (GTEx); gene-drug interaction data from The Drug Gene Interaction Database (DGIdb) and gene-phenotype associations from the Human Phenotype Ontology database (HPO); clinical variant annotation from ClinVar and The Catalogue of Somatic Mutations in Cancer (COSMIC). Sequence effect predictions are also calculated on the fly and described by Sequence Ontology (SO) terms. We are constantly working to integrate new data sources into the knowledge base.

Benchmark

An exhaustive comparison of sequence effect predictions was made against VEP (83) results for the whole 1000 Genomes Phase 3 variant set (83 million variants, 346 million effect predictions), yielding 99.999% concordance with Ensembl VEP consequence types.

  • VEP annotations: 346M
  • CellBase annotations: 346M
  • Coincidence at SO term level (346M annotations):
    • Annotations provided by VEP and not provided by CellBase: 3364 (99.999% coincidence)
      • 61% (2060) of these are due to differences in miRNA data sources
      • 39% are due to difficulties parsing the VEP output format
    • Annotations provided by CellBase and not provided by VEP: 4918 (99.999% coincidence)
      • 60% (2970) of these are due to differences in miRNA data sources
      • The remainder are due to difficulties parsing the VEP output format
  • Coincidence at variant level (83M variants):
    • Variants with conflicting annotation: 4990 (99.994% coincidence)
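
The coincidence percentages in this list can be re-derived from the raw counts. The sketch below assumes the rounded totals quoted on this page (346M annotations, 83M variants); the helper name `coincidence` is made up for the example.

```python
def coincidence(discordant, total):
    """Percentage of items that agree, given how many do not."""
    return 100.0 * (1 - discordant / total)


# Totals as quoted on this page; they are rounded, so results are approximate.
TOTAL_ANNOTATIONS = 346_000_000
TOTAL_VARIANTS = 83_000_000

# SO term level: annotations reported by only one of the two tools.
print(f"VEP only:      {coincidence(3364, TOTAL_ANNOTATIONS):.3f}%")  # 99.999%
print(f"CellBase only: {coincidence(4918, TOTAL_ANNOTATIONS):.3f}%")  # 99.999%

# Variant level: variants with at least one conflicting annotation.
print(f"Variant level: {coincidence(4990, TOTAL_VARIANTS):.3f}%")     # 99.994%

# Share of each discordant set attributed to miRNA data source differences.
print(f"miRNA share (VEP only):      {2060 / 3364:.0%}")  # 61%
print(f"miRNA share (CellBase only): {2970 / 4918:.0%}")  # 60%
```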

Custom annotations

CellBase variant annotations can be complemented with custom annotations provided by the user. The variant annotation CLI accepts a VCF file that carries the custom annotations in its INFO column.
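
To make the expected input more concrete, the sketch below shows how key/value pairs in the INFO column of a VCF data line can be read. It is not CellBase's own implementation, only an illustration of the standard VCF layout. The INFO keys `MY_SCORE` and `IN_PANEL`, the function names and the `chromosome:position:ref:alt` key are hypothetical.

```python
def parse_info(info_field):
    """Split a VCF INFO string such as 'AF=0.1;DB' into a dictionary.

    Keys with a value map to a list of strings (values are comma-separated
    in VCF); flag keys without a value map to True.
    """
    annotations = {}
    if info_field in ("", "."):  # "." marks an empty INFO column
        return annotations
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        annotations[key] = value.split(",") if sep else True
    return annotations


def custom_annotations(vcf_line):
    """Return (variant key, INFO dictionary) for one tab-separated VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
    return f"{chrom}:{pos}:{ref}:{alt}", parse_info(info)


if __name__ == "__main__":
    # Hypothetical custom annotation line; MY_SCORE and IN_PANEL are made-up keys.
    line = "1\t1000\t.\tA\tG\t.\tPASS\tMY_SCORE=0.8;IN_PANEL"
    print(custom_annotations(line))
```

Lines with several alternate alleles would need the ALT column split on commas before building one key per allele; the sketch leaves that out for brevity.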