WARNING: Not actively maintained!
The scripts in this portion of the repository were used to ingest, reshape, and store variant annotations in a cloud analysis-ready format. You can use them to bring in a fresh copy of an annotation resource or as a starting point for curation of a new annotation resource.
This code currently works with annotation resources such as dbSNP and ClinVar, along with variant allele frequencies from the NHLBI GO Exome Sequencing Project (ESP), 1000 Genomes, ExAC, and the Genome Aggregation Database (gnomAD), but similar techniques could be applied to other annotation resources.
All steps are run in the cloud, but each individual step is launched manually.
Many variant annotation sources are encoded as VCF files, so we can use Google Genomics to import the resource and export it to BigQuery.
Follow the tutorial to run a dsub script to create individual tables holding dbSNP, ClinVar, ESP, etc.
A table with annotations for all possible SNPs of a particular genome reference is useful for:
- Examining SNP variation across different regions of the genome.
- Quickly annotating the SNPs for a cohort using a simple JOIN.
- Generating synthetic sequence variant datasets using the SNP allele frequencies from this table.
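As a rough illustration of the "simple JOIN" use case, here is a minimal sketch that uses Python's built-in `sqlite3` module as a local stand-in for BigQuery. The table names (`cohort`, `all_snps`), the column names, and the toy rows in the usage example are all hypothetical and are not taken from the tables this repository produces; the actual schema will differ.

```python
import sqlite3


def annotate_cohort(cohort_rows, annotation_rows):
    """Join cohort SNPs to an annotation table on (chrom, pos, ref, alt).

    Both table layouts are hypothetical, chosen only to illustrate the JOIN.
    cohort_rows:     (sample_id, chrom, pos, ref, alt)
    annotation_rows: (chrom, pos, ref, alt, rsid, af)
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cohort "
        "(sample_id TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
    )
    conn.execute(
        "CREATE TABLE all_snps "
        "(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, rsid TEXT, af REAL)"
    )
    conn.executemany("INSERT INTO cohort VALUES (?, ?, ?, ?, ?)", cohort_rows)
    conn.executemany(
        "INSERT INTO all_snps VALUES (?, ?, ?, ?, ?, ?)", annotation_rows
    )
    # Inner join: every cohort SNP present in the annotation table picks up
    # that table's annotation columns.
    rows = conn.execute(
        """
        SELECT c.sample_id, c.chrom, c.pos, c.ref, c.alt, a.rsid, a.af
        FROM cohort AS c
        JOIN all_snps AS a
          ON c.chrom = a.chrom
         AND c.pos = a.pos
         AND c.ref = a.ref
         AND c.alt = a.alt
        ORDER BY c.sample_id, c.chrom, c.pos
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    # Toy values for illustration only; not real variants or frequencies.
    cohort = [("sample_1", "chr1", 100, "A", "G")]
    annotations = [("chr1", 100, "A", "G", "example_id_1", 0.25)]
    for row in annotate_cohort(cohort, annotations):
        print(row)
```

In BigQuery the same idea would be a single query joining the cohort variants table to the all-possible-SNPs table on the equivalent position and allele fields.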
Follow the tutorial to create an all-possible-SNPs table for build 38 of the human genome reference.
The variants table generated by an export from Google Genomics does not include descriptions for its fields.
See add BigQuery descriptions for instructions on how to automatically populate the BigQuery schema description with the information from the VCF header.
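To give a sense of what extracting that information from a VCF header involves, the sketch below collects the `Description` attribute of each `##INFO` and `##FORMAT` meta-information line. It is not the script referenced above: the function name `descriptions_from_vcf_header` is hypothetical, the parsing assumes the `ID` attribute comes first as in standard VCF headers, and applying the resulting descriptions to a BigQuery table schema is deliberately left out.

```python
import re

# Matches meta-information lines such as:
#   ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">
_META_RE = re.compile(r"^##(INFO|FORMAT)=<(.*)>\s*$")
_ID_RE = re.compile(r"ID=([^,]+)")
_DESC_RE = re.compile(r'Description="((?:[^"\\]|\\.)*)"')


def descriptions_from_vcf_header(lines):
    """Return {(kind, field_id): description} for INFO and FORMAT fields.

    `lines` is any iterable of header lines, e.g. an open text file.
    Keys include the kind because INFO and FORMAT may reuse an ID.
    """
    descriptions = {}
    for line in lines:
        if not line.startswith("##"):
            # The "#CHROM ..." column line ends the meta-information block.
            break
        meta = _META_RE.match(line)
        if meta is None:
            continue  # e.g. ##fileformat, ##contig, ##FILTER
        kind, body = meta.groups()
        field_id = _ID_RE.match(body)  # assumes ID is the first attribute
        description = _DESC_RE.search(body)
        if field_id and description:
            descriptions[(kind, field_id.group(1))] = description.group(1)
    return descriptions


if __name__ == "__main__":
    header = [
        "##fileformat=VCFv4.2",
        '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    print(descriptions_from_vcf_header(header))
```

The resulting mapping could then be matched against the column names of the exported table to fill in each field's description.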