📖 Table of Contents
The Great Genotyper is a population genotyping workflow. The workflow begins by preprocessing 4.2K short-read samples of 183TB raw data to create an 867GB Counting Colored De Bruijn Graph. The Great Genotyper uses the succinct CCDG to genotype any list of phased or unphased variants, leveraging the population information to increase both precision and recall. The Great Genotyper offers the same accuracy as the state-of-the-art with unprecedented performance. It took 100 hours to genotype 4.5M variants in the 4.2K samples using one server with 32 cores and 145GB of memory. A similar task would take months or even years using single-sample genotypers with the same computational resources.
git clone https://github.com/dib-lab/TheGreatGenotyper.git
cd TheGreatGenotyper/
conda env create -f environment.yml
conda activate gg
conda env config vars set CPATH=${CONDA_PREFIX}/include:${CPATH}
conda activate
conda activate gg
# Run CMake configure
cmake -Bbuild
# Run make with parallel execution.
cmake --build build -j4 # -j4 = execute 4 recipes simultaneously.
wget https://faculty.washington.edu/browning/beagle/beagle.22Jul22.46e.jar
wget https://bochet.gcc.biostat.washington.edu/beagle/genetic_maps/plink.GRCh38.map.zip
unzip plink.GRCh38.map.zip
uncompress test_data
gzip -d test_data/GRCh38_chr21.fa.gz test_data/test.vcf.gz test_data/test.unphased.vcf.gz
Download index(50G) for Simons Genome Diversity population(SGDP).
mkdir -p index/SGDP/
cd index/SGDP/
wget https://farm.cse.ucdavis.edu/~mshokrof/indexes/SGDP/graph.dbg
wget https://farm.cse.ucdavis.edu/~mshokrof/indexes/SGDP/graph.desc.tsv
wget https://farm.cse.ucdavis.edu/~mshokrof/indexes/SGDP/annotation.relaxed.row_diff_int_brwt.annodbg
wget https://farm.cse.ucdavis.edu/~mshokrof/indexes/SGDP/samples.csv
cd -
echo "index/SGDP/" > indexes
Run The great genotyper. It took 12 mins using 32 core and 85GB ram.
mkdir test_output
./build/pangenie/src/TheGreatGenotyper -g -i indexes -j 32 -t 32 -r test_data/GRCh38_chr21.fa -y emissions -v test_data/test.vcf -o - 2> log | bgzip > test_output/test.vcf.gz
tabix -p vcf test_output/test.vcf.gz
java -Xmx40G -jar beagle.22Jul22.46e.jar gt=test_output/test.vcf.gz out=test_output/test.phased nthreads=32 map=plink.chr21.GRCh38.map
tabix -p vcf test_output/test.phased.vcf.gz
bcftools +fill-tags test_output/test.phased.vcf.gz -Oz -o test_output/test.phased.tagged.vcf.bgz -- -t all
tabix -p vcf test_output/test.phased.tagged.vcf.bgz
Note: "-a" parameter is added to use kmer-only genotyping model.
mkdir test_output
./build/pangenie/src/TheGreatGenotyper -a -g -i indexes -j 32 -t 32 -r test_data/GRCh38_chr21.fa -y emissions -v test_data/test.unphased.vcf -o - 2> log | bgzip > test_output/test.vcf.gz
tabix -p vcf test_output/test.vcf.gz
java -Xmx40G -jar beagle.22Jul22.46e.jar gt=test_output/test.vcf.gz out=test_output/test.phased nthreads=32 map=plink.chr21.GRCh38.map
tabix -p vcf test_output/test.phased.vcf.gz
bcftools +fill-tags test_output/test.phased.vcf.gz -Oz -o test_output/test.phased.tagged.vcf.bgz -- -t all
tabix -p vcf test_output/test.phased.tagged.vcf.bgz
After running the Green workflow, Use the workflow output to phase the input variants and rerun the great genotyper as in the red workflow
java -Xmx40G -jar beagle.22Jul22.46e.jar gt=test_data/test.unphased.vcf ref=test_output/test.phased.vcf.gz out=test_data/test.GG.phased nthreads=32 map=plink.chr21.GRCh38.map
gzip -d test_data/test.GG.phased.vcf.gz
./build/pangenie/src/TheGreatGenotyper -g -i indexes -j 32 -t 32 -r test_data/GRCh38_chr21.fa -y emissions -v test_data/test.GG.phased.vcf -o - 2> log | bgzip > test_output/secondpass.vcf.gz
tabix -p vcf test_output/secondpass.vcf.gz
java -Xmx40G -jar beagle.22Jul22.46e.jar gt=test_output/secondpass.vcf.gz out=test_output/secondpass.phased nthreads=32 map=plink.chr21.GRCh38.map
tabix -p vcf test_output/secondpass.phased.vcf.gz
bcftools +fill-tags test_output/secondpass.phased.vcf.gz -Oz -o test_output/secondpass.phased.tagged.vcf.bgz -- -t all
tabix -p vcf test_output/secondpass.phased.tagged.vcf.bgz
The previous examples utilize a small index of 276 samples. The full index, which contains 4.2K samples encompassing the 1000 Genome Samples, Human Diversity Project, and Simons Diversity Project, is available at Full Index. This index comprises 29 sub-indexes, each stored in a separate folder. A script is provided to assist with downloading all the indexes. Metadata for the samples in the index can be found at index_metadata.csv.
The full index may require a significant amount of memory when dealing with a large number of variants. We have developed a Snakemake workflow that operates on each sub-index independently, thereby avoiding any population-wide filtering. Following this, it integrates the results from each sub-index, performs population-level filtering, and carries out Beagle imputation. The Snakemake script is available at this link.
You will need to specify the input in the configuration file, an example of which can be found here. You can run the workflow using the following command:
snakemake -j 5 --configfile config.yaml --use-conda
Query the frequent variants
bcftools view -q 0.9 test_output/test.phased.tagged.vcf.bgz |grep -vP "^#" |head
Query the rare variants
bcftools view -Q 0.1 test_output/test.phased.tagged.vcf.bgz |grep -vP "^#" |head
Manipulating VCF files containing genotypes for hundreds or thousands of samples can be a challenging task. To effectively navigate and analyze such data, we recommend utilizing Hail as a tool for exploring the resulting VCF file. I prepared a jupyter notebook containing examples of tasks to be done using hail like: stratifying allele frequencies by population, and plotting the PCA of the genotypes. The following figure is the PCA plot generated by Hail using "test_output/merged.vcf.bgz"
To run the notebook, You need to install :
- jupyter: https://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html
- Hail: https://hail.is/docs/0.2/getting_started.html
The notebook needs two input files: vcf file("test_output/test.phased.tagged.vcf.bgz") and samples description.
Moustafa Shokrof
C.Titus Brown
Tamer Mansour
Licensed under GPLv3-Clause.