Multi-platform assessment of DNA sequencing performance in the ABRF Next-Generation Sequencing Study
Analysis and figure generation code for the ABRF NGS Phase II Study on DNA-seq reproducibility. This repository includes scripts to run heavy lifting such as alignment and variant calling (SLURM), shell scripts to do post-processing calculations (bin), and R scripts used to create figures (Rmds).
- BBMap
- bedtools
- DeepVariant
- FASTQC
- GATK4
- Guppy (see documentation on Nanopore Community page)
- minimap2
- Picard
- RTG Tools
- samtools
- Sentieon
- Strelka2
This study requires several resources in the reference
directory. Included in this repo:
- GRCh38 reference genome
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
- As recommended by Heng Li's Which human reference genome to use?
- RepeatMasker tracks from UCSC Table Browser
Required:
- Download dbSNP VCF
GCF_000001405.25.gz
- Download SDF reference for RTG analysis
GRCh38.sdf.zip
Written in the form of SLURM scripts (see SLURM Documentation), but the core code can easily be taken out and run directly.
- Quality Control with
FASTQC.slurm
- Reference-based alignment
- Cell line-specific alignment
- using Sentieon with
sentieon.slurm
- using GATK with
GATK.slurm
- using Sentieon with
- Bacteria-specific alignment using Sentieon with
sentieon_bacteria.slurm
- Long read alignment with
minimap2.slurm
- Get alignment statistics with
BAMstats.slurm
- See
tables-mapping.sh
andtables-error.sh
below for creating tables with statistics based on alignments
- Cell line-specific alignment
- Downsample BAMs with
downsampleBAMs.slurm
- Variant calling
- Sentieon-based Haplotyper with
sentieonHaplotyper_on_downsampled.slurm
- Strelka2 with
strelka2.slurm
- DeepVariant with
DeepVariant.slurm
- GATK variant calling embedded in GATK script above
- Sentieon-based Haplotyper with
- Other analyses
- Calculate mismatch rate in BAM with
mismatchRate.slurm
- Estimate variant detection sensitivity and precision with
RTG.slurm
- Call FASTQs from ONT FAST5 data with
guppy.slurm
- Calculate mismatch rate in BAM with
The code used to generate all figures (primary and Extended Data) are provided in the Rmds
directory. (With the exception of Figure 1a, which was created in Adobe Illustrator.) Rmds
includes:
- Figure 1 (Depth of sequencing and mapping rate) with
QCandMapping.R
- Extended Data 1, 2 (see
DecoyMapping.R
)
- Extended Data 1, 2 (see
- Figure 2 (Genome Coverage) with
Coverage.R
- Extended Data 3, 10
- Figure 3 (Mismatch rates) with
Mismatch.R
- Figure 4 (Variant Detection) with
Variants.R
- Extended Data 4, 5 (see
VariantAlleles.R
), 6 (seeMendelViolations.R
)
- Extended Data 4, 5 (see
- Figure 5 (Structural Variants) with
SVs.R
- Extended Data 7, 8, 9
- Figure 6 (Bacterial Sequencing) with
Bacteria.R
The bin
directory contains python and shell scripts that enable primary analyses above. These include:
Script | Function |
---|---|
calculateMismatch.sh | Calculate mismatch rates in homopolymer and STR contexts |
Mendel_upSetMatrixGen.py | Create UpSet plots for Mendelian violations |
Mendel_violationsByType.py | Parse outputs of VBT for Mendelian violations |
mendel.sh | Run VBT Mendelian violations |
tables-error.sh | Generate mismatch histograms via BBMap |
tables-mapping.sh | Several functions to create tables with alignment statistics |
tables-variants.sh | Several functions to create tables with variant detection statistics |
variantAllele_GTtoMatrix.py | Convert genotype matrix to TSV for plotting |
Please see XXX for publication.
The genome sequences in this study are available as EBV-immortalized B-lymphocyte cell lines (from Coriell) as well as from DNA (from Coriell and NIST). All data generated within this study from these genomes are publicly available on NCBI Sequence Read Archive (SRA) under the BioProject PRJNA646948, within accessions SRR12898279-12898354.
You can cite our work as follows: [tk]