Analysis and figure generation code for the ABRF NGS Phase II Study on DNA-seq reproducibility

jfoox/abrfngs2


Multi-platform assessment of DNA sequencing performance in the ABRF Next-Generation Sequencing Study

Introduction

Analysis and figure generation code for the ABRF NGS Phase II Study on DNA-seq reproducibility. This repository includes SLURM scripts for computationally intensive steps such as alignment and variant calling (SLURM), shell scripts for post-processing calculations (bin), and R scripts used to create figures (Rmds).

Requirements

Reference materials

This study requires several resources in the reference directory. Included in this repo:

  1. GRCh38 reference genome GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
  2. RepeatMasker tracks from UCSC Table Browser

Required (not included; download separately):

  1. Download dbSNP VCF GCF_000001405.25.gz
  2. Download SDF reference for RTG analysis GRCh38.sdf.zip
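
Before running the analyses, it can help to confirm that the named resources are in place. The following is a minimal sketch and not part of this repository: the function name `missing_references` is hypothetical, and it assumes the files sit directly in the reference directory under the names listed above. The RepeatMasker tracks are left out because their file name is not given here.

```python
from pathlib import Path

# File names as listed above; their exact location inside the
# reference directory is an assumption of this sketch.
EXPECTED_FILES = [
    "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz",  # GRCh38 reference genome
    "GCF_000001405.25.gz",                                 # dbSNP VCF
    "GRCh38.sdf.zip",                                      # SDF reference for RTG
]


def missing_references(reference_dir):
    """Return the expected file names not found in reference_dir."""
    root = Path(reference_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_references("reference")
    if missing:
        print("Missing reference files:")
        for name in missing:
            print("  " + name)
    else:
        print("All expected reference files are present.")
```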

Primary analyses

The primary analyses are written as SLURM scripts (see SLURM Documentation), but the core commands can be extracted and run directly.

  1. Quality Control with FASTQC.slurm
  2. Reference-based alignment
  3. Downsample BAMs with downsampleBAMs.slurm
  4. Variant calling
  5. Other analyses
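
Because the core commands can run without a scheduler, one way to extract them is to drop the `#SBATCH` directive lines from a script. The sketch below assumes the scripts follow the usual SLURM layout (a shebang, `#SBATCH` directives, then commands). The function name `strip_sbatch_directives` is hypothetical and not part of this repository, and the example script content is invented for illustration.

```python
def strip_sbatch_directives(script_text):
    """Return script_text without its #SBATCH directive lines."""
    kept = [
        line
        for line in script_text.splitlines()
        if not line.lstrip().startswith("#SBATCH")
    ]
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # Hypothetical script content, for illustration only.
    example = "\n".join([
        "#!/bin/bash",
        "#SBATCH --job-name=example",
        "#SBATCH --mem=8G",
        "echo running core commands",
    ])
    print(strip_sbatch_directives(example), end="")
```

Commands that rely on SLURM environment variables such as `$SLURM_ARRAY_TASK_ID` would still need manual adjustment before running outside the scheduler.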

Figure Generation

The code used to generate all figures (primary and Extended Data) is provided in the Rmds directory, with the exception of Figure 1a, which was created in Adobe Illustrator. Rmds includes:

  1. Figure 1 (Depth of sequencing and mapping rate) with QCandMapping.R
  2. Figure 2 (Genome Coverage) with Coverage.R
    • Extended Data 3, 10
  3. Figure 3 (Mismatch rates) with Mismatch.R
  4. Figure 4 (Variant Detection) with Variants.R
  5. Figure 5 (Structural Variants) with SVs.R
    • Extended Data 7, 8, 9
  6. Figure 6 (Bacterial Sequencing) with Bacteria.R

Helper scripts

The bin directory contains Python and shell scripts that support the primary analyses above. These include:

| Script | Function |
| --- | --- |
| calculateMismatch.sh | Calculate mismatch rates in homopolymer and STR contexts |
| Mendel_upSetMatrixGen.py | Create UpSet plots for Mendelian violations |
| Mendel_violationsByType.py | Parse outputs of VBT for Mendelian violations |
| mendel.sh | Run VBT Mendelian violations |
| tables-error.sh | Generate mismatch histograms via BBMap |
| tables-mapping.sh | Several functions to create tables with alignment statistics |
| tables-variants.sh | Several functions to create tables with variant detection statistics |
| variantAllele_GTtoMatrix.py | Convert genotype matrix to TSV for plotting |
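
For readers unfamiliar with the term, a Mendelian violation is a child genotype that cannot be formed by taking one allele from each parent. The sketch below illustrates that definition for a single diploid, autosomal site. It is not how VBT or the scripts above are implemented; the function name `is_mendelian_violation` and the tuple-based genotype encoding are assumptions made for illustration.

```python
def is_mendelian_violation(mother, father, child):
    """Return True if the child genotype cannot be inherited from the parents.

    Each genotype is a pair of allele indices, e.g. (0, 1) for a
    heterozygous site (0 = reference, 1 = first alternate allele).
    """
    a, b = child
    # The child is consistent if one allele can come from the mother
    # and the other from the father, in either assignment.
    consistent = (a in mother and b in father) or (b in mother and a in father)
    return not consistent


if __name__ == "__main__":
    print(is_mendelian_violation((0, 0), (0, 0), (0, 1)))  # True
    print(is_mendelian_violation((0, 1), (0, 0), (0, 1)))  # False
```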

End notes

Please see XXX for publication.

The genomes in this study are available as EBV-immortalized B-lymphocyte cell lines (from Coriell) and as DNA (from Coriell and NIST). All data generated in this study from these genomes are publicly available in the NCBI Sequence Read Archive (SRA) under BioProject PRJNA646948, accessions SRR12898279-SRR12898354.
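
To work with the individual runs, the accession range can be expanded into a list. This is a minimal sketch that assumes the accessions in the range are consecutive, as the range notation above suggests; the function name `expand_sra_range` is hypothetical.

```python
def expand_sra_range(prefix, first, last):
    """Return accession strings from prefix+first to prefix+last, inclusive."""
    if last < first:
        raise ValueError("last must not be smaller than first")
    return [f"{prefix}{number}" for number in range(first, last + 1)]


if __name__ == "__main__":
    accessions = expand_sra_range("SRR", 12898279, 12898354)
    # Print the number of runs and the two endpoints of the range.
    print(len(accessions), accessions[0], accessions[-1])
```

Each accession could then be passed to a download tool such as `prefetch` from the SRA Toolkit.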

You can cite our work as follows: [tk]
