Deprecated. Updated benchmarks available at https://github.com/Sentieon/sentieon-matching-performance
This documents describes the capabilities of Sentieon DNAseq pipeline matching different versions of GATK germline pipelines. If you have any additional questions, please visit https://www.sentieon.com/support or contact the technical support at Sentieon Inc. at [email protected].
Fastq files of NA12878 were downloaded from FTP site:
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/140127_D00360_0011_AHGV6ADXX/Project_RM8398/
Hg38 and other databases were downloaded from GATK resource bundle.
See here: https://software.broadinstitute.org/gatk/download/bundle
Arguments | File |
---|---|
fasta | Homo_sapiens_assembly38.fasta |
known_Mills | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz |
known_1000G | 1000G_phase1.snps.high_confidence.hg38.vcf.gz |
known_dbsnp | dbsnp_146.hg38.vcf.gz |
calling_intervals_list | wgs_calling_regions.hg38.interval_list |
BWA 0.7.15-r1140:
bwa mem -M -Y -K 10000000 \
-R '@RG\tID:NA12878\tSM:NA12878\tPL:ILLUMINA' \
$fasta $fastq1 $fastq2 | \
samtools sort -o sorted.bam
samtools index sorted.bam
Sentieon:
sentieon bwa mem -M -Y -K 10000000 \
-R '@RG\tID:NA12878\tSM:NA12878\tPL:ILLUMINA' \
$fasta $fastq1 $fastq2 | \
sentieon util sort -i - \
-r $fasta -o sorted.bam --sam2bam
GATK3.7/3.8(Picard):
java -jar picard.jar MarkDuplicates \
I=sorted.bam \
O=deduplicated.bam \
M=duplication.metrics \
REMOVE_DUPLICATES=true \
CREATE_INDEX=true
GATK4:
gatk MarkDuplicates \
-I sorted.bam \
-O deduplicated.bam \
-M duplication.metrics \
--REMOVE_DUPLICATES true \
--CREATE_INDEX true
Sentieon:
sentieon driver -r $fasta -i sorted.bam \
--algo LocusCollector --fun score_info score.txt.gz
sentieon driver -r $fasta -i sorted.bam \
--algo Dedup --rmdup --score_info score.txt.gz deduped.bam
GATK 3.7/3.8:
java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-I deduplicated.bam \
-R $fasta \
--knownSites $known_Mills \
--knownSites $known_1000G \
--knownSites $known_dbsnp \
-o bqsr.grp
java -jar GenomeAnalysisTK.jar \
-T PrintReads \
-R $fasta \
-I deduplicated.bam \
-BQSR bqsr.grp \
-o recalibrated.bam
GATK 4:
gatk BaseRecalibrator \
-I deduplicated.bam \
-R $fasta \
--known-sites $known_Mills \
--known-sites $known_1000G \
--known-sites $known_dbsnp \
-O bqsr.grp
gatk ApplyBQSR \
-R $fasta \
-I deduplicated.bam \
--bqsr-recal-file bqsr.grp \
-O recalibrated.bam
Sentieon*:
sentieon driver -r $fasta \
-i deduped.bam \
--algo QualCal \
-k $known_dbsnp \
-k $known_1000G \
-k $known_Mills \
recal_data.table
*Sentieon variant callers can perform the recalibration on the fly using a pre-recalibration bam plus the recalibration table. Recalibrated bam can be generated by the ReadWriter algo.
# This step is optional
sentieon driver -i deduped.bam -q recal_data.table --algo ReadWriter recaled.bam
Command line to compare GATK and Sentieon DNAseq results:
Output of GATK is used as the baseline.
hap.py \
GATK.vcf.gz \
Sentieon.vcf.gz \
-o output_dir \
-r Homo_sapiens_assembly38.fasta \
--engine=vcfeval \
--engine-vcfeval-template hs38.sdf
Command line:
GATK 3.7/3.8:
java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-ERC GVCF \
-R $fasta \
-L $calling_intervals_list \
-I recalibrated.bam \
-o output.g.vcf.gz
java -jar GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R $fasta \
-L $calling_intervals_list \
--variant output.g.vcf.gz \
--dbsnp $known_dbsnp \
-o output.vcf.gz
Sentieon:
sentieon driver -r $fasta \
-i deduped.bam \
-q recal_data.table \
--interval $calling_intervals_list \
--algo Haplotyper \
--emit_mode gvcf \
output.g.vcf.gz
sentieon driver -r $fasta \
--interval $calling_intervals_list \
--algo GVCFtyper \
-v output.g.vcf.gz \
--call_conf 10 \
--emit_conf 10 \
-d $known_dbsnp \
output.vcf.gz
Results:
Type | TRUTH | QUERY | METRIC | |||||
---|---|---|---|---|---|---|---|---|
TOTAL | TP | FN | TOTAL | FP | Recall | Precision | F1_Score | |
INDEL | 848723 | 848238 | 485 | 874360 | 538 | 0.999429 | 0.999385 | 0.999407 |
SNP | 4001821 | 4000797 | 1024 | 4005753 | 1033 | 0.999744 | 0.999742 | 0.999743 |
Command line:
GATK 4.0:
gatk HaplotypeCaller \
-R $fasta \
-L $calling_intervals_list \
-I recalibrated.bam \
-ERC GVCF \
-O output.g.vcf.gz
gatk GenotypeGVCFs \
-R $fasta \
-L $calling_intervals_list \
-V output.g.vcf.gz \
--dbsnp $known_dbsnp \
-O output.vcf.gz
Sentieon:
sentieon driver -r $fasta \
-i deduped.bam \
-q recal_data.table \
--interval $calling_intervals_list \
--algo Haplotyper \
--emit_mode gvcf \
output.g.vcf.gz
sentieon driver -r $fasta \
--interval $calling_intervals_list \
--algo GVCFtyper \
-v output.g.vcf.gz \
--call_conf 10 \
--emit_conf 10 \
-d $known_dbsnp \
output.vcf.gz
Results:
Type | TRUTH | QUERY | METRIC | |||||
---|---|---|---|---|---|---|---|---|
TOTAL | TP | FN | TOTAL | FP | Recall | Precision | F1_Score | |
INDEL | 849960 | 846375 | 3585 | 874364 | 2434 | 0.995782 | 0.997216 | 0.996499 |
SNP | 4003643 | 3998527 | 5116 | 4005750 | 3319 | 0.998722 | 0.999171 | 0.998947 |
Command line:
GATK 4.1:
gatk HaplotypeCaller \
-R $fasta \
-L $calling_intervals_list \
-I recalibrated.bam \
-ERC GVCF \
-O output.g.vcf.gz
gatk GenotypeGVCFs \
-R $fasta \
-L $calling_intervals_list \
-V output.g.vcf.gz \
--dbsnp $known_dbsnp \
-O output.vcf.gz
Sentieon*:
sentieon driver -r $fasta \
-i deduped.bam \
-q recal_data.table \
--interval $calling_intervals_list \
--algo Haplotyper \
--emit_mode gvcf \
output.g.vcf.gz
sentieon driver -r $fasta \
--interval $calling_intervals_list \
--algo GVCFtyper \
-v output.g.vcf.gz \
-d $known_dbsnp \
--genotype_model multinomial \
output.vcf.gz
*Sentieon uses the option --genotype_model multinomial to match the output of the default newQual model in GATK 4.1.
Results:
Type | TRUTH | QUERY | METRIC | |||||
---|---|---|---|---|---|---|---|---|
TOTAL | TP | FN | TOTAL | FP | Recall | Precision | F1_Score | |
INDEL | 855716 | 850790 | 4926 | 894426 | 10869 | 0.994243 | 0.987848 | 0.991035 |
SNP | 3999272 | 3990379 | 8893 | 4006624 | 11826 | 0.997776 | 0.997048 | 0.997412 |
Computing environment:
- Google Compute Engine
- n1-standard-32 (32 vCPUs, 120 GB memory)
- Local SSD Scratch Disk 2x375G
- centos-7-v20190619
Stage | Sentieon | GATK3.8 | GATK4.0 | GATK4.1 |
---|---|---|---|---|
Alignment | 2:42:44 | 5:38:35 | 5:49:39 | 5:45:39 |
Dedup | 0:06:16 | 4:04:25 | 2:11:43 | 2:06:32 |
BQSR | 0:10:10 | 4:17:09 | 1:39:57 | 1:40:06 |
HaplotypeCaller | 0:41:02 | 3:21:37 | 6:56:53 | 5:37:52 |
GenotypeGVCFs | 0:00:55 | 2:04:08 | 2:02:55 | 2:05:22 |
Total | 3:41:07 | 19:25:54 | 18:41:07 | 17:15:31 |
Sentieon SpeedUp | -- | 5.3X | 5.1X | 4.7X |
The benchmark was performed on two different instances. Both instances have Intel® Xeon® Platinum 8124M CPU @ 3.00GHz with dual stripped NVMe SSD.
Intance | vCPU | Memory |
---|---|---|
c5d.9xlarge | 36 | 72GB |
c5d.18xlarge | 72 | 144GB |
On both instances, HG001 30x was processed and completed in less than 90 core-hours.
Machine | c5d.9xlarge | c5d.18xlarge | ||
Stage | time (hh:mm) | core*hours | time(hh:mm) | core*hours |
Alignment | 01:41 | 60.67 | 00:54 | 65.12 |
LocusCollector | 00:01 | 0.93 | 00:01 | 1.5 |
Dedup | 00:03 | 1.47 | 00:03 | 2.48 |
BQSR | 00:05 | 3.14 | 00:03 | 3.56 |
HC | 00:24 | 14.41 | 00:13 | 16.16 |
GVCFtyper | 00:01 | 0.3 | 00:01 | 0.34 |
Total | 02:24 | 80.92 | 01:24 | 89.16 |
For this evaluation, we used both HG001 and HG002 with depth of about 50x from the PrecisionFDA truth challenge. Reference b37 is used for this benchmark.
Sample | Reads 1 | Reads 2 |
---|---|---|
HG001 (50x) | HG001-NA12878-50x_1.fastq.gz | HG001-NA12878-50x_2.fastq.gz |
HG002 (50x) | HG002-NA24385-50x_1.fastq.gz | HG002-NA24385-50x_2.fastq.gz |
The truthset of HG001 and HG002 can be found at Giab latest release page.
Name | File |
---|---|
HG001 VCF | HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz |
HG001 BED | HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel.bed |
HG002 VCF | HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz |
HG002 BED | HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed |
Sample | Type | TP | FN | FP | Recall | Precision | F1_Score |
---|---|---|---|---|---|---|---|
HG001 | INDEL | 359926 | 3112 | 10133 | 0.9914 | 0.9726 | 0.9819 |
SNP | 2785549 | 1741 | 7236 | 0.9994 | 0.9974 | 0.9984 | |
HG002 | INDEL | 462614 | 806 | 1085 | 0.9983 | 0.9977 | 0.9980 |
SNP | 3046197 | 1640 | 5339 | 0.9995 | 0.9983 | 0.9989 |
Using Sentieon DNAscope with machine learning model, we are able to further improve the variant calling accuracy. Please see DNAscope Machine Learning Model for more details.