Skip to content

keyalone/awesome-bioinformatics-benchmarks

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Awesome Bioinformatics Benchmarks Build Status

A curated list of bioinformatics benchmarking papers and resources.

The credit for this format goes to Sean Davis for his awesome-single-cell repository and Ming Tang for his ChIP-seq-analysis repository.

If you have a benchmarking study that is not yet included on this list, please make a Pull Request.

Contents

Rules for Included Papers

  • Papers must be objective comparisons of 3 or more tools/methods.
  • Papers must be awesome. This list isn't meant to chronicle every benchmarking study ever performed, only those that are particularly expansive, well done, and/or provide unique insights.
  • Papers should generally not be from authors showing why their tool/method is better than others.
  • Benchmarking data should be publicly available or simulation code/methods must be well-documented and reproducible.

Additional guidelines/rules may be added as necessary.

Format & Organization

Please include the following information when adding papers.

Title:

Authors:

Journal Info:

Description:

Tools/methods compared:

Recommendation(s):

Additional links (optional):

Papers within each section should be ordered by publication date, with more recent papers listed first.

Benchmarking Theory

Title: Essential guidelines for computational method benchmarking

Authors: Lukas Weber, et al.

Journal Info: Genome Biology, June 2019

Description: This paper presents 10 main guidelines for conducting and writing benchmark papers which data, methods and metric choices, reproducible research, and documentation


Title: Systematic benchmarking of omics computational tools

Authors: Sergei Mangul, et al.

Journal Info: Nature Communications, March 2019

Description: A survey of 25 benchmarking studies published between 2011 and 2017 in terms of design, methods, and information types. Discusses overfitting, sharing, incentives.

Tool/Method Sections

Additional sections/sub-sections can be added as needed.

DNase, ATAC, and ChIP-seq

Peak Callers

Title: Features that define the best ChIP-seq peak calling algorithms

Authors: Reuben Thomas, et al.

Journal Info: Briefings in Bioinformatics, May 2017

Description: This paper compared six peak calling methods on 300 simulated and three real ChIP-seq data sets across a range of significance values. Methods were scored by sensitivity, precision, and F-score.

Tools/methods compared: GEM, MACS2, MUSIC, BCP, Threshold-based method (TM), ZINBA.

Recommendation(s): Varies. BCP and MACS2 performed the best across all metrics on the simulated data. For Tbx5 ChIP-seq, GEM performed the best, with BCP also scoring highly. For histone H3K36me3 and H3K4me3 data, all methods performed relatively comparably with the exception of ZINBA, which the authors could not get to run properly. MUSIC and BCP had a slight edge over the others for the histone data.

More generally, they found that methods that utilize variable window sizes and Poisson test to rank peaks are more powerful than those that use a Binomial test.


Title: A Comparison of Peak Callers Used for DNase-Seq Data

Authors: Hashem Koohy, et al.

Journal Info: PLoS ONE, May 2014

Description: This paper compares four peak callers specificity and sensitivity on DNase-seq data from two publications composed of three cell types, using ENCODE data for the same cell types as a benchmark. The authors tested multiple parameters for each caller to determine the best settings for DNase-seq data for each.

Tools/methods compared: F-seq, Hotspot, MACS2, ZINBA.

Recommendation(s): F-seq was the most sensitive, though MACS2 and Hotspot both performed competitively as well. ZINBA was the least performant by a massive margin, requiring much more time to run, and was also the least sensitive.

Normalization Methods

Title: ATAC-seq normalization method can significantly affect differential accessibility analysis and interpretation

Authors: Jake J. Reske, et al.

Journal Info: Epigenetics & Chromatin, April 2020

Description: This paper compares the effect of normalization method during differential ATAC-seq analysis.

Tools/methods compared: MACS2, DiffBind, csaw, voom, DEseq2, edgeR, limma

Recommendation(s): This paper compares 8 analytical approaches to calculate ATAC-seq differential accessibility (the description of different combination can be seen in paper Table1). The authors found different analytical approaches can produce very differential chromatin accessibility results using MA-plots.

The authors also proposed a generalized workflow for differential accessibility analysis, which can be found in Github

Additional links: For ATAC-Seq data anlysis, there is another paper: From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis

RNA-seq

Alignment/Quantification Methods

Title: Alignment and mapping methodology influence transcript abundance estimation

Authors: Avi Srivastava* & Laraib Malik*, et al.

Journal Info: bioRXiv, October 2019

Description: This paper compares the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis.

Tools/methods compared: Bowtie2, STAR, quasi-mapping, Selective Alignment, RSEM, Salmon.

Recommendation(s): When trying to choose an approach, a choice can be made by the user performing the analysis based on any time-accuracy tradeoff they wish to make. In terms of speed, quasi-mapping is the fastest approach, followed by Selective Alignment (SA) then STAR. Bowtie2 was considerably slower than all three of these approaches. However, in terms of accuracy, SA yielded the best results, followed by alignment to the genome (with subsequent transcriptomic projection) using STAR and SA (using carefully selected decoy sequences). Bowtie2 generally performed similarly to SA, but without the benefit of decoy sequences, seemed to admit more spurious mappings. Finally, lightweight mapping of sequencing reads to the transcriptome showed the lowest overall consistency with quantifications derived from the oracle alignments. Note: Both Selective Alignment and quasi-mapping are part of the salmon codebase.


Title: A comprehensive evaluation of ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification

Authors: Shanrong Zhao* & Baohong Zhang

Journal Info: BMC Genomics, February 2015

Description: This paper compares the effect of different gene annotations in the context of RNA-seq mapping and gene quantification using data from the Human Body Map 2.0 Project.

Tools/methods compared: Ensembl, Refseq, UCSC.

Recommendation(s): Though the authors warn there is no "best" set of annotations to use, they do emphasize the impact that annotation choice can have on downstream analyses such as differential gene expression, as genes with identical gene symbols can map to completely different regions in different annotations. Though Ensembl annotations are much more comprehensive than the others, the authors recommend a less complex genome annotation, such as the Refseq annotation, if the RNA-seq is being used as a replacement for microarrays. Conversely, the Ensembl annotations are preferrable if non-coding RNAs are of particular interest.

Normalisation Methods

Title: Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment

Authors: Marek Gierliński*, Christian Cole*, Pietá Schofield*, Nicholas J. Schurch*, et al.

Journal Info: Bioinformatics, November 2015

Description: This paper compares the effect of normal, log-normal, and negative binomial distribution assumptions on RNA-seq gene read-counts from 48 RNA-seq replicates.

Tools/methods compared: normal, log-normal, negative binomial.

Recommendation(s): Assuming a normal distribution leads to a large number of false positives during differential gene expression. A log-normal distribution model works well unless a sample contains zero counts. Use tools that assume a negative binomial distribution (edgeR, DESeq, DESeq2, etc).


Title: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis.

Authors: Marie-Agnès Dillies, et al.

Journal Info: Briefings in Bioinformatics, November 2013

Description: This paper compared seven RNA-seq normalization methods in the context of differential expression analysis on four real datasets and thousands of simulations.

Tools/methods compared: Total Count (TC), Upper Quartile (UQ), Median (Med), DESeq, edgeR, Quantile (Q), RPKM.

Recommendation(s): The authors recommend DESeq (DESeq2 now available as well) or edgeR, as those methods are robust to the presence of different library sizes and compositions, whereas the (still common) Total Count and RPKM methods are ineffective and should be abandoned.

Differential Gene Expression

Title: How well do RNA-Seq differential gene expression tools perform in a complex eukaryote? A case study in Arabidopsis thaliana.

Authors: Kimon Froussios*, Nick J Schurch*, et al.

Journal Info: Bioinformatics, February 2019

Description: This paper compared nine differential gene expression tools (and their underlying model distributions) in 17 RNA-seq replicates of Arabidopsis thaliana. Handling of inter-replicate variability and false positive fraction were the benchmarking metrics used.

Tools/methods compared: baySeq, DEGseq, DESeq, DESeq2, EBSeq, edgeR, limma, Poisson-Seq, SAM-Seq.

Recommendation(s): Six of the tools that utilize negative binomial or log-normal distributions (edgeR, DESeq2, DESeq, baySeq, limma, and EBseq control their identification of false positives well.

Additional links: The authors released their benchmarking scripts on Github.


Title: How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use?

Authors: Nicholas J. Schurch*, Pietá Schofield*, Marek Gierliński*, Christian Cole*, Alexander Sherstnev*, et al.

Journal Info: RNA, March 2016

Description: This paper compared 11 differential expression tools on varying numbers of RNA-seq biological replicates (3-42) between two conditions. Each tool was compared against itself as a standard (using all replicates) and against the other tools.

Tools/methods compared: baySeq, cuffdiff, DEGSeq, DESeq, DESeq2, EBSeq, edgeR (exact and glm modes), limma, NOISeq, PoissonSeq, SAMSeq.

Recommendation(s): With fewer than 12 biological replicates, edgeR and DESeq2 were the top performers. As replicates increased, DESeq did a better job minimizing false positives than other tools.

Additionally, the authors recommend at least six biological replicates should be used, rising to at least 12 if users want to identify all significantly differentially expressed genes no matter the fold change magnitude.


Title: Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data

Authors: Franck Rapaport, et al.

Journal Info: Genome Biology, September 2013

Description: This paper compared six differential expression methods on three cell line data sets from ENCODE (GM12878, H1-hESC, and MCF-7) and two samples from the SEQC study, which had a large fraction of differentially expressed genes validated by qRT-PCR. Specificity, sensitivity, and false positive rate were the main benchmarking metrics used.

Tools/methods compared: Cuffdiff, edgeR, DESeq, PoissonSeq, baySeq, limma.

Recommendation(s): Though no method emerged as favorable in all conditions, those that used negative binomial modeling (DESeq, edgeR, baySeq) generally performed best.

The more replicates, the better. Replicate numbers (both biological and technical) have a greater impact on differential detection accuracy than sequencing depth.

Gene Set Enrichment Analysis

Title: Toward a gold standard for benchmarking gene set enrichment analysis

Authors: Ludwig Geistlinger, et al.

Journal Info: Briefings in Bioinformatics, February 2020

Description: This paper developed a Bioconductor package for reproducible GSEA benchmarking, and used the package to assess 10 widely used enrichment methods with regard to runtime and applicability to RNA-seq data, fraction of enriched gene sets depending on the null hypothesis tested, and recovery of biologically relevant gene sets. The framework can be extended to additional methods, datasets, and benchmark criteria, and should serve as a gold standard for future GSEA benchmarking studies.

Tools/methods compared: The paper quantitatively asseses the performance of 10 enrichment methods (ORA, GSEA, GSA, PADOG, SAFE, CAMERA, ROAST, GSVA, GLOBALTEST, SAMGS). The paper also compares 10 frequently used enrichment tools implementing these methods (DAVID, ENRICHR, CLUSTER-PROFILER, GOSTATS, WEBGESTALT, G:PROFILER, GENETRAIL, GORILLA, TOPPGENE, PANTHER).

Recommendation(s):

  • ORA for the exploratory analysis of simple gene lists, pre-ranked GSEA or pre-ranked CAMERA for the analysis of pre-ranked gene lists accompanied by gene scores such as fold changes,
  • For enrichment analysis on the full expression matrix (genes x samples), the paper recommends to provide normalized log2 intensities for microarray data and logTPMs (or logRPKMs/logFPKMs) for RNA-seq data; when given raw read counts, the paper recommends to apply a variance-stabilizing transformation such as voom to arrive at library-size normalized logCPMs,
  • ROAST (sample group comparisons) or GSVA (single sample) if the question of interest is to test for association of any gene in the set with the phenotype (self-contained null hypothesis),
  • PADOG (simple experimental designs) or SAFE (extended experimental designs) if the question of interest is to test for excess of differential expression in a gene set relative to genes outside the set (competitive null hypothesis).

Additional links: http://bioconductor.org/packages/GSEABenchmarkeR

Differential Splicing

Title: A survey of software for genome-wide discovery of differential splicing in RNA-Seq data

Authors: Joan E Hooper

Journal Info: Human Genomics, January 2014

Description: This paper compares the methodologies, advantages, and disadvantages of eight differential splicing analysis tools, detailing use-cases and features for each.

Tools/methods compared: Cuffdiff2, MISO, DEXSeq, DSGseq, MATS, DiffSplice, Splicing compass, AltAnalyze.

Recommendation(s): This is a true breakdown of each tools' advantages and disadvantages. The author makes no recommendation due to the performance reliance on experimental setup, data type (e.g. AltAnalyze works best on junction + exon microarrays), and user objectives. Table 1 provides a good comparison of the features and methodology of each method.


Title: A benchmarking of workflows for detecting differential splicing and differential expression at isoform level in human RNA-seq studies

Authors: Gabriela A Merino

Journal Info: Briefings in Bioinformatics, March 2019

Description: This paper compares nine most commonly used workflows to detect differential isoform expression and splicing.

Tools/methods compared: EBSeq, DESeq2, NOISeq, Limma, LimmaDS, DEXSeq, Cufflinks, CufflinksDS, SplicingCompass.

Recommendation(s): DESeq2, Limma and NOISeq for differential isoform expression(DIE) analysis and DEXSeq and LimmaDS for differential splicing (DS) testing.

de novo Assembly and Quantification

Title: Benchmark analysis of algorithms for determining and quantifying full-length mRNA splice forms from RNA-seq data

Authors: Katharina E. Hayer et al.

Journal Info: Bioinformatics, Dec 2015

Description: This paper compared both guided and de novo transcript reconstruction algorithms using simulated and in vitro transcription (IVT) generated libraries. Precision/recall metrics were obtained by comparing the reconstructed transcripts to their true models.

Tools/methods compared: Cufflinks, CLASS, FlipFlop, IReckon, IsoLasso, MiTie, StringTie, StringTie-SR, AUGUSTUS, Trinity, SOAP, Trans-ABySS.

Recommendation(s): All tools measured produced less than ideal precision-recall (both <90%) when using imperfect simulated or IVT data and genes producing mulitple isoforms. Cufflinks and StringTie are among the best performers.


Title: De novo transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers

Authors: Martin Holzer & Manja Marz

Journal Info: GigaScience, May 2019

Description: This paper compares 10 de novo assembly tools across 9 RNA-seq datasets spanning multiple species and kingdoms for 20 biological-based and reference-free metrics.

Tools/methods compared: Trinity, Oases, Trans-ABySS, SOAPdenovo-Trans, IDBA-Tran, Bridger, BinPacker, Shannon, SPAdes-sc, SPAdes-rna.

Recommendation(s): The authors found that no tool's performance was dominant for all data sets, but Trinity, SPAdes, and Trans-ABySS were typically among the best. For assembly evaluation, the authors recommend a hybrid approach combining both biological-based (BUSCO, # of full length transcripts) and reference-free metric (e.g. TransRate, DETONATE).

Additional links: The authors provide a comprehensive electronic supplement website containing all metrics and assembly commands in addition to many supplementary figures.

Cell-Type Deconvolution

Title: Comprehensive evaluation of transcriptome-based cell-type quantification methods for immuno-oncology

Authors: Markus List*, Tatsiana Aneichyk*, et al.

Journal Info: Bioinformatics, July 2019

Description: This paper benchmarks and compares seven methods for computational deconvolution of cell-type abundance in bulk RNA-seq samples. Each method was tested on both simulated and true bulk RNA-seq samples validated by FACS.

Tools/methods compared: quanTIseq, TIMER, CIBERSORT, CIBERSORT abs. mode, MCPCounter, xCell, EPIC.

Recommendation(s): Varies. In general, the authors recommend EPIC and quanTIseq due to their overall robustness and absolute (rather than relative) scoring, though xCell is recommended for binary presence/absence of cell types and MCPcounter was their recommended relative scoring method.

Additional links: The authors created an R package called immunedeconv for easy installation and use of all these methods. For developers, they have made their benchmarking pipeline available so that others can reproduce/extend it to test their own tools/methods.


Title: Comprehensive benchmarking of computational deconvolution of transcriptomics data

Authors: Francisco Avila Cobos, et al.

Journal Info: bioRxiv, January 2020

Description: This paper compared the effects of transformation, scaling/normalization, marker selection, cell type composition, and deconvolution methods on computing cell type proportions in mixed bulk RNA-seq samples. Performance is assessed by means of Pearson correlation and root-mean-square-error (RMSE) between the cell type proportions computed by the different deconvolution methods and known compositions of 1000 pseudo-bulk mixtures from 4 different single cell RNA-seq datasets with varying numbers of cells.

Tools/methods compared: Transformation methods: linear (none), log, VST (DESeq2), sqrt. Scaling/normalization methods (bulk): column-wise, min-max, z-score, QN, UQ, row-wise, global min-max, global z-score, TPM, TMM, median ratios, LogNormalize. Scaling/normalization methods (single cell): RNBR, scran, scater, Linnorm. Deconvolution methods (bulk): OLS, NNLS, FARDEEP, RLR, lasso, ridge, DCQ, elastic net, DSA, EPIC, CIBERSORT, dtangle, ssFrobenius, ssKL, DeconRNASeq. Deconvolution methods (scRNA-seq reference): BisqueRNA, deconvSeq, DWLS, MuSiC, SCDC.

Recommendation(s): The authors strongly recommend keeping data in the linear scale for deconvolution, avoiding the use of column min-max, column z-score, and QN for normalization/scaling of bulk RNA-seq data, avoiding the use of row-normalization, column min-max, and TPM for normalization/scaling of single cell RNA-seq data, use all possible cell markers, ensure that all possible cell types are represented in the reference matrix, and use one of the top performing deconvolution methods - OLS, nnls, RLR, FARDEEP, CIBERSORT, DWLS, MuSiC, or SCDC.

Additional links: The authors provide their benchmarking code on Github.

RNA/cDNA Microarrays

Variant Callers

Germline SNP/Indel Callers

Title: Benchmarking variant callers in next-generation and third-generation sequencing analysis

Authors: Surui Pei, et al.

Journal Info: Briefings in Bioinformatics, July 2020

Description: This paper compared evaluated 11 modes among 6 variant callers on 12 NGS and TGS datasets on germline and somatic variant calling.

Tools/methods compared: Sentieon (TNscope, TNseq, DNAseq), DeepVariant (WGS), GATK (HC & MuTect2), NeuSomatic, VarScan2, Strelka2

Recommendation(s): All the four germline callers had comparable performance on NGS data. For TGS data, all the three callers had similar performance in SNP calling, while [DeepVariant]9https://github.com/google/deepvariant) outperformed the others in InDel calling. For somatic variant calling on NGS, Sentieon TNscope and GATK Mutect2 outperformed the other callers. Sentieon had the computational cost.

--

Title: Systematic comparison of germline variant calling pipelines cross multiple next-generation sequencers

Authors: Jiayun Chen, et al.

Journal Info: Scientific Reports, June 2019

Description: This paper compared three variant callers for WGS and WES samples from NA12878 across five next-gen sequencing platforms

Tools/methods compared: GATK, Strelka2, Samtools-Varscan2.

Recommendation(s): Though all methods tested generally scored well, Strelka2 had the highest F-scores for both SNP and indel calling in addition to being the most computationally performant.


Title: Comparison of three variant callers for human whole genome sequencing

Authors: Anna Supernat, et al.

Journal Info: Scientific Reports, December 2018

Description: The paper compared three variant callers for WGS samples from NA12878 at 10x, 15x, and 30x coverage.

Tools/methods compared: DeepVariant, GATK, SpeedSeq.

Recommendation(s): All methods had similar F-scores, precision, and recall for SNP calling, but DeepVariant scored higher across all metrics for indels at all coverages.


Title: A Comparison of Variant Calling Pipelines Using Genome in a Bottle as a Reference

Authors: Adam Cornish, et al.

Journal Info: BioMed Research International, October 2015

Description: This paper compared 30 variant calling pipelines composed of six different variant callers and five different aligners on NA12878 WES data from the "Genome in a Bottle" consortium.

Tools/methods compared:

  • Variant callers: FreeBayes, GATK-HaplotypeCaller, GATK-UnifiedGenotyper, SAMtools mpileup, SNPSVM
  • Aligners: bowtie2, BWA-mem, BWA-sampe, CUSHAW3, MOSAIK, Novoalign.

Recommendation(s): Novoalign with GATK-UnifiedGenotyper exhibited the highest sensitivity while producing few false positives. In general, BWA-mem was the most consistent aligner, and GATK-UnifiedGenotyper performed well across the top aligners (BWA, bowtie2, and Novoalign).

Somatic SNV/Indel callers

Title: Evaluation of Nine Somatic Variant Callers for Detection of Somatic Mutations in Exome and Targeted Deep Sequencing Data

Authors: Anne Bruun Krøigård, et al.

Journal Info: PLoS ONE, March 2016

Description: This paper performed comparisons between nine somatic variant callers on five paired tumor-normal samples from breast cancer patients subjected to WES and targeted deep sequencing.

Tools/methods compared: EBCall, Mutect, Seurat, Shimmer, Indelocator, SomaticSniper, Strelka, VarScan2, Virmid.

Recommendation(s): EBCall, Mutect, Virmid, and Strelka (now Strelka2) were most reliable for both WES and targeted deep sequencing. EBCall was superior for indel calling due to high sensitivity and robustness to changes in sequencing depths.


Title: Comparison of somatic mutation calling methods in amplicon and whole exome sequence data

Authors: Huilei Xu, et al.

Journal Info: BMC Genomics, March 2014

Description: Using the "Genome in a Bottle" gold standard variant set, this paper compared five somatic mutation calling methods on matched tumor-normal amplicon and WES data.

Tools/methods compared: GATK-UnifiedGenotyper followed by subtraction, MuTect, Strelka, SomaticSniper, VarScan2.

Recommendation(s): MuTect and Strelka (now Strelka2) had the highest sensitivity, particularly at low frequency alleles, in addition to the highest specificity.

CNV Callers

Title: Benchmark of tools for CNV detection from NGS panel data in a genetic diagnostics context

Authors: José Marcos Moreno-Cabrera, et al.

Journal Info: bioRxiv, November 2019.

Description: This paper compared five germline copy number variation callers against four genetic diagnostics datasets (495 samples, 231 CNVs validated by MLPA) using both default and optimized parameters. Sensitivity, specificity, positive predictive value, negative predictive value, F1 score, and various correlation coefficients were used as benchmarking metrics.

Tools/methods compared: DECoN, CoNVaDING, panelcn.MOPS, ExomeDepth, CODEX2.

Recommendation(s): Most tools performed well, but varied based on datasets. The authors felt DECoN and panelcn.MOPS with optimized parameters were sensitive enough to be used as screening methods in genetic dianostics.

Additional links: The authors have made their benchmarking code (CNVbenchmarkeR) available, which can be run to determine optimal parameters for each algorithm for a given user's data.


Title: An evaluation of copy number variation detection tools for cancer using whole exome sequencing data

Authors: Fatima Zare, et al.

Journal Info: BMC Bioinformatics, May 2017

Description: This paper compared six copy number variation callers on ten TCGA breast cancer tumor-normal pair WES datasets in addition to simulated datasets from VarSimLab. Sensitivity, specificity, and false-discovery rate were used as the benchmarking metrics.

Tools/methods compared: ADTEx, CONTRA, cn.MOPS, ExomeCNV, VarScan2, CoNVEX.

Recommendation(s): All tools suffered from high FDRs (~30-60%), but [ExomeCNV]https://github.com/cran/ExomeCNV) (a now defunct R package) had the highest overall sensitivity. VarScan2 had moderate sensitivity and specificity for both amplifications and deletions.

SV callers

Title: Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software

Authors: Daniel L. Cameron, et al.

Journal Info: Nature Communications, July 2019

Description: This paper compared 10 structural variant callers on four cell line WGS datasets (NA12878, HG002, CHM1, and CHM13) with orthogonal validation data. Precision and recall were the benchmarking metrics used.

Tools/methods compared: BreakDancer, cortex, CREST, DELLY, GRIDSS, Hydra, LUMPY, manta, Pindel, SOCRATES.

Recommendation(s): The authors found GRIDSS and manta consistently performed well, but also provide more general guidelines for both users and developers.

  • Use a caller that utilizes multiple sources of evidence and assembly.
  • Use a caller that can call all events you care about.
  • Ensemble calling is not a cure-all and generally don't outperform the best individual callers (at least on these datasets).
  • Do not use callers that rely only on paired-end data.
  • Calls with high read counts are typically artefacts.
  • Simulations aren't real - benchmarking solely on simulations is a bad idea.
  • Developers - be wary of incomplete trust sets and the potential for overfitting. Test tools on multiple datasets.
  • Developers - make your tool easy to use with basic sanity checks to protect against invalid inputs. Use standard file formats.
  • Developers - use all available evidence and produce meaningful quality scores.

Title: Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing

Authors: Shunichi Kosugi, et al.

Journal Info: Genome Biology, June 2019

Description: This study compared 69 structural variation callers on simulated and real (NA12878, HG002, and HG00514) datasets. F-scores, precision, and recall were the main benchmarking metrics.

Tools/methods compared: 1-2-3-SV, AS-GENESENG, BASIL-ANISE, BatVI, BICseq2, BreakDancer, BreakSeek, BreakSeq2, Breakway, CLEVER, CNVnator, Control-FREEC, CREST, DELLY, DINUMT, ERDS, FermiKit, forestSV, GASVPro, GenomeSTRiP, GRIDSS, HGT-ID, Hydra-sv, iCopyDAV, inGAP-sv, ITIS, laSV, Lumpy, Manta, MATCHCLIP, Meerkat, MELT, MELT-numt, MetaSV, MindTheGap, Mobster, Mobster-numt, Mobster-vei, OncoSNP-SEQ, Pamir, PBHoney, PBHoney-NGM, pbsv, PennCNV-Seq, Pindel, PopIns, PRISM, RAPTR, readDepth, RetroSeq, Sniffles, Socrates, SoftSearch, SoftSV, SoloDel, Sprites, SvABA, SVDetect, Svelter, SVfinder, SVseq2, Tangram, Tangram-numt, Tangram-vei, Tea, TEMP, TIDDIT, Ulysses, VariationHunter, VirusFinder, VirusSeq, Wham.

Recommendation(s): Varies greatly depending on type and size of the structural variant in addition to read length. GRIDSS, Lumpy, SVseq2, SoftSV, and Manta performed well calling deletions of diverse sizes. TIDDIT, forestSV, ERDS, and CNVnator called large deletions well, while pbsv, Sniffles, and PBHoney were the best performers for small deletions. For duplications, good choices included Wham, SoftSV, MATCHCLIP, and GRIDSS, while CNVnator, ERDS, and iCopyDAV excelled calling large duplications. For insertions, MELT, Mobster, inGAP-sv, and methods using long read data were most effective.


Title: Evaluating nanopore sequencing data processing pipelines for structural variation identification

Authors: Anbo Zhou, et al.

Journal Info: Genome Biology, November 2019

Description: This paper evaluated four alignment tools and three SV detection tools on four nanopore datasets (both simulated and real).

Tools/methods compared: aligners - minimap2, NGMLR, GraphMap, LAST. SV Callers - Sniffles, NanoSV, Picky.

Recommendation(s): The authors recommend using the minimap2 aligner in combination with the SV caller Sniffles because of their speed and relatively balanced performance.

Additional links (optional): The authors provide all code used in the study as well as a singularity package containing pre-installed programs and all seven pipeline.

Single Cell

scRNA Sequencing Protocols

Title: Benchmarking single-cell RNA-sequencing protocols for cell atlas projects

Authors: Elisabetta Mereu*, Atefeh Lafzi*, et al.

Journal Info: Nature Biotechnology, April 2020

Description: This paper evaluated 13 single cell/nuclei RNA-seq protocols to evaluate their aptitude for use in cell atlas-like projects. Using a single cell, multi-species mixture, the authors measured each protocol's ability to capture cell markers, gene detection power, clusterability (with and without integration with other protocols), mappability, and mixability.

Tools/methods compared: Quartz-seq2, Chromium, Smart-seq2, CEL-seq2, C1HT-medium, C1HT-small, ddSEQ, Chromium (single nuclei), Drop-seq, inDrop, ICELL8, MARS-seq, gmcSCRB-seq.

Recommendation(s): See figure 6 for a summary of benchmarking results for each method. Quartz-seq2 was the overall best performing, yielding superior results for gene detection and marker expression over other methods, though Chromium, Smart-seq2, and CEL-seq2 were also strong performers.

Additional links: The authors provide benchmarking code and analysis code in two different Github repositories - here and here.


Title: Systematic comparison of single-cell and single-nucleus RNA-sequencing methods

Authors: Jiarui Ding, et al.

Journal Info: Nature Biotechnology, April 2020

Description: This study evaluated seven methods for single-cell and/or single-nucleus RNA-sequencing on three types of samples: cell lines, PBMCs, and brain tissue. Evaluation metrics included the structure and alignment of reads, number of multiplets and detection sensitivity, and ability to recover known biological information.

Tools/methods compared: Smart-seq2, CEL-Seq2, 3' 10X Chromium, Drop-Seq, Seq-Well, inDrops, sci-RNA-seq.

Recommendation(s): Overall, the authors found 3' 10X Chromium to have the strongest consistent performance among the high-throughput methods, yielding the highest sensitivity, though it did not perform any better for cell type classification. When greater sensitivity is required, the authors recommend Smart-seq2 or CEL-Seq2, which both performed similarly. Supplementary table 7 includes an overview of each method's relative merits.

Additional links: The authors made their unified analysis pipeline (scumi) available as a python package, the repo of which also includes their R scripts used for cell filtering and cell type assignment.

scRNA Analysis Pipelines

Title: A systematic evaluation of single cell RNA-seq analysis pipelines

Authors: Beate Veith, et al.

Journal Info: Nature Communications, October 2019

Description: This study evaluated ~3000 pipeline combinations based on three mapping, three annotation, four imputation, seven normalization, and four differential expression testing approaches with five scRNA-seq library protocols on simulated data.

Tools/methods compared: scRNA-seq library prep protocols - SCRB-seq, Smart-seq2, CEL-seq2, Drop-seq, 10X Genomics. Mapping - bwa, STAR, kallisto. Annotation - gencode, refseq, vega. Imputation - filtering, DrImpute, scone, SAVER. Normalization - scran, SCnorm, Linnorm, Census, MR, TMM. Differential testing - edgeR-zingeR, limma, MAST, T-test.

Recommendation(s): Figure 5F contains a flowchart with the authors' recommendations. For alignment, STAR with Gencode annotations generally had the highest mapping and assignment rates. All mappers performed best with Gencode annotations. For normalization, scran was found to best handle potential assymetric differential expression and large numbers of differentially expressed genes. They also note that normalization is overall the most influential step, particularly if asymmetric DE is present (Figure 5). For Smart-seq2 data without spike-ins, the authors suggest Census may be the best choice. The authors found little benefit to imputation in most scenarios, particularly if one of the better normalization methods (e.g. scran) was used. The authors found library prep and normalization strategies to have a stronger effect on pipeline performance than the choice of differential expression tool, but generally found limma-trend to have the most robust performance.

Additional links: The authors made their simulation tool (powsimR) available on Github along with their pipeline scripts to reproduce their analyses.


Title: Comparison of high-throughput single-cell RNA sequencing data processing pipelines

Authors: Mingxuan Gao, et al.

Journal Info: Briefings in Bioinformatics, July 2020

Description: This study evaluated 7 scRNA-seq pipelines on 8 data sets.

Tools/methods compared: Drop-seq-tools version-2.3.0, Cell Ranger version-3.0.2, scPipe version-1.4.1, zUMIs version-2.4.5b, UMI-tools version-1.0.0, umis version-1.0.3, dropEst version-0.8.6

Recommendation(s): Cell Ranger shows the highest algorithm complexity and parallelization, whereas scPipe, umis and zUMIs have lower complexity that is suitable for large-scale scRNA-seq integration analysis. UMI-tools show the highest transcript quantification accuracy on ERCC datasets from three scRNA-seq platforms. Integration of expression matrices from different pipelines will introduce confounding factors akin to batch effect. zUMIs and dropEst have higher sensitivity to detectmore genes for single cells, which may also bring unwanted factors. For most downstream analysis, Drop-seq-tools, Cell Ranger and UMI-tools show high consistency, whereas umis and zUMIs show inconsistent results compared with the other pipelines.

scRNA Imputation Methods

Title: A Systematic Evaluation of Single-cell RNA-sequencing Imputation Methods

Authors: Wenpin Hou, et al.

Journal Info: bioRxiv, January 2020

Description: This paper evaluated 18 scRNA-seq imputation methods using seven datasets containing cell line and tissue data from several experimental protocols. The authors assessed the similarity of imputed cell profiles to bulk samples and investigated whether imputation improves signal recovery or introduces noise in three downstream applications - differential expression, unsupervised clustering, and trajectory inference.

Tools/methods compared: scVI, DCA, MAGIC, scImpute, kNN-smoothing, mcImpute, SAUCIE, DrImpute, PBLR, SAVER, VIPER, SAVERX, DeepImpute, scRecover, ALRA, bayNorm, AutoImpute, scScope.

Recommendation(s): Figure 6 provides a performance summary of the tested methods. In general, the authors recommend caution using any of these methods, as they can introduce significant variability and noise into downstream analyses. Of the methods tested, MAGIC, kNN-smoothing, and SAVER outperformed the other methods most consistently, though this varied widely across evaluation criteria, protocols, datasets, and downstream analysis. Many methods show no clear improvement over no imputation, and in some cases, perform significantly worse in downstream analyses.

Additional links (optional): The authors placed all of their benchmaking code on Github.

scRNA Differential Gene Expression

Title: Bias, robustness and scalability in single-cell differential expression analysis

Authors: Charlotte Soneson* & Mark D Robinson*

Journal Info: Nature Methods, February 2018

Description: This paper evaluated 36 approaches for determining differential gene expression from both synthetic and 36 real scRNA-seq datasets. The authors assess type I error control, FDR control and power, computational efficiency, and consistency.

Tools/methods compared: edgeRQLFDetRate, MASTcpmDetRate, limmatrend, MASTtpmDetRate, edgeRQLF, ttest, voomlimma, Wilcoxon, MASTcpm, MASTtpm, SAMseq, D3E, edgeRLRT, metagenomeSeq, edgeRLRTcensus, edgeRLRTdeconv, monoclecensus, ROTStpm, ROTSvoom, DESeq2betapFALSE, edgeRLRTrobust, monoclecount, DESeq2, DESeq2nofilt, ROTScpm, SeuratTobit, NODES, DESeq2census, scDD, BPSC, SCDE, DEsingle, monocle, SeuratBimodnofilt, SeuratBimodlsExpr2, SeuratBimod.

Recommendation(s): In general, the authors found that gene prefiltering was essential for good, robust performance from many methods. They note high variability between methods and summarize general performance across all metrics in Figure 5. They do not make recommendations as to a specific method/tool. Of note is that Seurat switched to using the wilcoxon test by default after this study was released, as it performed much better than their previously available methods.

Additional links: The authors make their benchmarking pipeline, conquer, available on Github. Their processed data and associated reports have also been made available for additional comparisons.

Trajectory Inference

Title: A comparison of single-cell trajectory inference methods

Authors: Wouter Saelens*, Robrecht Cannoodt*, et al.

Journal Info: Nat Biotech, April 2019

Description: A comprehensive evaluation of 45 trajectory inference methods, this paper provides an unmatched comparison of the rapidly evolving field of single-cell trajectory inference. Each method was scored on accuracy, scalability, stability, and usability. Should be considered a gold-standard for other benchmarking studies.

Tools/methods compared: PAGA, RaceID/StemID, SLICER, Slingshot, PAGA Tree, MST, pCreode, SCUBA, Monocle DDRTree, Monocle ICA, cellTree maptpx, SLICE, cellTree VEM, EIPiGraph, Sincell, URD, CellTrails, Mpath, CellRouter, STEMNET, FateID, MFA, GPfates, DPT, Wishbone, SCORPIUS, Component 1, Embeddr, MATCHER, TSCAN, Wanderlust, PhenoPath, topslam, Waterfall, EIPiGraph linear, ouijaflow, FORKS, Angle, EIPiGraph cycle, reCAT.

Recommendation(s): Varies depending on dataset and expected trajectory type, though PAGA, PAGA Tree, SCORPIUS, and Slingshot all scored highly across all metrics.

Authors wrote an interactive Shiny app to help users choose the best methods for their data.

Additional links: The dynverse site contains numerous packages for users to run and compare results from different trajectory methods on their own data without installing each individually by using Docker. Additionally, they provide several tools for developers to wrap and benchmark their own method against those included in the study.

Gene Regulatory Network Inference

Title: Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data

Authors: Aditya Pratapa, et al.

Journal Info: Nature Methods, January 2020

Description: The authors compared 12 gene regulatory network (GRN) inference techniques to assess the accuracy, robustness, and efficiency of each method on simulated data from synthetic networks, simulated data from curated models, and real scRNA-seq datasets.

Tools/methods compared: GENIE3, PPCOR, LEAP, SCODE, PIDC, SINCERITIES, SCNS, GRNVBEM, SCRIBE, GRNBoost2, GRISLI, SINGE.

Recommendation(s): In general, the authors recommend PIDC, GENIE3, or GRNBoost2, as they were the leading and consistent performers for both curated models and experimental datasets in terms of accuracy as well as having multithreaded implementations. Both the GENIE3 and GRNBoost2 methods can be used from the SCENIC workflows in either R or the (much faster) python implementation.

Additional links: The authors provide their benchmarking framework, BEELINE on Github, which also provides an easy-to-use and uniform interface to each method in the form of a Docker image.


Title: Evaluating methods of inferring gene regulatory networks highlights their lack of performance for single cell gene expression data

Authors: Shounan Chen* & Jessica C. Mar*

Journal Info: BMC Bioinformatics, June 2018

Description: This study compared 8 gene regulatory network inference methods (5 bulk RNA-seq, 3 specific to scRNA-seq) for scRNA-seq data for precision and recall.

Tools/methods compared: Partial correlation (Pcorr), Bayesian networks, GENIE3, ARACNE, CLR, SCENIC, SCODE, PIDC.

Recommendation(s): Generally, the authors found relatively poor performance across all methods both for simulated and real data. The results from each method had few similarities with other methods and high false positive rates, and the authors recommend caution when interpreting the networks reconstructed with these methods. The authors showed that many of these methods were dramatically affected by dropout events.

Integration/Batch Correction

Title: A benchmark of batch-effect correction methods for single-cell RNA sequencing data

Authors: Hoa Thi Nhu Tran et al.

Journal Info: Genome Biology, January 2020

Description: The authors compared 14 methods in terms of computational runtime, the ability to handle large datasets, and batch-effect correction efficacy while preserving cell type purity.

Tools/methods compared: Seurat2, Seurat3, Harmony, fastMNN, MNN Correct, ComBat, Limma, scGen, Scanorama, MMD-ResNet, ZINB-WaVe, scMerge, LIGER, BBKNN

Recommendation(s): Based on the benchmarking results authors suggest Harmony, LIGER, and Seurat3 as best methods for batch integration.

Dimensionality Reduction

Title: Accuracy, Robustness and Scalability of Dimensionality Reduction Methods for Single Cell RNAseq Analysis

Authors: Shiquan Sun, et al.

Journal Info: BioRxiv, October 2019

Description: A mammoth comparison of 18 different dimension reduction methods on 30 publicly available scRNAseq data sets in addition to 2 simulated datasets for a variety of purposes ranging from cell clustering to trajectory inference to neighborhood preservation.

Tools/methods compared: factor analysis (FA), principal component analysis (PCA), independent component analysis (ICA), Diffusion Map, nonnegative matrix factorization (NMF), Poisson NMF, zero-inflated factor analysis (ZIFA), zero-inflated negative binomial based wanted variation extraction (ZINB-WaVE), probabilistic count matrix factorization (pCMF), deep count autoencoder network (DCA), scScope, generalized linear model principal component analysis (GLMPCA), multidimensional scaling (MDS), locally linear embedding (LLE), local tangent space alignment (LTSA), Isomap, uniform manifold approximation and projection (UMAP), t-distributed stochastic neighbor embedding (tSNE).

Recommendation(s): Varies depending on use case. Factor Analysis and principal component analysis performed well for most use cases. See figure 5 for pratical guidelines.

Additional links: The authors have made their benchmarking code available on Github.


Title: Benchmarking principal component analysis for large-scale single-cell RNA-sequencing

Authors: Koki Tsuyuzaki, et al.

Journal Info: Genome Biology, January 2020

Description: This study compared 21 implementations of 10 algorithms across Python, R, and Julia for principal component analysis for scRNA-seq data, measuring scalability, computational efficiency, outlier robustness, t-SNE/UMAP replication, ease of use, and more using both synthetic and real datasets.

Tools/methods compared: PCA (sklearn, full), fit (MultiVariateStats.jl), Downsampling, IncrementalPCA (sklearn), irlba (irlba), svds (RSpectra), propack.svd (svd), PCA (sklearn, arpack), irlb (Cell Ranger), svds (Arpack.jl), orthiter (OnlinePCA.jl), gd (OnlinePCA.jl), sgd (OnlinePCA.jl), rsvd (rsvd), oocPCA_CSV (oocRPCA), PCA (sklearn, randomized), randomized_svd (sklearn), PCA (dask-ml), halko (OnlinePCA.jl), algorithm971 (OnlinePCA.jl).

Recommendation(s): Author recommendations vary based on the language being used and matrix size. See figure 8 for recommendations along with recommended parameter settings.

Additional links: The authors published their benchmarking scripts on Github.

Cell Annotation/Inference

Title: A comparison of automatic cell identification methods for single-cell RNA sequencing data

Authors: Tamim Abdelaal*, Lieke Michielsen*, et al.

Journal Info: Genome Biology, September 2019

Description: The authors benchmarked 22 classification methods that automatically assign cell identities including single-cell-specific and general-purpose classifiers across 27 publicly available single-cell RNA sequencing datasets of different sizes, technologies, species, and levels of complexity. Two types of experimental setups were used evaluate the performance of each method for within dataset predictions (intra-dataset) and across datasets (inter-dataset) based on accuracy, percentage of unclassified cells, and computation time.

Tools/methods compared: Garnett, Moana, DigitalCellSorter, SCINA, scVI, Cell-BLAST, ACTINN, LAmbDA, scmapcluster, scmapcell, scPred, CHETAH, CaSTLe, SingleR, scID, singleCellNet, LDA, NMC, RF, SVM, SVM<sub>rejection</sub>, kNN

Recommendation(s): All classifiers performed well. The authors recommended SVMrejection classifier (with a linear kernel). Other classifiers include SVM , singleCellNet, scmapcell, and scPred were also of high performances.

In their experiments, incorporating prior knowledge in the form of marker genes does not improve the performance.

Additional links: A Snakemake workflow, scRNAseq_Benchmark, was provided to automate the benchmarking analyses.

Variant Calling

Title: Systematic comparative analysis of single-nucleotide variant detection methods from single-cell RNA sequencing data

Authors: Fenglin Liu*, Yuanyuan Zhang*, et al.

Journal Info: Genome Biology, November 2019

Description: This paper compared seven variant callers using both simulation and real scRNA-seq datasets and identified several elements influencing their performance, including read depth, variant allele frequency, and specific genomic contexts. Sensitivity and specificity were the benchmarking metrics used.

Tools/methods compared: SAMtools, GATK, CTAT, FreeBayes, MuTect2, Strelka2, VarScan2.

Recommendation(s): Varies, see figure 7 for a flowchart breakdown. Generally, SAMtools (most sensitive, lower specificity in intronic or high-identity regions), Strelka2 (good performance when read depth >5), FreeBayes (good specificity/sensitivity in cases with high variant allele frequencies), and CTAT (no alignment step necessary) were top performers.

Additional links: The authors made their benchmarking code available on Github.

ATAC-seq

Title: Assessment of computational methods for the analysis of single-cell ATAC-seq data

Authors: Caleb Lareau*, Tommaso Andreani*, Micheal E. Vinyard*, et al.

Journal Info: Genome Biology, November 2019

Description: This study compares 10 methods for scATAC-seq processing and featurizing using 13 synthetic and real datasets from diverse tissues and organisms.

Tools/methods compared: BROCKMAN, chromVAR, cisTopic, Cicero, Gene Scoring, Cusanovich2018, scABC, Scasat, SCRAT, SnapATAC.

Recommendation(s): SnapATAC, Cusanovich2018, and cisTopic were the top performers for separating cell populations of different coverages and noise levels. SnapATAC was the only method capable of analyzing a large dataset (>80k cells).

Additional links: The authors have made their benchmarking code available on Github.

Statistics

False Discovery Rates

Title: A practical guide to methods controlling false discoveries in computational biology

Authors: Keegan Korthauer*, Patrick K. Kimes*, et al.

Journal Info: Genome Biology, June 2019

Description: An benchmark comparison of the accuracy, applicability, and ease of use of two classic and six modern methods that control for the false discovery rate (FDR). Used simulation studies as well as six case studies in computational biology (specifically differential expression testing in bulk RNA-seq, differential expression testing in single-cell RNA-seq, differential abundance testing and correlation analysis in 16S microbiome data, differential binding testing in ChIP-seq, genome-wide association testing, and gene set analysis).

Tools/methods compared: Benjamini-Hochberg, Storey’s q-value, conditional local FDR (LFDR), FDR regression (FDRreg), independent hypothesis weighting (IHW), adaptive shrinkage (ASH), Boca and Leek’s FDR regression (BL), and adaptive p-value thresholding (AdaPT).

Recommendation(s): Modern FDR methods that use an informative covariate (as opposed to only p-values) leads to more power while controlling the FDR over classic methods. The improvement of the modern FDR methods over the classic methods increases with the informativeness of the covariate, total number of hypothesis tests, and proportion of truly non-null hypotheses.

Additional links: Full analyses of the in silico experiments, simulations, and case studies are provided in Additional files 2–41 at https://pkimes.github.io/benchmark-fdr-html/. The source code to reproduce all results in the manuscript and additional files, as well as all figures, is available on GitHub. An ExperimentHub package containing the full set of results objects is available through the Bioconductor project, and a Shiny application for interactive exploration of these results is also available on GitHub. The source code, ExperimentHub package, and Shiny application are all made available under the MIT license.

Microbiome

Diversity analysis

Title: Evaluating Bioinformatic Pipeline Performance for Forensic Microbiome Analysis

Authors: Sierra F. Kaszubinski*, Jennifer L. Pechal*, et al.

Journal Info: Journal of Forensic Sciences, 2019

Description: Sequence reads from postmortem microbiome samples were analyzed with mothur v1.39.5, QIIME2 v2018.11, and MG-RAST v4.0.3. For postmortem data, MG-RAST had a much smaller effect size than mothur and QIIME2 due to the twofold reduction in samples. QIIME2 and Mothur returned similar results, with Mothur showing inflated richness due to unclassified taxa. Adjusting minimum library size had significant effects on microbial community structure, sample size less so except for low abundant taxa.

Tools/methods compared: mothur, QIIME2, MG-RAST

Recommendation(s): QIIME2 was deemed the most appropriate choice for forensic analysis in this study.

Additional links: Sequence data are archived through the European Bioinformatics Institute European Nucleotide Archive (www.ebi.ac.uk/ena) under accession number: PRJEB22642. Pipeline parameters and microbial community analyses are available on GitHub (https://github.com/sierrakasz/postmortem-analysis).

Contributors

About

A curated list of bioinformatics bench-marking papers and resources.

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published