DIncalciLab/LFV-benchmark

A pipeline to generate and benchmark synthetic datasets as described in Sergi et al., 2024.

Introduction

DIncalciLab/LFV-benchmark is a bioinformatics pipeline to generate synthetic datasets to benchmark low-fraction somatic variant callers (variants with higher allele fractions can also be benchmarked).

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple computing infrastructures in a very portable manner. It uses Docker/Singularity containers, making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies. Where possible, these processes have been installed from the nf-core/modules repository.

Pipeline summary

  1. Generate artificial BAM files with NEAT
  2. Generate random variant sites (SNV and indels) with BAMsurgeon
  3. Spike-in the random variants in the artificial BAM files and generate a ground truth VCF (BAMsurgeon)
  4. Benchmark a set of low-fraction variant callers (including Mutect2, VarDict, FreeBayes and VarScan) against the ground truth

Requirements

  • Nextflow (>=21.10.3)
  • Picard (< 3.00) (a copy of picard.jar is provided in the /assets folder)
  • Java (< 17) (to avoid errors with Picard < 3.00)
  • matplotlib, openpyxl and cyvcf2 (for performance analysis and plots)
  • FASTA reference file, along with its index and dictionary (e.g. hg19/hg38); see the example below
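
If the index and dictionary are missing, they can be generated from the reference with samtools and the bundled picard.jar. A minimal sketch, assuming a reference named hg38.fa (the file name is a placeholder):

  # Build the FASTA index (produces hg38.fa.fai)
  samtools faidx hg38.fa
  # Build the sequence dictionary (produces hg38.dict) using the bundled picard.jar
  java -jar assets/picard.jar CreateSequenceDictionary R=hg38.fa O=hg38.dict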

Quick Start

  1. Install Nextflow (>=21.10.3)

  2. Install Java (< 17)

  3. Download the pipeline:

    nextflow pull https://github.com/DIncalciLab/LFV-benchmark

    Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile passed with the -profile option. You can chain multiple config profiles in a comma-separated string.

    • The pipeline comes with config profiles called docker, singularity, podman, shifter, charliecloud and conda which instruct the pipeline to use the named tool for software management. For example, -profile test,docker. A simple configuration (test_local) to run the pipeline locally on a low-memory machine is also provided. You can adapt it to your needs.
    • If you are using singularity, please use the nf-core download command to download images first, before running the pipeline. Setting the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
    • If you are using conda, it is highly recommended to use the NXF_CONDA_CACHEDIR or conda.cacheDir settings to store the environments in a central location for future pipeline runs (see the example below).
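
    For example, a shared cache location can be configured once in your shell profile; a minimal sketch, assuming /shared is a writable location (the paths are placeholders):

      # Re-use Singularity images across pipeline runs
      export NXF_SINGULARITY_CACHEDIR=/shared/singularity_cache
      # Re-use Conda environments across pipeline runs
      export NXF_CONDA_CACHEDIR=/shared/conda_cache
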
  4. Test the pipeline. Different use cases are given below.

  5. NB: Due to a bug in older versions of Pandas/NumPy, it is advised to run the script that calculates variant calling performance separately from the pipeline, as described in the "Use cases" section.

Use cases

Generate 2 tumor/normal pairs with a coverage of 30X and random SNVs

nextflow run DIncalciLab/LFV-benchmark --input_all <input_csv> --outdir <OUTDIR> --bed <path_to_bed> --fasta <FASTA> --samples 2 --coverage 30 --type snv --high_sensitivity -profile test_local

where:

  --input_all           path to a csv file containing the names of the samples to be generated (`/assets/samplesheet.csv` is an example of this file; a hypothetical sketch follows this list)
  --outdir              path of the output folder
  --bed                 path of the bed file containing the regions to be generated
  --fasta               path of the fasta file (e.g. hg19/hg38). NB: the fasta folder should also contain the fasta index and dictionary files
  --samples             number of samples to be generated
  --coverage            coverage of the artificial samples to be generated
  --type                type of mutations to spike in (SNVs/INDELs/both)
  --high_sensitivity    use tuned parameters for variant calling
  -profile test_local   run the pipeline using the profile for a low-memory machine (for testing purposes only)
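
As a purely hypothetical illustration, a minimal samplesheet listing sample names might look like the sketch below; the bundled /assets/samplesheet.csv is the authoritative example and its column layout may differ:

sample
sample_01
sample_02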

Please note that if you are running the pipeline locally (using -profile test_local) it is advised not to run the most computationally expensive variant callers in high-sensitivity mode (due to the large amount of time required).
To do so, either run the pipeline with the variant callers' default parameters (remove the --high_sensitivity flag) or add the following options to skip the FreeBayes/VarScan variant callers (a complete example follows):

--skip_freebayes
--skip_varscan
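
Putting it together, a minimal sketch of a local run that skips the slowest callers (the input file names are placeholders):

nextflow run DIncalciLab/LFV-benchmark --input_all samplesheet.csv --outdir results \
        --bed regions.bed --fasta hg38.fa --samples 2 --coverage 30 --type snv \
        --skip_freebayes --skip_varscan -profile test_local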

Then calculate the performance of the callers by launching the script benchmark_standalone.py located in the /bin folder, using the following command (a filled-in example is given after the option list):

./benchmark_standalone.py -t snv -s <BAMSURGEON_VCF> -v <VARIANT_CALLING_FOLDER> -o <OUTDIR>

where:

  -t              type of variants inserted by BAMSURGEON
  -s              directory with the VCF files generated by BAMSURGEON with the random SNVs inserted in the generated normal samples
  -v              output folder generated by the pipeline with the variant calling outputs (normally `outdir/variant_calling`)
  -o              output folder
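
A filled-in example, assuming the spiked VCFs were written to results/spiked_vcf and the calls to results/variant_calling (both directory names are placeholders for your own run):

./benchmark_standalone.py -t snv -s results/spiked_vcf -v results/variant_calling -o results/benchmark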

Generate tumor (with random SNVs) from input normal BAM and perform the benchmark

Very similar to the previous case. Just provide the folder containing the normal BAMs as input and use the --skip_normal_generation option so that the normal samples are not generated with NEAT:

nextflow run DIncalciLab/LFV-benchmark --input_normal <NORMALBAM> --outdir <OUTDIR> --bed <path_to_bed> --fasta <FASTA> --type snv --high_sensitivity --skip_normal_generation -profile test_local

Then calculate the performance of the callers by launching the script benchmark_standalone.py.

Perform the benchmark on 2 existing tumor/normal pairs with a coverage of 100X and with random SNVs inserted:

nextflow run DIncalciLab/LFV-benchmark --outdir <OUTDIR> --fasta <FASTA> --bed <BED> --input_normal <NORMALBAM> --input_tumor <TUMORBAM> --skip_normal_generation --skip_tumor_generation --high_sensitivity -profile test_local

and pass to --input_normal and --input_tumor the paths of the folders with the test BAM files, normal and tumor respectively (they can be found in the test_files folder of this repo).

Then calculate the performance as described previously. The VCF files generated by BAMSURGEON with the inserted SNVs can be found in test_files/spiked_vcf/SNV/100X.
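
For this test dataset, the performance step might therefore look like this (the output folder is a placeholder):

./benchmark_standalone.py -t snv -s test_files/spiked_vcf/SNV/100X -v <OUTDIR>/variant_calling -o results/benchmark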

Perform the benchmark on a single tumor sample with a coverage of 30,000X with SNVs/INDELs spiked in by BAMSURGEON

nextflow run DIncalciLab/LFV-benchmark --outdir <OUTDIR> --fasta <FASTA> --input_tumor <TUMORBAM> --bed <BED> --skip_normal_generation --skip_tumor_generation --high_sensitivity --tumor_only -profile test_local

The input tumor can be found in test_files/tumor_bam/high_coverage. Then calculate the performance of the callers by launching the script benchmark_standalone_tumor_only.py located in the /bin folder, using the following command:

./benchmark_standalone_tumor_only.py -t both -s <BAMSURGEON_VCF> -i <BAMSURGEON_VCF_INDEL> -v <VARIANT_CALLING_FOLDER> -o <OUTDIR>

where -i is the directory with the VCF files generated by BAMSURGEON with the random INDELs inserted in the generated samples. The VCF files generated by BAMSURGEON with the inserted SNVs/INDELs can be found in test_files/spiked_vcf/SNV/high_coverage and test_files/spiked_vcf/INDEL/high_coverage.
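
With the bundled test data, the call might look like this (the output folder is a placeholder):

./benchmark_standalone_tumor_only.py -t both -s test_files/spiked_vcf/SNV/high_coverage -i test_files/spiked_vcf/INDEL/high_coverage -v <OUTDIR>/variant_calling -o results/benchmark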

Usage

The pipeline steps can be run either together or separately. To run the entire workflow (i.e. generate artificial datasets, spike in somatic variants and benchmark the variant callers), run:

nextflow run DIncalciLab/LFV-benchmark \
        --outdir OUTDIR \
        [--readlen INT] \
        --fasta FASTA \
        [--bed BED] \
        [--coverage INT] \
        [--error_model ERROR_MODEL] \
        [--mutation_model MUT_MODEL] \
        [--gc_model GC_MODEL] \
        [--fraglen_model FRAGLEN_MODEL] \
        [--picardjar PICARDJAR] \
        [--type snv] \
        --samples 5

mandatory arguments:
  --outdir              output directory
  --fasta               path to the fasta file
  --samples             number of artificial samples to be generated

optional arguments:
  --picardjar           path to the picard.jar file (default: $HOME/.nextflow/assets/DIncalciLab/LFV-benchmark/assets/picard.jar)
  --readlen             int (default: 151)        length of artificial reads generated by NEAT
  --bed                 BED                       path of the BED file containing the regions to be generated
  --coverage            int (default: 10000)      Coverage depth of the artificial samples generated by NEAT
  --error_model         ERROR_MODEL               path of sequencing error model used by NEAT
  --mutation_model      MUT_MODEL                 path of mutational rate model used by NEAT
  --gc_model            GC_MODEL                  path of GC model used by NEAT
  --fraglen_model       FRAGLEN_MODEL             path of the fragment length model used by NEAT
  --type                string (default: both)    type of low-fraction variants to spike into the artificial samples. Can be: snv, indel, both

Default models utilized by NEAT are available in the /assets/neat_models folder.
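
As a concrete illustration, a minimal full run relying on the bundled NEAT models and default parameters might look like this (the reference and BED file names are placeholders):

nextflow run DIncalciLab/LFV-benchmark \
        --outdir results \
        --fasta hg38.fa \
        --bed regions.bed \
        --type snv \
        --samples 5 \
        -profile docker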

Skip the generation of normal samples

To spike in variants in existing normal/tumor samples, just skip the NEAT step (i.e. the generation of artificial normal samples) by adding the following options to the pipeline:

--input_normal              path to the folder containing the BAM samples to spike-in and benchmark
--skip_normal_generation

Skip the generation of normal and tumor samples

To run only the variant calling benchmark on existing normal/tumor BAM files, the spike-in step (performed by BAMsurgeon) can be skipped using the following options:

[--input_normal]          path to the folder containing the normal BAM samples to be used in the tumor-normal paired mode
--input_tumor             path to the folder containing the BAM samples to benchmark
--skip_normal_generation
--skip_tumor_generation

Skip the generation of tumor samples

To run the benchmark on the artificial files generated by NEAT (mainly for testing purposes), skip only the spike-in process using the following option:

--skip_tumor_generation

Skip the benchmark step

The benchmark step can be skipped using the following option:

--skip_benchmark

Moreover, each variant caller can be skipped independently using the corresponding option, e.g. --skip_vardict, --skip_mutect and so on.

High-sensitivity mode

To run the variant callers with the optimized set of parameters (see the related paper for details), just add the --high_sensitivity flag to the pipeline command.

Credits

DIncalciLab/LFV-benchmark was originally written by Aldo Sergi and Luca Beltrame.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

Ewels P, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
