Snakemake workflow to analyse hematological malignancies in whole genome data when only a tumor sample is available
This Snakemake workflow uses modules from hydragenetics to process `.fastq` files and call different kinds of variants (SNVs, indels, CNVs, SVs). Alongside diagnosis-filtered `.vcf` files, the workflow produces a MultiQC `.html` report and some CNV plots. One of the modules contains the commercial Parabricks toolkit, which can be replaced by open-source GATK tools if required. The following modules are currently part of this pipeline:
- annotation
- cnv_sv
- compression
- misc
- parabricks
- prealignment
- qc
In order to use this module, the following dependencies are required:
Input data should be added to `samples.tsv` and `units.tsv`. The following information needs to be added to these files:
Column Id | Description |
---|---|
`samples.tsv` | |
sample | unique sample/patient id, one per row |
tumor_content | ratio of tumor cells to total cells |
`units.tsv` | |
sample | same sample/patient id as in `samples.tsv` |
type | data type identifier (one letter), one of T (Tumor), N (Normal), R (RNA) |
platform | type of sequencing platform, e.g. NovaSeq |
machine | specific machine id, e.g. NovaSeq instruments have @Axxxxx |
flowcell | identifier of the flowcell used |
lane | flowcell lane number |
barcode | sequence library barcode/index; connect forward and reverse indices by `+`, e.g. ATGC+ATGC |
fastq1/2 | absolute paths to forward and reverse reads |
adapter | adapter sequences to be trimmed, separated by comma |
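
Before starting a run it can help to check that both input tables carry the columns listed above. The following is a minimal sketch, not part of the workflow itself: the helper names `missing_columns` and `unknown_samples` are hypothetical, `fastq1/2` is assumed to mean two separate columns named `fastq1` and `fastq2`, and all example values are placeholders.

```python
import csv
import io

# Column names taken from the table above; "fastq1/2" is assumed to mean
# two separate columns named fastq1 and fastq2.
SAMPLES_COLUMNS = ["sample", "tumor_content"]
UNITS_COLUMNS = [
    "sample", "type", "platform", "machine", "flowcell",
    "lane", "barcode", "fastq1", "fastq2", "adapter",
]


def missing_columns(tsv_text, required):
    """Return the required column names absent from the header of a tab-separated table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    return [name for name in required if name not in header]


def unknown_samples(samples_text, units_text):
    """Return sample ids used in units.tsv that are not declared in samples.tsv."""
    declared = {row["sample"] for row in csv.DictReader(io.StringIO(samples_text), delimiter="\t")}
    used = {row["sample"] for row in csv.DictReader(io.StringIO(units_text), delimiter="\t")}
    return sorted(used - declared)


# Hypothetical example content; every value below is a placeholder.
samples_tsv = "sample\ttumor_content\nsample1\t0.8\n"
units_tsv = (
    "sample\ttype\tplatform\tmachine\tflowcell\tlane\tbarcode\tfastq1\tfastq2\tadapter\n"
    "sample1\tT\tNovaSeq\t@Axxxxx\tFLOWCELL1\tL001\tATGC+ATGC\t"
    "/path/to/R1.fastq.gz\t/path/to/R2.fastq.gz\tADAPTER1,ADAPTER2\n"
)

print(missing_columns(samples_tsv, SAMPLES_COLUMNS))
print(missing_columns(units_tsv, UNITS_COLUMNS))
print(unknown_samples(samples_tsv, units_tsv))
```

To check real files, read them with `open(path).read()` and pass the text to the same helpers; an empty list from each call means the headers and sample ids are consistent.
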
Reference files should be specified in `config.yaml`:
- A `.fasta` reference file of the human genome is required, as well as an `.fai` file and a bwa index of this file.
- A `.vcf` file containing known indel sites. For GRCh38, this file is available as part of the Broad GATK resource bundle on Google Cloud.
- An `.interval_list` file containing all whole genome calling regions. The GRCh38 version is also available on Google Cloud.
- The `trimmer_software` should be specified by indicating the rule to be used for trimming. This pipeline uses `fastp_pe`.
- `.bed` files defining regions of interest for different diagnoses. This pipeline assumes `ALL` and `AML`, with different gene lists for SNVs and SVs.
- For pindel, a `.bed` file containing the region that the analysis should be limited to.
- simple_sv_annotation comes with a panel and a fusion pair list, which should also be included in `config.yaml`.
- For annotation with SnpEff, a database is needed, which can be downloaded through the CLI.
- For VEP, a cache resource should be downloaded prior to running the workflow.
To run the workflow, `resources.yaml` is needed. It defines default resources as well as resources for specific rules. For Parabricks, the `gres` stanza is needed and should specify the number of GPUs available.
snakemake --profile my-profile
File | Description |
---|---|
`cnv_sv/cnvkit_diagram/{sample}_T.png` | chromosome diagram from cnvkit |
`cnv_sv/cnvkit_scatter/{sample}_T_{chromosome}.png` | scatter plot per chromosome from cnvkit |
`cnv_sv/cnvkit_vcf/{sample}_T.vcf` | `.vcf` output from cnvkit |
`cnv_sv/pindel/{sample}.vcf` | `.vcf` output from pindel |
`compression/crumble/{sample}_{type}.crumble.cram` | crumbled `.cram` file |
`compression/crumble/{sample}_{type}.crumble.cram.crai` | index for crumbled `.cram` file |
`compression/spring/{sample}_{flowcell}_{lane}_{barcode}_{type}.spring` | compressed `.fastq` file pair |
`tsv_files/{sample}_mutectcaller_t.aml.tsv` | `.tsv` file for Excel containing SNVs from mutect2 for AML |
`tsv_files/{sample}_mutectcaller_t.all.tsv` | `.tsv` file for Excel containing SNVs from mutect2 for ALL |
`tsv_files/{sample}_manta_t.aml.tsv` | `.tsv` file for Excel containing SVs from manta for AML |
`tsv_files/{sample}_manta_t.all.tsv` | `.tsv` file for Excel containing SVs from manta for ALL |
`qc/multiqc/multiqc.html` | `.html` report from MultiQC |
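
The `{sample}` placeholders in the output paths can be filled in to list what a finished run should have produced for one sample. The sketch below is illustrative only: `expected_outputs` and `missing_outputs` are hypothetical helpers, the results directory is an assumption, and only the patterns whose sole wildcard is `{sample}` are included.

```python
from pathlib import Path

# Output patterns from the table above that depend only on {sample}.
OUTPUT_PATTERNS = [
    "cnv_sv/cnvkit_diagram/{sample}_T.png",
    "cnv_sv/cnvkit_vcf/{sample}_T.vcf",
    "cnv_sv/pindel/{sample}.vcf",
    "tsv_files/{sample}_mutectcaller_t.aml.tsv",
    "tsv_files/{sample}_mutectcaller_t.all.tsv",
    "tsv_files/{sample}_manta_t.aml.tsv",
    "tsv_files/{sample}_manta_t.all.tsv",
    "qc/multiqc/multiqc.html",
]


def expected_outputs(sample):
    """Fill the {sample} wildcard in every pattern."""
    return [pattern.format(sample=sample) for pattern in OUTPUT_PATTERNS]


def missing_outputs(sample, results_dir):
    """Return the expected files that do not exist below results_dir."""
    root = Path(results_dir)
    return [path for path in expected_outputs(sample) if not (root / path).is_file()]


# "sample1" is a placeholder sample id.
for path in expected_outputs("sample1"):
    print(path)
```

Calling `missing_outputs("sample1", ".")` from the directory where the workflow was run would then report any of these files that are absent.
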