genomic-medicine-sweden · monikaBrandt · Oct 17, 2024 · Oct 15, 2024
@@ -6,10 +6,10 @@ See the [biomarkers hydra-genetics module](https://hydra-genetics-biomarker.read
 
 ## Pipeline output files:
 
-* `results/dna/tmb/{sample}_{type}.TMB.txt`
-* `results/dna/msi/{sample}_{type}.msisensor_pro.score.tsv`
-* `results/dna/hrd/{sample}_{type}.purecn.scarhrd_cnvkit_score.txt`
-* `results/dna/hrd/{sample}_{type}.pathology.scarhrd_cnvkit_score.txt`
+* `results/dna/{sample}_{type}/tmb/{sample}_{type}.TMB.txt`
+* `results/dna/{sample}_{type}/msi/{sample}_{type}.msisensor_pro.score.tsv`
+* `results/dna/{sample}_{type}/hrd/{sample}_{type}.purecn.scarhrd_cnvkit_score.txt`
+* `results/dna/{sample}_{type}/hrd/{sample}_{type}.pathology.scarhrd_cnvkit_score.txt`
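
The layout change in this diff moves each per-sample result file into a `{sample}_{type}` directory. The sketch below shows one way downstream scripts could translate an old path to the new layout. The helper name `to_per_sample_layout` is hypothetical, and it assumes that the `{sample}_{type}` prefix of the file name contains no dots. Run-level files such as `results/dna/qc/multiqc_DNA.html` keep their old location and should not be passed to it.

```python
import re
from pathlib import PurePosixPath

# Old layout: results/<dna|rna>/<subdir>/<file>
OLD_LAYOUT = re.compile(
    r"^results/(?P<analyte>dna|rna)/(?P<subdir>[^/]+)/(?P<filename>[^/]+)$"
)


def to_per_sample_layout(old_path: str) -> str:
    """Map results/dna/<subdir>/<file> to results/dna/<sample>_<type>/<subdir>/<file>.

    Assumption: the file name starts with "{sample}_{type}." and that
    prefix contains no dots, so it can be cut at the first dot.
    """
    match = OLD_LAYOUT.match(old_path)
    if match is None:
        raise ValueError(f"not an old-layout result path: {old_path}")
    filename = match["filename"]
    sample_type = filename.split(".", 1)[0]
    return str(
        PurePosixPath(
            "results", match["analyte"], sample_type, match["subdir"], filename
        )
    )


if __name__ == "__main__":
    print(to_per_sample_layout("results/dna/tmb/S1_T.TMB.txt"))
```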
 
 ## Tumor mutational burden (TMB)
 TMB is a measure of the frequency of somatic mutations and is usually reported as mutations per megabase. The design size of the exons is approximately 1.55Mb. However, by validating the TMB for GMS560 against Foundation One and TSO500 TMB, the effective design size is adjusted to 1.19Mb. This is based on the slope (0.84) of the correlation between TSO500 data and the number of variants in the TMB analysis. The TMB is calculated using the in-house script **[tmb.py](https://github.com/hydra-genetics/biomarker/blob/develop/workflow/scripts/tmb.py)** ([rule](https://github.com/hydra-genetics/biomarker/blob/develop/workflow/rules/tmb.smk)), which counts the number of nsSNVs and divides by the adjusted design size. Variants must fulfill the following criteria to be counted:
@@ -36,7 +36,7 @@ The result is the TMB calculated using nsSNVs. However, the variants passing all
 
 ### Result file
 
-* `results/dna/tmb/{sample}_{type}.TMB.txt`
+* `results/dna/{sample}_{type}/tmb/{sample}_{type}.TMB.txt`
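
The final step of the TMB calculation described in this section is a single division. The sketch below only illustrates that arithmetic. It is not the code in `tmb.py`, the function name `tmb_per_mb` is hypothetical, and the variant filtering that decides which nsSNVs are counted is assumed to have happened already.

```python
# Effective design size in megabases, as stated in the TMB section.
ADJUSTED_DESIGN_SIZE_MB = 1.19


def tmb_per_mb(
    n_nssnv: int, design_size_mb: float = ADJUSTED_DESIGN_SIZE_MB
) -> float:
    """Return mutations per megabase for an already filtered nsSNV count."""
    if n_nssnv < 0:
        raise ValueError("variant count must be non-negative")
    if design_size_mb <= 0:
        raise ValueError("design size must be positive")
    return n_nssnv / design_size_mb


if __name__ == "__main__":
    # 12 counted variants over 1.19 Mb gives roughly 10.1 mutations/Mb.
    print(round(tmb_per_mb(12), 1))
```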
 
 ## Microsatellite instability (MSI)
 To determine the MSS or MSI status of a sample, the percentage of sites that show microsatellite instability is calculated using **[MSIsensor-pro](https://github.com/xjtu-omics/msisensor-pro)** v1.1.a. When more than 10% of the sites are unstable, the sample is determined to have MSI status. The program uses a panel of normals to determine the normal level of instability at the sites used.
@@ -48,7 +48,7 @@ To determine MSS or MSI status of the samples the percentage of sites that have
 
 ### Result file
 
-* `results/dna/msi/{sample}_{type}.msisensor_pro.score.tsv`
+* `results/dna/{sample}_{type}/msi/{sample}_{type}.msisensor_pro.score.tsv`
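
The 10% rule stated in this section can be sketched as a small classifier. This is an illustration of the threshold only: the function name `msi_status` is hypothetical, the site counts are assumed to have been read from the MSIsensor-pro score file by the caller, and the comparison is strict because the text says "more than 10%".

```python
# Threshold from the MSI section: more than 10% unstable sites means MSI.
MSI_THRESHOLD_PERCENT = 10.0


def msi_status(
    unstable_sites: int,
    total_sites: int,
    threshold_percent: float = MSI_THRESHOLD_PERCENT,
) -> str:
    """Classify a sample as "MSI" or "MSS" from its site counts."""
    if total_sites <= 0:
        raise ValueError("total_sites must be positive")
    if not 0 <= unstable_sites <= total_sites:
        raise ValueError("unstable_sites must be between 0 and total_sites")
    percent_unstable = 100.0 * unstable_sites / total_sites
    # Strictly greater than the threshold, so exactly 10% is still MSS.
    return "MSI" if percent_unstable > threshold_percent else "MSS"


if __name__ == "__main__":
    print(msi_status(11, 100))
    print(msi_status(10, 100))
```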
 
 ## Homologous recombination deficiency (HRD) - in development
 **Note! The homologous recombination deficiency score is still under development**  
@@ -69,7 +69,7 @@ A homologous recombination deficiency score is calculated using **[scarHRD](http
 
 ### Result files
 
-* `results/dna/hrd/{sample}_{type}.purecn.scarhrd_cnvkit_score.txt`
-* `results/dna/hrd/{sample}_{type}.pathology.scarhrd_cnvkit_score.txt`
+* `results/dna/{sample}_{type}/hrd/{sample}_{type}.purecn.scarhrd_cnvkit_score.txt`
+* `results/dna/{sample}_{type}/hrd/{sample}_{type}.pathology.scarhrd_cnvkit_score.txt`
 
 <br />
@@ -6,8 +6,8 @@ See the [fusions hydra-genetics module](https://hydra-genetics-fusions.readthedo
 
 ## Pipeline output files:
 
-* `results/dna/fusion/{sample}_{type}.gene_fuse_report.tsv` (with UMI option only)
-* `results/dna/fusion/{sample}_{type}.fuseq_wes.report.csv`
+* `results/dna/{sample}_{type}/fusion/{sample}_{type}.gene_fuse_report.tsv` (with UMI option only)
+* `results/dna/{sample}_{type}/fusion/{sample}_{type}.fuseq_wes.report.csv`
 
 ## Fusions calling using GeneFuse
 DNA fusion calling is performed by **[GeneFuse](https://github.com/OpenGene/GeneFuse)** v0.6.1 on fastq-files. It uses a gene transcript target file to limit the number of targets to analyze.
@@ -49,7 +49,7 @@ The output from GeneFuse is filtered and then reported into a fusion report usin
 
 ### Result file
 
-* `results/dna/fusion/{sample}_{type}.gene_fuse_report.tsv`
+* `results/dna/{sample}_{type}/fusion/{sample}_{type}.gene_fuse_report.tsv`
 
 <br />
 
@@ -98,6 +98,6 @@ The output from FuSeq_WES is filtered and then reported into a fusion report usi
 
 ### Result file
 
-* `results/dna/fusion/{sample}_{type}.fuseq_wes_report.tsv`
+* `results/dna/{sample}_{type}/fusion/{sample}_{type}.fuseq_wes_report.tsv`
 
 <br />
@@ -7,7 +7,7 @@ See the [qc hydra-genetics module](https://hydra-genetics-qc.readthedocs.io/en/l
 ## Pipeline output files:
 
 * `results/dna/qc/multiqc_DNA.html`
-* `results/dna/qc/{sample}_{type}.coverage_and_mutations.tsv`
+* `results/dna/{sample}_{type}/qc/{sample}_{type}.coverage_and_mutations.tsv`
 * `gvcf_dna/{sample}_{type}.mosdepth.g.vcf.gz`
 
 ## MultiQC
@@ -107,6 +107,6 @@ levels:
 
 ### Result file
 
-* `results/dna/qc/{sample}_{type}.coverage_and_mutations.tsv`
+* `results/dna/{sample}_{type}/qc/{sample}_{type}.coverage_and_mutations.tsv`
 
 <br />
@@ -6,8 +6,8 @@ See the [snv_indels hydra-genetics module](https://hydra-genetics-snv-indels.rea
 
 ## Pipeline output files:
 
-* `results/dna/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.vcf`
-* `results/dna/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.qci.vcf`
+* `results/dna/{sample}_{type}/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.vcf`
+* `results/dna/{sample}_{type}/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.qci.vcf`
 * `bam_dna/mutect2_indel_bam/{sample}_{type}.bam`
 
 
@@ -203,14 +203,14 @@ Two or more variants affecting the same codon can have different clinical implic
 
 ### Result file
 
-* `results/dna/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.vcf`
+* `results/dna/{sample}_{type}/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.vcf`
 
 ## QCI AF correction of vcf
 The clinical interpretation tool QCI calculates allele frequency from the AD FORMAT field instead of using the AF FORMAT field supplied by the callers. This has been shown to give wrong values, especially for INDELs. The AD field is therefore corrected so that the allele frequency based on the AD field corresponds to the AF field. This correction of the vcf file is performed by the in-house script [fix_vcf_ad_for_qci.py](https://github.com/genomic-medicine-sweden/Twist_Solid/blob/develop/workflow/scripts/fix_vcf_ad_for_qci.py) ([rule and config](softwares.md#fix_vcf_ad_for_qci)).
 
 ### Result file
 
-* `results/dna/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.qci.vcf`
+* `results/dna/{sample}_{type}/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.qci.vcf`
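
To make the idea behind the AD correction concrete, here is a minimal sketch of one possible way to rescale AD so that it reproduces the caller's AF. It is an assumption for illustration and not necessarily what `fix_vcf_ad_for_qci.py` does: the function name `correct_ad` is hypothetical, only a single alternative allele is handled, and the total depth (the sum of the original AD values) is kept unchanged.

```python
from typing import Tuple


def correct_ad(ref_depth: int, alt_depth: int, af: float) -> Tuple[int, int]:
    """Return (ref, alt) depths whose ratio matches the caller's AF.

    Assumption: the total depth ref + alt is preserved and the alt depth
    is set to the nearest integer of af * total.
    """
    if ref_depth < 0 or alt_depth < 0:
        raise ValueError("depths must be non-negative")
    if not 0.0 <= af <= 1.0:
        raise ValueError("af must be between 0 and 1")
    total = ref_depth + alt_depth
    new_alt = round(af * total)
    return total - new_alt, new_alt


if __name__ == "__main__":
    # AD says 10/100 = 0.10 but the caller reports AF = 0.20.
    print(correct_ad(90, 10, 0.2))
```

With this approach a tool that derives allele frequency as alt / (ref + alt) would arrive at a value close to the AF field, up to integer rounding.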
 
 ## GATK Mutect2 variant bam file
 When **[GATK Mutect2](https://gatk.broadinstitute.org/hc/en-us/articles/360037593851-Mutect2)** finds INDEL candidates it realigns reads in these regions and outputs a realigned bam-file covering these INDEL regions. This makes it possible to inspect INDELs called by Mutect2 in [IGV](https://software.broadinstitute.org/software/igv/). As Mutect2 runs on individual chromosomes these bam-files are then merged, sorted and indexed before.

@@ -16,6 +16,8 @@ digraph snakemake_dag {
 	48[label = "cnvkit_vcf", color = "0.17 0.6 0.85", style="rounded"];
 	49[label = "cnvkit_call", color = "0.23 0.6 0.85", style="rounded"];
 	50[label = "cnvkit_batch", color = "0.59 0.6 0.85", style="rounded"];
+	52[label = "jumble_run", color = "0.59 0.6 0.85", style="rounded"];
+	53[label = "jumble_cnvkit_vcf", color = "0.17 0.6 0.85", style="rounded"];
 	51[label = "bcftools_filter_exclude_region", color = "0.37 0.6 0.85", style="rounded"];
 	56[label = "gatk_to_vcf", color = "0.18 0.6 0.85", style="rounded"];
 	57[label = "gatk_model_segments", color = "0.63 0.6 0.85", style="rounded"];
@@ -25,6 +27,10 @@ digraph snakemake_dag {
 	90[label = "merge_json", color = "0.54 0.6 0.85", style="rounded"];
 	91[label = "cnv_json", color = "0.63 0.6 0.85", style="rounded"];
 	95[label = "cnvkit_scatter", color = "0.66 0.6 0.85", style="rounded"];
+	200 -> 52
+	52 -> 53
+	53 -> 46
+	202 -> 53
 	200 -> 42
 	41 -> 40
 	42 -> 41

@@ -83,6 +83,7 @@ The following reference files, panel of normals and design files are needed to r
 |_ _| <div id="fuseq_wes_transcript_black_list">transcript black list</div> | `fuseq_wes_transcript_black_list.txt` |
 | <div id="hotspot_file">hotspot_annotation</div> | hotspots | `Hotspots_combined_regions_nodups.csv` |
 | <div id="hotspot_report">hotspot_report</div> | hotspot_mutations | `Hotspots_combined_regions_nodups.csv` |
+| <div id="jumble_run">jumble_run</div> | normal_reference | `jumble.combined.filtered.50.PoN.hg19.RDS` |
 | <div id="manta_design_bed">manta_config_t</div> | extra | `pool1_pool2.sort.merged.padded20.cnv200.hg19.split_fusion_genes.210608.bed.gz` |
 | <div id="msisensor_pro_pon">msisensor_pro</div> | PoN | `Msisensor_pro_reference_nextseq_36.list_baseline` |
 | <div id="purecn_estimation_mapping_pon">purecn</div> | extra | `mapping_bias_nextseq_27_hg19.rds` |
@@ -161,22 +162,22 @@ wget https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/GRCh37_gencode_v1
 
 Files needed by Twist Solid that are generated by the reference pipeline are listed below. Many of the panels of normals are available to download from the Uppsala Owncloud solution but should preferably be generated from in-house data.  
 <br />
-**Note**  
-The input needed to create these files depend on the result files of the main pipeline itself, meaning that the samples need to be analysed with dummy panel of normals first.  
-<br />
 **Result files**
 
-* `cnvkit.PoN.cnn`
-* `gatk_cnv_panel_of_normal.hdf5`
-* `Msisensor_pro_reference.list_baseline`
-* `background_panel.tsv`
-* `artifact_panel.tsv`
-* `svdb_cnv.vcf`
-* `normalDB_hg19.rds`
-* `mapping_bias_nextseq_27_hg19.rds`
+* `result/cnvkit.PoN.cnn`
+* `result/gatk_cnv_panel_of_normal.hdf5`
+* `result/design.preprocessed.interval_list`
+* `result/jumble.PoN.RDS`
+* `result/Msisensor_pro_reference.list_baseline`
+* `result/background_panel.tsv`
+* `result/artifact_panel.tsv`
+* `result/svdb_cnv.vcf`
+* `result/mapping_bias.rds`
+* `result/purecn_normal_db.rds`
+* `result/purecn_targets_intervals.txt`
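
After a run of the reference pipeline it can be useful to confirm that every listed file was produced. The sketch below checks the list above against a working directory. The function name `missing_outputs` is hypothetical and the script is not part of the pipeline; if some outputs were deliberately commented out in `workflow/rules/common_references.smk`, the expected list should be trimmed to match.

```python
from pathlib import Path
from typing import List, Sequence

# Result files of the reference pipeline, copied from the list above.
EXPECTED_FILES = [
    "result/cnvkit.PoN.cnn",
    "result/gatk_cnv_panel_of_normal.hdf5",
    "result/design.preprocessed.interval_list",
    "result/jumble.PoN.RDS",
    "result/Msisensor_pro_reference.list_baseline",
    "result/background_panel.tsv",
    "result/artifact_panel.tsv",
    "result/svdb_cnv.vcf",
    "result/mapping_bias.rds",
    "result/purecn_normal_db.rds",
    "result/purecn_targets_intervals.txt",
]


def missing_outputs(
    base_dir: str, expected: Sequence[str] = EXPECTED_FILES
) -> List[str]:
    """Return the expected files that do not exist below base_dir."""
    base = Path(base_dir)
    return [name for name in expected if not (base / name).is_file()]


if __name__ == "__main__":
    for name in missing_outputs("."):
        print(f"missing: {name}")
```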
 
 ### Create samples and units
-Files and samples used in the generation of the panel of normals are specified in samples.tsv and units.tsv. Required files and header names are listed down below.
+Files and samples used in the generation of the panel of normals are specified in samples.tsv and units.tsv. Required files are listed below.
 Adapt the output file specification (`workflow/rules/common_references.smk`) and comment out files that should not be generated.
 
 ### Run command
@@ -186,16 +187,10 @@ snakemake --profile profiles/uppsala_ref/ -s workflow/Snakefile_references.smk
 ```
 <br />
 **Note**  
-The `units.tsv` file needs to be adapted depending which panel of normals are created (see below) and should contain all the samples needed to create the panel of normals.
+The `units.tsv` file needs to be adapted depending on which panels of normals are created and should contain all the samples needed to create them.
 
 ### CNVkit
 
-**units.tsv**
-
-| Header | Data | Description |
-|-|-|-|
-| bam | `bam_dna/{sample}_{type}.bam` | Merged bam files created as output of the Twist Solid pipeline from normal FFPE samples |
-
 **Reference files**
 
 * design bedfile
@@ -204,12 +199,6 @@ The `units.tsv` file needs to be adapted depending which panel of normals are cr
 
 ### GATK CNV
 
-**units.tsv**
-
-| Header | Data | Description |
-|-|-|-|
-| bam | `bam_dna/{sample}_{type}.bam` | Merged bam files created as output of the Twist Solid pipeline from normal FFPE samples |
-
 **Reference files**
 
 * design bedfile
@@ -218,12 +207,6 @@ The `units.tsv` file needs to be adapted depending which panel of normals are cr
 
 ### MSISensor-pro
 
-**units.tsv**
-
-| Header | Data | Description |
-|-|-|-|
-| bam | `bam_dna/{sample}_{type}.bam` | Merged bam files created as output of the Twist Solid pipeline from normal FFPE samples |
-
 **Reference files**
 
 * design bedfile
@@ -237,11 +220,7 @@ The `units.tsv` file needs to be adapted depending which panel of normals are cr
 
 ### SVDB
 
-**units.tsv**
-
-| Header | Data | Description |
-|-|-|-|
-| cnv_vcf | `results/dna/additional_files/cnv/{sample}_{type}/{sample}_{type}.pathology_purecn.svdb_query.vcf` | SVDB merged CNV vcf files created as output of the <br />Twist Solid pipeline from both normal and <br />tumor FFPE samples |
+The SVDB panel should be made up of both normal and tumor FFPE samples.
 
 **Software settings**
 
@@ -251,19 +230,11 @@ The `units.tsv` file needs to be adapted depending which panel of normals are cr
 
 ### Artifacts
 
-**units.tsv**
-
-| Header | Data | Description |
-|-|-|-|
-| vcf | `results/dna/additional_files/vcf/{sample}_{type}.annotated.vcf.gz` | Unfiltered and merged vcf files created as output of the Twist Solid pipeline <br />from normal FFPE samples |
+The artifact panel is based on unfiltered and merged vcf files from normal FFPE samples.
 
 ### Background
 
-**units.tsv**
-
-| Header | Data | Description |
-|-|-|-|
-| gvcf | `gvcf_dna/{sample}_{type}.mosdepth.g.vcf.gz` | Genome vcf files from Mutect2 created as output of the Twist Solid pipeline <br />from normal FFPE samples |
+The background panel is based on genome vcf files from Mutect2 from normal FFPE samples.
 
 **Software settings**
 
@@ -273,50 +244,8 @@ The `units.tsv` file needs to be adapted depending which panel of normals are cr
 | max_af | 0.015 | Max allele frequency to be included (default: 0.015) |
 
 ### PureCN
-**OBS!** The vcf files used for purecn are not the same as in other steps meaning that currently the refence pipeline will have to be run twice. Also the vcfs used are not a final output file of the pipeline so use --notemp when running.
-
-#### Target interval file
-Target interval file for hg19 with 25000 in target bin size also including of target regions.
-```bash
-singularity docker://hydragenetics/purecn:2.2.0 Rscript $PURECN/IntervalFile.R --fasta ${fasta_ref} --in-file ${design_bed} --out-file ${intervals_file} --export ${optimized_bed} --genome hg19 --average-off-target-width 25000 --off-target
-```
-
-#### PureCN Panel of normal
 
-**units.tsv**
-
-| Header | Data | Description |
-|-|-|-|
-| bam | `bam_dna/{sample}_{type}.bam` | Merged bam files created as output of the Twist Solid pipeline from <br />normal FFPE samples |
-| vcf | `cnv_sv/purecn_modify_vcf/`<br />`{sample}_{type}.normalized.sorted.vep_annotated.filter.snv_hard_filter_purecn`<br />`.bcftools_annotated_purecn.mbq.vcf.gz` | Hard filtered vcf files for purecn created as intermediate output of the Twist Solid pipeline <br />from normal FFPE samples |
-
-**Software settings**
-
-| Options | Value | Description |
-|-|-|-|
-| intervals | Target interval file | File created by the command described above |
+The PureCN panel of normals is made up of bam and vcf files from normal samples.
 
 ## Pipeline specific files
-These are design files and other pipeline specific files only available to download from out [git](https://github.com/genomic-medicine-sweden/Twist_Solid_pipeline_files) or the Uppsala Owncloud solution.
-
-| File type | File | Description |
-|-|-|-|
-| Design files | `pool1_pool2.sort.merged.padded20.cnv200.hg19.split_fusion_genes`<br />`.reannotated.210608.bed` | Design bed |
-| | `pool1_pool2.sort.merged.padded20.cnv200.hg19.split_fusion_genes`<br />`.MUC6_31_rm.exon_only.reannotated.210608.bed` | Design bed file containing only exons |
-| | `pool1_pool2.sort.merged.padded20.cnv200.hg19.split_fusion_genes`<br />`.MUC6_31_rm.exon_only.reannotated.210608.interval_list` | Design interval file containing only exons |
-| | `pool1_pool2_nochr_3c.sort.merged.padded20.cnv400.hg19.210311.met`<br />`.annotated.bed.preprocessed.interval_list` | Design interval file used in GATK CNV PoN creation |
-| | `Twist_RNA_Design5.annotated.bed` | RNA design bed |
-| | `Twist_RNA_Design5.annotated.interval_list` | RNA design interval file |
-|_ _| `ID_SNPs.bed` | List of RNA ID SNPs |
-| Hotspots | `Hotspots_combined_regions_nodups.csv` | Positions, transcript information, etc on clinically relevant regions |
-| GeneFuse | `GMS560_fusion_w_pool2.hg19.221117.csv` | Genes and its exonic positions included in fusion calling |
-| | `filter_fusions_20221114.csv` | Filtering criteria for false positive prone fusion partners |
-| FuSeq_WES | `fuseq_params.txt` | Filtering parameters used by FuSeq_WES |
-| FuSeq_WES_report | `fuseq_wes_gene_white_list.txt` | Gene list for filtering of fusion |
-| | `false_positive_fusion_pairs.txt` | Gene list for filtering of fusion |
-|_ _| `fuseq_wes_transcript_black_list.txt` | Transcripts that should not be used in annotation |
-| CNVkit | `cnvkit_germline_blacklist_20221221.bed` | List of regions excluded from the germline vcf file |
-| GATK CNV | `gnomad_SNP_0.001_target.annotated.interval_list` | Bed file with CNV backbone SNPs which are selected from <br />GnomAD with over 0.1% global population frequency |
-| Small CNV deletions | `cnv_deletion_genes.tsv` | File defining gene and its surrounding regions used for <br />small CNV deletion. Same deletion genes as used in the <br />CNV deletion reports |
-| Small CNV amplifications | `cnv_amplification_genes.tsv` | File defining gene and its surrounding regions used for <br />small CNV amplification. Same amplification genes as used in the <br />CNV reports |
-| Report RNA fusions | `Twist_RNA_fusionpartners.bed` | Bed file used for annotation of fusion partner exons |
+Premade panels of normals, design files and references can be downloaded using the hydra-genetics tools. Design files can also be downloaded from our [GitHub repo](https://github.com/genomic-medicine-sweden/Twist_Solid_pipeline_files). Check the [config yaml files](https://github.com/genomic-medicine-sweden/Twist_Solid/tree/develop/config) in the Twist_Solid repo for the latest files.
@@ -8,6 +8,6 @@ Exon skipping cannot be called by the fusion callers as they only search for fus
 
 ## Result file
 
-* `results/rna/fusion/{sample}_{type}.exon_skipping.tsv`
+* `results/rna/{sample}_{type}/fusion/{sample}_{type}.exon_skipping.tsv`
 
 <br />