docs: update documentaion #507

Merged
merged 1 commit into from
Oct 17, 2024
16 changes: 8 additions & 8 deletions docs/dna_biomarkers.md
Expand Up @@ -6,10 +6,10 @@ See the [biomarkers hydra-genetics module](https://hydra-genetics-biomarker.read

## Pipeline output files:

* `results/dna/tmb/{sample}_{type}.TMB.txt`
* `results/dna/msi/{sample}_{type}.msisensor_pro.score.tsv`
* `results/dna/hrd/{sample}_{type}.purecn.scarhrd_cnvkit_score.txt`
* `results/dna/hrd/{sample}_{type}.pathology.scarhrd_cnvkit_score.txt`
* `results/dna/{sample}_{type}/tmb/{sample}_{type}.TMB.txt`
* `results/dna/{sample}_{type}/msi/{sample}_{type}.msisensor_pro.score.tsv`
* `results/dna/{sample}_{type}/hrd/{sample}_{type}.purecn.scarhrd_cnvkit_score.txt`
* `results/dna/{sample}_{type}/hrd/{sample}_{type}.pathology.scarhrd_cnvkit_score.txt`
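The updated paths above place each result under a per-sample directory (`{sample}_{type}`) before the analysis category. A minimal sketch of how such a path could be assembled is shown here; the function name `result_path` and its parameters are hypothetical and not part of the pipeline.

```python
from pathlib import PurePosixPath


def result_path(sample: str, unit_type: str, category: str, suffix: str,
                analysis: str = "dna") -> str:
    """Build a result path in the per-sample layout (hypothetical helper).

    Layout: results/<analysis>/<sample>_<type>/<category>/<sample>_<type>.<suffix>
    """
    sample_id = f"{sample}_{unit_type}"
    return str(
        PurePosixPath("results", analysis, sample_id, category,
                      f"{sample_id}.{suffix}")
    )


# Example with made-up sample and type names
print(result_path("sample1", "T", "tmb", "TMB.txt"))
```

With the made-up names above, the helper yields a path of the same shape as the TMB entry in the list.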

## Tumor mutational burden (TMB)
TMB is a measure of the frequency of somatic mutations and is usually reported as mutations per megabase. The exonic design size is approximately 1.55 Mb. However, by validating the TMB for GMS560 against Foundation One and TSO500 TMB, the effective design size is adjusted to 1.19 Mb. This is based on the slope (0.84) of the correlation between TSO500 data and the number of variants in the TMB analysis. The TMB is calculated using the in-house script **[tmb.py](https://github.com/hydra-genetics/biomarker/blob/develop/workflow/scripts/tmb.py)** ([rule](https://github.com/hydra-genetics/biomarker/blob/develop/workflow/rules/tmb.smk)), which counts the number of nsSNVs and divides by the adjusted design size. Variants must fulfill the following criteria to be counted:
Expand All @@ -36,7 +36,7 @@ The result is the TMB calculated using nsSNVs. However, the variants passing all

### Result file

* `results/dna/tmb/{sample}_{type}.TMB.txt`
* `results/dna/{sample}_{type}/tmb/{sample}_{type}.TMB.txt`
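As a rough illustration of the final step described in the TMB section (the count of nsSNVs divided by the adjusted design size), the sketch below shows only the arithmetic. It is not the `tmb.py` script: the function name and signature are assumptions, and the variant filtering is not reproduced.

```python
# Adjusted effective design size in megabases, as stated in the documentation.
ADJUSTED_DESIGN_SIZE_MB = 1.19


def tmb_per_megabase(nssnv_count: int,
                     design_size_mb: float = ADJUSTED_DESIGN_SIZE_MB) -> float:
    """Return mutations per megabase for an already filtered nsSNV count.

    Hypothetical helper: the real script also applies the filtering criteria.
    """
    if nssnv_count < 0:
        raise ValueError("nssnv_count must be non-negative")
    if design_size_mb <= 0:
        raise ValueError("design_size_mb must be positive")
    return nssnv_count / design_size_mb


# Example: 12 variants passing all filters (made-up number)
print(round(tmb_per_megabase(12), 2))
```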

## Microsatellite instability (MSI)
To determine the MSS or MSI status of a sample, the percentage of sites with microsatellite instability is calculated using **[MSIsensor-pro](https://github.com/xjtu-omics/msisensor-pro)** v1.1.a. When more than 10% of the sites are unstable, the sample is determined to have MSI status. The program uses a panel of normals to determine the normal level of instability at the sites used.
Expand All @@ -48,7 +48,7 @@ To determine MSS or MSI status of the samples the percentage of sites that have

### Result file

* `results/dna/msi/{sample}_{type}.msisensor_pro.score.tsv`
* `results/dna/{sample}_{type}/msi/{sample}_{type}.msisensor_pro.score.tsv`
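The 10% rule from the MSI section can be written as a small classification step. This is a sketch under the assumption that the number of unstable sites and the total number of sites are already known; the function name is hypothetical and parsing of the MSIsensor-pro output file is not shown.

```python
# Threshold from the documentation: more than 10% unstable sites means MSI.
MSI_THRESHOLD_PERCENT = 10.0


def msi_status(unstable_sites: int, total_sites: int) -> str:
    """Classify a sample as "MSI" or "MSS" from site counts (hypothetical helper)."""
    if total_sites <= 0:
        raise ValueError("total_sites must be positive")
    if not 0 <= unstable_sites <= total_sites:
        raise ValueError("unstable_sites must be between 0 and total_sites")
    percent_unstable = 100.0 * unstable_sites / total_sites
    # Strictly greater than the threshold, matching "more than 10%".
    return "MSI" if percent_unstable > MSI_THRESHOLD_PERCENT else "MSS"


print(msi_status(unstable_sites=25, total_sites=100))
```

The comparison is strict, so a sample at exactly 10% is reported as MSS in this sketch.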

## Homologous recombination deficiency (HRD) - in development
**OBS! The Homologous recombination deficiency score is still under development**
Expand All @@ -69,7 +69,7 @@ A homologous recombination deficiency score is calculated using **[scarHRD](http

### Result files

* `results/dna/hrd/{sample}_{type}.purecn.scarhrd_cnvkit_score.txt`
* `results/dna/hrd/{sample}_{type}.pathology.scarhrd_cnvkit_score.txt`
* `results/dna/{sample}_{type}/hrd/{sample}_{type}.purecn.scarhrd_cnvkit_score.txt`
* `results/dna/{sample}_{type}/hrd/{sample}_{type}.pathology.scarhrd_cnvkit_score.txt`

<br />
103 changes: 70 additions & 33 deletions docs/dna_cnvs.md

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions docs/dna_fusions.md
Expand Up @@ -6,8 +6,8 @@ See the [fusions hydra-genetics module](https://hydra-genetics-fusions.readthedo

## Pipeline output files:

* `results/dna/fusion/{sample}_{type}.gene_fuse_report.tsv` (with UMI option only)
* `results/dna/fusion/{sample}_{type}.fuseq_wes.report.csv`
* `results/dna/{sample}_{type}/fusion/{sample}_{type}.gene_fuse_report.tsv` (with UMI option only)
* `results/dna/{sample}_{type}/fusion/{sample}_{type}.fuseq_wes.report.csv`

## Fusions calling using GeneFuse
DNA fusion calling is performed by **[GeneFuse](https://github.com/OpenGene/GeneFuse)** v0.6.1 on fastq-files. It uses a gene transcript target file to limit the number of targets to analyze.
Expand Down Expand Up @@ -49,7 +49,7 @@ The output from GeneFuse is filtered and then reported into a fusion report usin

### Result file

* `results/dna/fusion/{sample}_{type}.gene_fuse_report.tsv`
* `results/dna/{sample}_{type}/fusion/{sample}_{type}.gene_fuse_report.tsv`

<br />

Expand Down Expand Up @@ -98,6 +98,6 @@ The output from FuSeq_WES is filtered and then reported into a fusion report usi

### Result file

* `results/dna/fusion/{sample}_{type}.fuseq_wes_report.tsv`
* `results/dna/{sample}_{type}/fusion/{sample}_{type}.fuseq_wes_report.tsv`

<br />
4 changes: 2 additions & 2 deletions docs/dna_qc.md
Expand Up @@ -7,7 +7,7 @@ See the [qc hydra-genetics module](https://hydra-genetics-qc.readthedocs.io/en/l
## Pipeline output files:

* `results/dna/qc/multiqc_DNA.html`
* `results/dna/qc/{sample}_{type}.coverage_and_mutations.tsv`
* `results/dna/{sample}_{type}/qc/{sample}_{type}.coverage_and_mutations.tsv`
* `gvcf_dna/{sample}_{type}.mosdepth.g.vcf.gz`

## MultiQC
Expand Down Expand Up @@ -107,6 +107,6 @@ levels:

### Result file

* `results/dna/qc/{sample}_{type}.coverage_and_mutations.tsv`
* `results/dna/{sample}_{type}/qc/{sample}_{type}.coverage_and_mutations.tsv`

<br />
8 changes: 4 additions & 4 deletions docs/dna_snv_indels.md
Expand Up @@ -6,8 +6,8 @@ See the [snv_indels hydra-genetics module](https://hydra-genetics-snv-indels.rea

## Pipeline output files:

* `results/dna/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.vcf`
* `results/dna/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.qci.vcf`
* `results/dna/{sample}_{type}/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.vcf`
* `results/dna/{sample}_{type}/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.qci.vcf`
* `bam_dna/mutect2_indel_bam/{sample}_{type}.bam`


Expand Down Expand Up @@ -203,14 +203,14 @@ Two or more variants affecting the same codon can have different clinical implic

### Result file

* `results/dna/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.vcf`
* `results/dna/{sample}_{type}/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.vcf`

## QCI AF correction of vcf
The clinical interpretation tool QCI calculates allele frequency from the AD FORMAT field instead of using the AF FORMAT field supplied by the callers. This has been shown to give incorrect values, especially for INDELs. The AD field is therefore corrected so that the allele frequency based on the AD field corresponds to the AF field. This correction of the vcf file is performed by the in-house script [fix_vcf_ad_for_qci.py](https://github.com/genomic-medicine-sweden/Twist_Solid/blob/develop/workflow/scripts/fix_vcf_ad_for_qci.py) ([rule and config](softwares.md#fix_vcf_ad_for_qci)).

### Result file

* `results/dna/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.qci.vcf`
* `results/dna/{sample}_{type}/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.qci.vcf`
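To illustrate the idea behind the AD correction, the sketch below recomputes reference and alternate allele depths from a given AF and total depth so that `alt / (ref + alt)` matches AF as closely as integer counts allow. This is an assumption about one possible approach, not the logic of `fix_vcf_ad_for_qci.py`; the function name and the use of total depth as input are hypothetical.

```python
from typing import Tuple


def ad_from_af(af: float, depth: int) -> Tuple[int, int]:
    """Return (ref_depth, alt_depth) consistent with the given AF.

    Hypothetical helper: the real script may derive the values differently.
    """
    if not 0.0 <= af <= 1.0:
        raise ValueError("af must be between 0 and 1")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    alt_depth = round(af * depth)   # alternate allele reads implied by AF
    ref_depth = depth - alt_depth   # remaining reads assigned to the reference
    return ref_depth, alt_depth


ref, alt = ad_from_af(af=0.25, depth=200)
print(f"AD={ref},{alt}")
```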

## GATK Mutect2 variant bam file
When **[GATK Mutect2](https://gatk.broadinstitute.org/hc/en-us/articles/360037593851-Mutect2)** finds INDEL candidates, it realigns reads in these regions and outputs a realigned bam-file covering these INDEL regions. This makes it possible to inspect INDELs called by Mutect2 in [IGV](https://software.broadinstitute.org/software/igv/). As Mutect2 runs on individual chromosomes these bam-files are then merged, sorted and indexed before.
6 changes: 6 additions & 0 deletions docs/images/cnvs.dot
Expand Up @@ -16,6 +16,8 @@ digraph snakemake_dag {
48[label = "cnvkit_vcf", color = "0.17 0.6 0.85", style="rounded"];
49[label = "cnvkit_call", color = "0.23 0.6 0.85", style="rounded"];
50[label = "cnvkit_batch", color = "0.59 0.6 0.85", style="rounded"];
52[label = "jumble_run", color = "0.59 0.6 0.85", style="rounded"];
53[label = "jumble_cnvkit_vcf", color = "0.17 0.6 0.85", style="rounded"];
51[label = "bcftools_filter_exclude_region", color = "0.37 0.6 0.85", style="rounded"];
56[label = "gatk_to_vcf", color = "0.18 0.6 0.85", style="rounded"];
57[label = "gatk_model_segments", color = "0.63 0.6 0.85", style="rounded"];
Expand All @@ -25,6 +27,10 @@ digraph snakemake_dag {
90[label = "merge_json", color = "0.54 0.6 0.85", style="rounded"];
91[label = "cnv_json", color = "0.63 0.6 0.85", style="rounded"];
95[label = "cnvkit_scatter", color = "0.66 0.6 0.85", style="rounded"];
200 -> 52
52 -> 53
53 -> 46
202 -> 53
200 -> 42
41 -> 40
42 -> 41
Binary file modified docs/images/cnvs.png
109 changes: 19 additions & 90 deletions docs/references.md
Expand Up @@ -83,6 +83,7 @@ The following reference files, panel of normals and design files are needed to r
|_ _| <div id="fuseq_wes_transcript_black_list">transcript black list</div> | `fuseq_wes_transcript_black_list.txt` |
| <div id="hotspot_file">hotspot_annotation</div> | hotspots | `Hotspots_combined_regions_nodups.csv` |
| <div id="hotspot_report">hotspot_report</div> | hotspot_mutations | `Hotspots_combined_regions_nodups.csv` |
| <div id="jumble_run">jumble_run</div> | normal_reference | `jumble.combined.filtered.50.PoN.hg19.RDS` |
| <div id="manta_design_bed">manta_config_t</div> | extra | `pool1_pool2.sort.merged.padded20.cnv200.hg19.split_fusion_genes.210608.bed.gz` |
| <div id="msisensor_pro_pon">msisensor_pro</div> | PoN | `Msisensor_pro_reference_nextseq_36.list_baseline` |
| <div id="purecn_estimation_mapping_pon">purecn</div> | extra | `mapping_bias_nextseq_27_hg19.rds` |
Expand Down Expand Up @@ -161,22 +162,22 @@ wget https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/GRCh37_gencode_v1

Files needed by Twist Solid that are generated by the reference pipeline are listed below. Many of the panel of normals are available to download from the Uppsala Owncloud solution but should preferably be generated on in-house data.
<br />
**Note**
The input needed to create these files depends on the result files of the main pipeline itself, meaning that the samples need to be analysed with dummy panels of normals first.
<br />
**Result files**

* `cnvkit.PoN.cnn`
* `gatk_cnv_panel_of_normal.hdf5`
* `Msisensor_pro_reference.list_baseline`
* `background_panel.tsv`
* `artifact_panel.tsv`
* `svdb_cnv.vcf`
* `normalDB_hg19.rds`
* `mapping_bias_nextseq_27_hg19.rds`
* `result/cnvkit.PoN.cnn`
* `result/gatk_cnv_panel_of_normal.hdf5`
* `result/design.preprocessed.interval_list`
* `result/jumble.PoN.RDS`
* `result/Msisensor_pro_reference.list_baseline`
* `result/background_panel.tsv`
* `result/artifact_panel.tsv`
* `result/svdb_cnv.vcf`
* `result/mapping_bias.rds`
* `result/purecn_normal_db.rds`
* `result/purecn_targets_intervals.txt`

### Create samples and units
Files and samples used in the generation of the panel of normals are specified in samples.tsv and units.tsv. Required files and header names are listed down below.
Files and samples used in the generation of the panel of normals are specified in samples.tsv and units.tsv. Required files are listed down below.
Adapt the output file specification (`workflow/rules/common_references.smk`) and comment out files that should not be generated.

### Run command
Expand All @@ -186,16 +187,10 @@ snakemake --profile profiles/uppsala_ref/ -s workflow/Snakefile_references.smk
```
<br />
**Note**
The `units.tsv` file needs to be adapted depending on which panels of normals are created (see below) and should contain all the samples needed to create the panels of normals.
The `units.tsv` file needs to be adapted depending on which panels of normals are created and should contain all the samples needed to create the panels of normals.

### CNVkit

**units.tsv**

| Header | Data | Description |
|-|-|-|
| bam | `bam_dna/{sample}_{type}.bam` | Merged bam files created as output of the Twist Solid pipeline from normal FFPE samples |

**Reference files**

* design bedfile
Expand All @@ -204,12 +199,6 @@ The `units.tsv` file needs to be adapted depending which panel of normals are cr

### GATK CNV

**units.tsv**

| Header | Data | Description |
|-|-|-|
| bam | `bam_dna/{sample}_{type}.bam` | Merged bam files created as output of the Twist Solid pipeline from normal FFPE samples |

**Reference files**

* design bedfile
Expand All @@ -218,12 +207,6 @@ The `units.tsv` file needs to be adapted depending which panel of normals are cr

### MSISensor-pro

**units.tsv**

| Header | Data | Description |
|-|-|-|
| bam | `bam_dna/{sample}_{type}.bam` | Merged bam files created as output of the Twist Solid pipeline from normal FFPE samples |

**Reference files**

* design bedfile
Expand All @@ -237,11 +220,7 @@ The `units.tsv` file needs to be adapted depending which panel of normals are cr

### SVDB

**units.tsv**

| Header | Data | Description |
|-|-|-|
| cnv_vcf | `results/dna/additional_files/cnv/{sample}_{type}/{sample}_{type}.pathology_purecn.svdb_query.vcf` | SVDB merged CNV vcf files created as output of the <br />Twist Solid pipeline from both normal and <br />tumor FFPE samples |
Should be made up of both normal and tumor FFPE samples!

**Software settings**

Expand All @@ -251,19 +230,11 @@ The `units.tsv` file needs to be adapted depending which panel of normals are cr

### Artifacts

**units.tsv**

| Header | Data | Description |
|-|-|-|
| vcf | `results/dna/additional_files/vcf/{sample}_{type}.annotated.vcf.gz` | Unfiltered and merged vcf files created as output of the Twist Solid pipeline <br />from normal FFPE samples |
Based on unfiltered and merged vcf files from normal FFPE samples

### Background

**units.tsv**

| Header | Data | Description |
|-|-|-|
| gvcf | `gvcf_dna/{sample}_{type}.mosdepth.g.vcf.gz` | Genome vcf files from Mutect2 created as output of the Twist Solid pipeline <br />from normal FFPE samples |
Based on genome vcf files from Mutect2 from normal FFPE samples

**Software settings**

Expand All @@ -273,50 +244,8 @@ The `units.tsv` file needs to be adapted depending which panel of normals are cr
| max_af | 0.015 | Max allele frequency to be included (default: 0.015) |

### PureCN
**OBS!** The vcf files used for purecn are not the same as in other steps, meaning that currently the reference pipeline will have to be run twice. Also, the vcfs used are not final output files of the pipeline, so use --notemp when running.

#### Target interval file
Target interval file for hg19 with a target bin size of 25000, also including off-target regions.
```bash
singularity docker://hydragenetics/purecn:2.2.0 Rscript $PURECN/IntervalFile.R --fasta ${fasta_ref} --in-file ${design_bed} --out-file ${intervals_file} --export ${optimized_bed} --genome hg19 --average-off-target-width 25000 --off-target
```

#### PureCN Panel of normal

**units.tsv**

| Header | Data | Description |
|-|-|-|
| bam | `bam_dna/{sample}_{type}.bam` | Merged bam files created as output of the Twist Solid pipeline from <br />normal FFPE samples |
| vcf | `cnv_sv/purecn_modify_vcf/`<br />`{sample}_{type}.normalized.sorted.vep_annotated.filter.snv_hard_filter_purecn`<br />`.bcftools_annotated_purecn.mbq.vcf.gz` | Hard filtered vcf files for purecn created as intermediate output of the Twist Solid pipeline <br />from normal FFPE samples |

**Software settings**

| Options | Value | Description |
|-|-|-|
| intervals | Target interval file | File created by the command described above |
Made up of bam and vcf files from normal samples

## Pipeline specific files
These are design files and other pipeline specific files only available to download from our [git](https://github.com/genomic-medicine-sweden/Twist_Solid_pipeline_files) or the Uppsala Owncloud solution.

| File type | File | Description |
|-|-|-|
| Design files | `pool1_pool2.sort.merged.padded20.cnv200.hg19.split_fusion_genes`<br />`.reannotated.210608.bed` | Design bed |
| | `pool1_pool2.sort.merged.padded20.cnv200.hg19.split_fusion_genes`<br />`.MUC6_31_rm.exon_only.reannotated.210608.bed` | Design bed file containing only exons |
| | `pool1_pool2.sort.merged.padded20.cnv200.hg19.split_fusion_genes`<br />`.MUC6_31_rm.exon_only.reannotated.210608.interval_list` | Design interval file containing only exons |
| | `pool1_pool2_nochr_3c.sort.merged.padded20.cnv400.hg19.210311.met`<br />`.annotated.bed.preprocessed.interval_list` | Design interval file used in GATK CNV PoN creation |
| | `Twist_RNA_Design5.annotated.bed` | RNA design bed |
| | `Twist_RNA_Design5.annotated.interval_list` | RNA design interval file |
|_ _| `ID_SNPs.bed` | List of RNA ID SNPs |
| Hotspots | `Hotspots_combined_regions_nodups.csv` | Positions, transcript information, etc on clinically relevant regions |
| GeneFuse | `GMS560_fusion_w_pool2.hg19.221117.csv` | Genes and its exonic positions included in fusion calling |
| | `filter_fusions_20221114.csv` | Filtering criteria for false positive prone fusion partners |
| FuSeq_WES | `fuseq_params.txt` | Filtering parameters used by FuSeq_WES |
| FuSeq_WES_report | `fuseq_wes_gene_white_list.txt` | Gene list for filtering of fusion |
| | `false_positive_fusion_pairs.txt` | Gene list for filtering of fusion |
|_ _| `fuseq_wes_transcript_black_list.txt` | Transcripts that should not be used in annotation |
| CNVkit | `cnvkit_germline_blacklist_20221221.bed` | List of regions excluded from the germline vcf file |
| GATK CNV | `gnomad_SNP_0.001_target.annotated.interval_list` | Bed file with CNV backbone SNPs which are selected from <br />GnomAD with over 0.1% global population frequency |
| Small CNV deletions | `cnv_deletion_genes.tsv` | File defining gene and its surrounding regions used for <br />small CNV deletion. Same deletion genes as used in the <br />CNV deletion reports |
| Small CNV amplifications | `cnv_amplification_genes.tsv` | File defining gene and its surrounding regions used for <br />small CNV amplification. Same amplification genes as used in the <br />CNV reports |
| Report RNA fusions | `Twist_RNA_fusionpartners.bed` | Bed file used for annotation of fusion partner exons |
Premade panels of normals, design files and references can be downloaded using the hydra-genetics tools. Design files can also be downloaded from our [github repo](https://github.com/genomic-medicine-sweden/Twist_Solid_pipeline_files). Check the [config yaml files](https://github.com/genomic-medicine-sweden/Twist_Solid/tree/develop/config) in the Twist_Solid repo for the latest files.
2 changes: 1 addition & 1 deletion docs/rna_exon_skipping.md
Expand Up @@ -8,6 +8,6 @@ Exon skipping cannot be called by the fusion callers as they only search for fus

## Result file

* `results/rna/fusion/{sample}_{type}.exon_skipping.tsv`
* `results/rna/{sample}_{type}/fusion/{sample}_{type}.exon_skipping.tsv`

<br />