Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: stratification of VAFs #68

Merged
merged 41 commits into from
May 17, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
929a31d
feat: stratification of VAFs
johanneskoester Mar 26, 2024
b4727ea
Merge branch 'main' into feat/vaf-stratification
johanneskoester Mar 28, 2024
6895c24
feat: add vaf-field information
famosab Apr 12, 2024
77ee639
feat: get vaf-field param
famosab Apr 12, 2024
4240db7
feat: first steps towards VAF stratification
famosab Apr 12, 2024
810bc83
fix: snakefmt
famosab Apr 12, 2024
fbcce93
Merge branch 'main' into feat/vaf-stratification
famosab Apr 18, 2024
adef3aa
feat: add VAF fields
famosab Apr 19, 2024
ba973b8
fix: vaf field name
famosab Apr 19, 2024
490238e
feat: hand truth to calc_precision_recall
famosab Apr 19, 2024
da400fd
wip: further implementation vaf-stratification
famosab Apr 19, 2024
44d760c
feat: hand original query for VAFs
famosab Apr 19, 2024
0b8acf8
wip
famosab Apr 19, 2024
ba3be20
feat: get vaf status
famosab Apr 22, 2024
4413a27
feat: hand vaf status
famosab Apr 22, 2024
6c176c0
feat: change tables for vaf startification
famosab Apr 22, 2024
a7ec5fe
fix: change datavzrd config
famosab Apr 22, 2024
c76d4ee
fix: only get vaf if asked
famosab Apr 22, 2024
c702f74
fix: vaf variable
famosab Apr 22, 2024
6774c63
fix: colors for vaf
famosab Apr 22, 2024
7ad1caa
fix: add vaf-field to test
famosab Apr 22, 2024
b21beeb
Merge branch 'main' into feat/vaf-stratification
famosab Apr 22, 2024
bcc4019
fix: add truth and query index to rule calc_precision_recall
BiancaStoecker Apr 23, 2024
84b3422
fix: fix acces to AF for FORMAT fields
BiancaStoecker Apr 23, 2024
1785734
fix: fix error in counting for variants not stratified by vaf
BiancaStoecker Apr 24, 2024
5db7efb
fix: removed repetive code
BiancaStoecker Apr 24, 2024
d32535b
fix: modify description in report
BiancaStoecker Apr 24, 2024
8123d57
feat: display VAF labels as range instead of floats
BiancaStoecker May 3, 2024
b3aed2a
Fix: update version of datavzrd wrapper
BiancaStoecker May 3, 2024
161b51b
Merge
BiancaStoecker May 3, 2024
279f943
Merge branch 'main' into feat/vaf-stratification
BiancaStoecker May 6, 2024
0393abe
fix: added missing case for rename-contigs
BiancaStoecker May 6, 2024
4b6b425
fix: stratify by VAF if benchmarks supports it and at least one callset.
BiancaStoecker May 6, 2024
8cb88e7
fix: fix merge from before
BiancaStoecker May 6, 2024
15d1ac3
fix: count fp, tp, fn in int arrays, not float
BiancaStoecker May 10, 2024
ca3f49b
fix: fix error in binning vafs
BiancaStoecker May 10, 2024
fb3ee64
fix: vaf as tuple can also happen in INFO field
BiancaStoecker May 10, 2024
fefde3a
Fix: Picard LiftoverVcf can't handle bcf files
BiancaStoecker May 13, 2024
d036fc0
Fix: Liftover: Put CreateSequenceDictionary in separate rule and limi…
BiancaStoecker May 13, 2024
3d8276f
Fix: fmt
BiancaStoecker May 13, 2024
924791f
fix: extend .gitignore
BiancaStoecker May 14, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -4,4 +4,6 @@ resources/**
.snakemake/**
.test/**
logs
logs/**
logs/**
data/**
report/**
3 changes: 3 additions & 0 deletions .test/config/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,9 @@ variant-calls:
path: "resources/variants/seqc2-somatic/all.truth.format-added.vcf.gz"
benchmark: "seqc2-somatic-ea"
tumor_sample_name: "truth"
vaf-field:
- INFO
- TVAF

custom-benchmarks:
my-custom-benchmark: # custom benchmark name, choose freely (no whitespace allowed)
Expand Down
3 changes: 3 additions & 0 deletions config/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,9 @@ variant-calls:
# Uncomment and point to file containing contig name replacements
# as needed for 'bcftools annotate --rename-chrs'
# rename-contigs: path/to/rename-contigs.txt
vaf-field: # needs to be checked with bcftools view -h
- FORMAT
- AF
# Uncomment if callset was produced using grch37 as reference, then a liftover to grch38 will be performed
# grch37: true

Expand Down
158 changes: 112 additions & 46 deletions workflow/resources/datavzrd/precision-recall-config.yte.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -22,9 +22,9 @@ views:
results-plot:
dataset: results
desc: |
Precision and recall are calculated by matching variants between each callset
Precision and recall are calculated by matching variants between each callset
and truth, stratified by coverage categories. The matching ignores genotype
differences. Instead, genotype mismatches are displayed in the "genotype mismatch rate"
differences. Instead, genotype mismatches are displayed in the "genotype mismatch rate"
column of the tabular result representation or in the tooltips shown when hovering
a point.

Expand Down Expand Up @@ -69,47 +69,113 @@ views:

results-table:
dataset: results
desc: |
Precision and recall are calculated by matching variants between each callset
and truth, stratified by coverage categories. The matching ignores genotype
differences. Instead, genotype mismatches are displayed in the "genotype mismatch rate"
column.
page-size: 12
render-table:
columns:
callset:
plot:
heatmap:
scale: ordinal
color-scheme: category20
precision:
precision: 3
plot:
ticks:
scale: linear
recall:
precision: 3
plot:
ticks:
scale: linear
genotype_mismatch_rate:
?if params.somatic:
display-mode: hidden
?else:
display-mode: normal
precision: 3
plot:
ticks:
scale: linear
coverage:
plot:
heatmap:
scale: ordinal
domain:
- low
- medium
- high
range:
- "#c6dbef"
- "#9ecae1"
- "#6baed6"
?if params.vaf:
desc: |
Precision and recall are calculated by matching variants between each callset
BiancaStoecker marked this conversation as resolved.
Show resolved Hide resolved
and truth, stratified by coverage categories. Stratified by VAF.
page-size: 12
render-table:
columns:
callset:
plot:
heatmap:
scale: ordinal
color-scheme: category20
precision:
precision: 3
plot:
ticks:
scale: linear
recall:
precision: 3
plot:
ticks:
scale: linear
coverage:
plot:
heatmap:
scale: ordinal
domain:
- low
- medium
- high
range:
- "#c6dbef"
- "#9ecae1"
- "#6baed6"
vaf:
plot:
heatmap:
scale: linear
custom-content:
function(value, row) {
let lower = value - 0.1;
return `${lower.toFixed(1)}..${parseFloat(value).toFixed(1)}`
}
domain:
- 0.1
- 0.2
- 0.3
- 0.4
- 0.5
- 0.6
- 0.7
- 0.8
- 0.9
- 1.0
range:
- "#c0e6baff"
- "#abdda5ff"
- "#94d391ff"
- "#7bc77dff"
- "#60ba6cff"
- "#46ab5eff"
- "#329a51ff"
- "#208943ff"
- "#0e7735ff"
- "#1a833fff"
?else:
desc: |
Precision and recall are calculated by matching variants between each callset
and truth, stratified by coverage categories. The matching ignores genotype
differences. Instead, genotype mismatches are displayed in the "genotype mismatch rate"
column.
page-size: 12
render-table:
columns:
callset:
plot:
heatmap:
scale: ordinal
color-scheme: category20
precision:
precision: 3
plot:
ticks:
scale: linear
recall:
precision: 3
plot:
ticks:
scale: linear
genotype_mismatch_rate:
?if params.somatic:
display-mode: hidden
?else:
display-mode: normal
precision: 3
plot:
ticks:
scale: linear
coverage:
plot:
heatmap:
scale: ordinal
domain:
- low
- medium
- high
range:
- "#c6dbef"
- "#9ecae1"
- "#6baed6"
3 changes: 3 additions & 0 deletions workflow/resources/presets.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,9 @@ benchmarks:
bam-url: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/data/WES/WES_EA_T_1.bwa.dedup.bam
target-regions: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/technical/reference_genome/Exome_Target_bed/S07604624_Covered_human_all_v6_plus_UTR.liftover.to.hg38.bed6.gz
grch37: false
vaf-field:
- INFO # either FORMAT or INFO
- TVAF # name of tumor variant allele frequency

imgag-somatic-5perc:
genome: na12878-somatic
Expand Down
33 changes: 30 additions & 3 deletions workflow/rules/common.smk
Original file line number Diff line number Diff line change
Expand Up @@ -83,7 +83,7 @@ def get_plot_cov_labels():
def label(name):
lower, upper = get_cov_interval(name)
if upper:
return f"{lower}-{upper - 1}"
return f"{lower}-{upper-1}"
return f"≥{lower}"

return {name: label(name) for name in coverages}
Expand Down Expand Up @@ -178,7 +178,7 @@ def get_callset(wildcards):
if get_somatic_status(wildcards):
return "results/normalized-variants/{callset}.gt-added.vcf.gz"
elif "rename-contigs" in callset:
return "results/normalized-variants/{callset}.replaced-contigs.bcf"
return "results/normalized-variants/{callset}.replaced-contigs.vcf.gz"
elif "grch37" in callset:
return "results/normalized-variants/{callset}.lifted.vcf.gz"
else:
Expand All @@ -188,7 +188,7 @@ def get_callset(wildcards):
def get_callset_correct_contigs(wildcards):
callset = config["variant-calls"][wildcards.callset]
if "rename-contigs" in callset:
return "results/normalized-variants/{callset}.replaced-contigs.bcf"
return "results/normalized-variants/{callset}.replaced-contigs.vcf.gz"
elif "grch37" in callset:
return "results/normalized-variants/{callset}.lifted.vcf.gz"
else:
Expand All @@ -199,6 +199,8 @@ def get_callset_correct_contigs_liftover(wildcards):
callset = config["variant-calls"][wildcards.callset]
if "grch37" in callset:
return "results/normalized-variants/{callset}.lifted.vcf.gz"
elif "rename-contigs" in callset:
return "results/normalized-variants/{callset}.replaced-contigs.vcf.gz"
else:
return get_raw_callset(wildcards)

Expand Down Expand Up @@ -390,6 +392,31 @@ def get_somatic_flag(wildcards):
return somatic_flag


def get_vaf_fields(wildcards):
vaf_callset = config["variant-calls"][wildcards.callset].get("vaf-field")

benchmark = config["variant-calls"][wildcards.callset]["benchmark"]
vaf_benchmark = benchmarks[benchmark].get("vaf-field")

# can return (None, None) if param not set
return (vaf_callset, vaf_benchmark)


def get_vaf_status(wildcards):
vaf_benchmark = benchmarks[wildcards.benchmark].get("vaf-field")
if vaf_benchmark is None:
return False
else:
callsets = get_benchmark_callsets(wildcards.benchmark)
vaf_callsets = [
config["variant-calls"][callset].get("vaf-field") for callset in callsets
]
if any(vaf_callset is not None for vaf_callset in vaf_callsets):
return True
else:
return False


def get_collect_stratifications_input(wildcards):
import json

Expand Down
31 changes: 26 additions & 5 deletions workflow/rules/eval.smk
Original file line number Diff line number Diff line change
@@ -1,25 +1,40 @@
rule get_reference_dict:
input:
reference="resources/reference/genome.fasta",
output:
"resources/reference/genome.fasta.dict",
log:
"logs/get-reference-dict.log",
conda:
"../envs/picard.yaml"
shell:
"picard CreateSequenceDictionary -R {input.reference} -O {input.reference}.dict &> {log}"


rule liftover_callset:
input:
callset=get_callset_correct_contigs,
liftover_chain="resources/liftover/GRCh37_to_GRCh38.chain.gz",
reference="resources/reference/genome.fasta",
reference_dict="resources/reference/genome.fasta.dict",
output:
"results/normalized-variants/{callset}.lifted.vcf.gz",
log:
"logs/liftover_callset/{callset}.log",
conda:
"../envs/picard.yaml"
resources:
mem_mb=64000,
shell:
"picard CreateSequenceDictionary -R {input.reference} -O {input.reference}.dict"
"picard LiftoverVcf -I {input.callset} -O {output} --CHAIN {input.liftover_chain} --REJECT {output}_rejected_variants.vcf -R {input.reference} &> {log}"
"picard LiftoverVcf -Xmx64g --MAX_RECORDS_IN_RAM 100000 -I {input.callset} -O {output} --CHAIN {input.liftover_chain} --REJECT {output}_rejected_variants.vcf -R {input.reference} &> {log}"


rule rename_contigs:
input:
calls=get_raw_callset,
repl_file=get_rename_contig_file,
output:
"results/normalized-variants/{callset}.replaced-contigs.bcf",
"results/normalized-variants/{callset}.replaced-contigs.vcf.gz",
log:
"logs/rename-contigs/{callset}.log",
conda:
Expand Down Expand Up @@ -195,12 +210,16 @@ rule calc_precision_recall:
calls="results/vcfeval/{callset}/{cov}/output.vcf.gz",
idx="results/vcfeval/{callset}/{cov}/output.vcf.gz.tbi",
common_src=common_src,
truth=get_stratified_truth(),
truth_idx=get_stratified_truth(".tbi"),
query="results/stratified-variants/{callset}/{cov}.vcf.gz",
query_index="results/stratified-variants/{callset}/{cov}.vcf.gz.tbi",
output:
snvs="results/precision-recall/callsets/{callset}/{cov}.{vartype}.tsv",
log:
"logs/calc-precision-recall/{callset}/{cov}/{vartype}.log",
# params:
# vaf_fields=get_vaf_fields,
params:
vaf_fields=get_vaf_fields,
conda:
"../envs/pysam.yaml"
script:
Expand Down Expand Up @@ -234,6 +253,7 @@ rule collect_precision_recall:
params:
callsets=lambda w: get_benchmark_callsets(w.benchmark),
labels=get_collect_precision_recall_labels,
vaf=get_vaf_status,
log:
"logs/collect-precision-recall/{benchmark}/{vartype}.log",
conda:
Expand All @@ -259,6 +279,7 @@ rule report_precision_recall:
"logs/datavzrd/precision-recall/{benchmark}/{vartype}.log",
params:
somatic=get_somatic_status,
vaf=get_vaf_status,
wrapper:
"v3.10.1/utils/datavzrd"

Expand Down
Loading
Loading