Filter SNVs, INDELs, CNVs and SVs (#496)
* Filter variants

* CHANGELOG

* Update CHANGELOG.md

* Update assets/schema_hgnc_ids.json

* Update subworkflows/local/filter_variants/main.nf

Co-authored-by: Daniel Schmitz <[email protected]>

* merge and review suggestions

* review suggestions

* Update subworkflows/local/filter_variants/main.nf

Co-authored-by: Anders Jemt <[email protected]>

* Review suggestions

---------

Co-authored-by: Daniel Schmitz <[email protected]>
Co-authored-by: Anders Jemt <[email protected]>
3 people authored Nov 19, 2024
1 parent 4817e1f commit e9ff17c
Showing 32 changed files with 958 additions and 38 deletions.
35 changes: 20 additions & 15 deletions CHANGELOG.md
@@ -35,6 +35,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#451](https://github.com/genomic-medicine-sweden/nallo/pull/451) - Added support for running methylation subworkflow without phasing
- [#451](https://github.com/genomic-medicine-sweden/nallo/pull/451) - Added nf-test to methylation
- [#491](https://github.com/genomic-medicine-sweden/nallo/pull/491) - Added a changelog reminder action
- [#496](https://github.com/genomic-medicine-sweden/nallo/pull/496) - Added a subworkflow to filter variants

### `Changed`

@@ -132,6 +133,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
| `--validationSkipDuplicateCheck` | |
| `--validationS3PathCheck` | |
| `--monochromeLogs` | `--monochrome_logs` |
| | `--filter_variants_hgnc_ids` |
| | `--filter_snvs_expression` |
| | `--filter_svs_expression` |
| `--skip_short_variant_calling` | `--skip_snv_calling` |
| `--skip_assembly_wf` | `--skip_genome_assembly` |
| `--skip_mapping_wf` | `--skip_alignment` |
@@ -159,21 +163,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Module updates

| Tool | Old version | New version |
| -------------- | ----------- | ----------- |
| fqcrs | 0.1.0 | |
| severus | | 1.1 |
| longphase | | 1.7.3 |
| genmod | 3.8.2 | 3.9 |
| WhatsHap | 2.2 | 2.3 |
| SVDB | | 2.8.1 |
| hifiasm | 0.19.8 | 0.20.0 |
| HiFiCNV | 0.1.7 | 1.0.0 |
| samtools/faidx | 1.2 | 1.21 |
| samtools/index | 1.2 | 1.21 |
| samtools/merge | 1.2 | 1.21 |
| stranger | 0.9.1 | 0.9.2 |
| multiqc | 1.21 | 1.25.1 |
| Tool | Old version | New version |
| --------------------- | ----------- | ----------- |
| fqcrs | 0.1.0 | |
| severus | | 1.1 |
| longphase | | 1.7.3 |
| genmod | 3.8.2 | 3.9 |
| WhatsHap | 2.2 | 2.3 |
| SVDB | | 2.8.2 |
| hifiasm | 0.19.8 | 0.20.0 |
| HiFiCNV | 0.1.7 | 1.0.0 |
| samtools/faidx | 1.2 | 1.21 |
| samtools/index | 1.2 | 1.21 |
| samtools/merge | 1.2 | 1.21 |
| stranger | 0.9.1 | 0.9.2 |
| multiqc | 1.21 | 1.25.1 |
| ensemblvep/filter_vep | | 113 |

> [!NOTE]
> Version has been updated if both old and new version information is present.
6 changes: 5 additions & 1 deletion README.md
@@ -48,7 +48,11 @@

##### Ranking

- Rank SNVs, INDELs and SVs with [GENMOD](https://github.com/Clinical-Genomics/genmod)
- Rank SNVs, INDELs, SVs and CNVs with [GENMOD](https://github.com/Clinical-Genomics/genmod)

##### Filtering

- Filter SNVs, INDELs, SVs and CNVs with [filter_vep](https://www.ensembl.org/vep) and [bcftools view](https://samtools.github.io/bcftools/bcftools.html).

## Usage

26 changes: 26 additions & 0 deletions assets/schema_hgnc_ids.json
@@ -0,0 +1,26 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://raw.githubusercontent.com/genomic-medicine-sweden/nallo/master/assets/schema_hgnc_ids.json",
"title": "genomic-medicine-sweden/nallo pipeline - params.filter_variants_hgnc_ids schema",
"description": "Schema for the file provided with params.filter_variants_hgnc_ids",
"type": "array",
"items": {
"type": "object",
"properties": {
"hgnc_id": {
"oneOf": [
{
"type": "string",
"pattern": "^\\S+$"
},
{
"type": "integer"
}
],
"exists": true,
"errorMessage": "HGNC IDs must exist with a header line `hgnc_id`, then one HGNC ID per line, either as e.g. `4826` or `HGNC:4826`."
}
},
"required": ["hgnc_id"]
}
}
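
The schema above accepts an `hgnc_id` that is either an integer or a string without whitespace. As a rough illustration only, the same rule can be sketched in Python. The function name `is_valid_hgnc_row` is hypothetical and not part of the pipeline, rows are assumed to be already parsed into dicts, and the custom `exists` keyword is not modelled:

```python
import re

# Mirrors the schema's "pattern": "^\\S+$" (one or more non-whitespace characters).
NON_WHITESPACE = re.compile(r"\S+")


def is_valid_hgnc_row(row: dict) -> bool:
    """Return True if `row` has an `hgnc_id` that is an integer or a whitespace-free string."""
    if "hgnc_id" not in row:        # "required": ["hgnc_id"]
        return False
    value = row["hgnc_id"]
    if isinstance(value, bool):     # bool is a subclass of int in Python, but not a JSON integer
        return False
    if isinstance(value, int):      # {"type": "integer"}
        return True
    if isinstance(value, str):      # {"type": "string", "pattern": "^\\S+$"}
        return NON_WHITESPACE.fullmatch(value) is not None
    return False
```

Under these assumptions both `4826` and `"HGNC:4826"` pass, while a value containing a space or a row without the `hgnc_id` key is rejected.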
77 changes: 77 additions & 0 deletions conf/modules/filter_variants.config
@@ -0,0 +1,77 @@
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Config file for defining DSL2 per module options and publishing paths
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Available keys to override module options:
ext.args = Additional arguments appended to command in module.
ext.args2 = Second set of arguments appended to command in module (multi-tool modules).
ext.args3 = Third set of arguments appended to command in module (multi-tool modules).
ext.prefix = File name prefix for output files.
----------------------------------------------------------------------------------------
*/

process {

/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Filter variants
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/

withName: '.*:FILTER_VARIANTS_SNVS:.*' {
publishDir = [
enabled: false,
]
}

withName: '.*:FILTER_VARIANTS_SNVS:ENSEMBLVEP_FILTERVEP' {
ext.args = { "--filter \"HGNC_ID in ${feature_file}\"" }
publishDir = [
enabled: false,
]
}

withName: '.*:FILTER_VARIANTS_SVS:ENSEMBLVEP_FILTERVEP' {
ext.args = { "--filter \"HGNC_ID in ${feature_file}\"" }
publishDir = [
enabled: false,
]
}

withName: '.*:FILTER_VARIANTS_SNVS:BCFTOOLS_VIEW' {
ext.prefix = { params.skip_snv_annotation ? "${meta.id}_snvs_filtered" : (params.skip_rank_variants ? "${meta.id}_snvs_annotated_filtered" : "${meta.id}_snvs_annotated_ranked_filtered") }
ext.args = { [
'--output-type z',
'--write-index=tbi',
"${params.filter_snvs_expression}"
].join(" ") }
publishDir = [
path: { "${params.outdir}/snvs/multi_sample/${meta.id}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: '.*:FILTER_VARIANTS_SVS:BCFTOOLS_VIEW' {
ext.prefix = {
def parts = []
parts << "${meta.id}"
parts << (params.skip_cnv_calling ? 'svs_merged' : 'svs_cnvs_merged')
if (!params.skip_sv_annotation) parts << 'annotated'
if (!params.skip_rank_variants) parts << 'ranked'
parts << 'filtered'
return parts.join('_')
}
ext.args = { [
'--output-type z',
'--write-index=tbi',
"${params.filter_svs_expression}"
].join(" ") }
publishDir = [
path: { "${params.outdir}/svs/family/${meta.id}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

}
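
The `ext.prefix` closures above decide how the filtered VCFs are named. As an illustration only, the same naming logic can be sketched in Python. The function names are hypothetical, and the pipeline evaluates the Groovy closures, not this code:

```python
def snv_filtered_prefix(sample_id, skip_snv_annotation=False, skip_rank_variants=False):
    """Mirror of the nested ternary used for FILTER_VARIANTS_SNVS:BCFTOOLS_VIEW."""
    if skip_snv_annotation:
        return f"{sample_id}_snvs_filtered"
    if skip_rank_variants:
        return f"{sample_id}_snvs_annotated_filtered"
    return f"{sample_id}_snvs_annotated_ranked_filtered"


def sv_filtered_prefix(sample_id, skip_cnv_calling=False,
                       skip_sv_annotation=False, skip_rank_variants=False):
    """Mirror of the parts-based closure used for FILTER_VARIANTS_SVS:BCFTOOLS_VIEW."""
    parts = [sample_id]
    parts.append("svs_merged" if skip_cnv_calling else "svs_cnvs_merged")
    if not skip_sv_annotation:
        parts.append("annotated")
    if not skip_rank_variants:
        parts.append("ranked")
    parts.append("filtered")
    return "_".join(parts)
```

With every step enabled, a family called `fam1` (a made-up ID) would get the SV prefix `fam1_svs_cnvs_merged_annotated_ranked_filtered`. Skipping a step drops the corresponding part of the name.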
3 changes: 2 additions & 1 deletion conf/test.config
@@ -18,12 +18,13 @@ params {
modules_testdata_base_path = 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/'

// Base directory for genomic-medicine-sweden/nallo test data
pipelines_testdata_base_path = 'https://raw.githubusercontent.com/genomic-medicine-sweden/test-datasets/22fb5b8a1a358df96e49f8d01a9c6e18770fbd6d/'
pipelines_testdata_base_path = 'https://raw.githubusercontent.com/genomic-medicine-sweden/test-datasets/ba720cd29322036d966ab3e4bc4c3d03e1731af5/'

// References
fasta = params.pipelines_testdata_base_path + 'reference/hg38.test.fa.gz'
input = params.pipelines_testdata_base_path + 'testdata/samplesheet.csv'
target_regions = params.pipelines_testdata_base_path + 'reference/test_data.bed'
filter_variants_hgnc_ids = params.pipelines_testdata_base_path + 'testdata/hgnc_ids.tsv'
hificnv_expected_xy_cn = params.pipelines_testdata_base_path + 'reference/expected_cn.hg38.XY.bed'
hificnv_expected_xx_cn = params.pipelines_testdata_base_path + 'reference/expected_cn.hg38.XX.bed'
hificnv_excluded_regions = params.pipelines_testdata_base_path + 'reference/empty.bed'
6 changes: 5 additions & 1 deletion docs/index.md
@@ -44,7 +44,11 @@ description: A bioinformatics analysis pipeline for long-reads from both PacBio

### Ranking

- Rank SNVs with [GENMOD](https://github.com/Clinical-Genomics/genmod)
- Rank SNVs, INDELs, SVs and CNVs with [GENMOD](https://github.com/Clinical-Genomics/genmod)

### Filtering

- Filter SNVs, INDELs, SVs and CNVs with [filter_vep](https://www.ensembl.org/vep) and [bcftools view](https://samtools.github.io/bcftools/bcftools.html).

## Usage

34 changes: 31 additions & 3 deletions docs/output.md
@@ -215,6 +215,21 @@ If the pipeline is run with phasing, the aligned reads will be happlotagged usin
| `snvs/family/{family}/{family}_snv_annotated_ranked.vcf.gz` | VCF file with annotated and ranked variants per family |
| `snvs/family/{family}/{family}_snv_annotated_ranked.vcf.gz.tbi` | Index of the ranked VCF file |

[filter_vep](https://www.ensembl.org/vep) and [bcftools view](https://samtools.github.io/bcftools/bcftools.html) can be used to filter variants.

!!!note

Filtered variants are only output if `--filter_variants_hgnc_ids` or `--filter_snvs_expression` has been used, and only family VCFs are output.

!!!tip

Filtered variants are output alongside unfiltered variants as additional files.

| Path | Description |
| ---------------------------------------------- | -------------------------------------------- |
| `snvs/{family}/{family}_*_filtered.vcf.gz` | VCF file with filtered variants for a family |
| `snvs/{family}/{family}_*_filtered.vcf.gz.tbi` | Index of the filtered VCF file |

### SVs (and CNVs)

[Severus](https://github.com/KolmogorovLab/Severus) or [Sniffles](https://github.com/fritzsedlazeck/Sniffles) is used to call structural variants.
@@ -228,9 +243,7 @@ If the pipeline is run with phasing, the aligned reads will be happlotagged usin

!!!note

Due to the complexity of SV merging strategies, SVs and CNVs are reported per family rather than per project.
SV and CNV calls are output unmerged per sample, while the family files are first merged between samples for SVs and CNVs separately,
then the merged SV and CNV files are merged again, with priority given to coordinates from the SV calls.
SV and CNV calls are output unmerged per sample, while the family files are first merged between samples for SVs and CNVs separately, then the merged SV and CNV files are merged again, with priority given to coordinates from the SV calls.

| Path | Description |
| --------------------------------------------------------------- | ------------------------------------------------------------------ |
@@ -261,6 +274,21 @@ If the pipeline is run with phasing, the aligned reads will be happlotagged usin
| `svs/family/{family_id}/{family_id}_svs_merged_annotated_ranked.vcf.gz` | VCF file with merged, annotated and ranked SVs per family (output if CNV-calling is off) |
| `svs/family/{family_id}/{family_id}_svs_merged_annotated_ranked.vcf.gz.tbi` | Index of the merged VCF file |

[filter_vep](https://www.ensembl.org/vep) and [bcftools view](https://samtools.github.io/bcftools/bcftools.html) can be used to filter variants.

!!!note

Filtered variants are only output if `--filter_variants_hgnc_ids` or `--filter_svs_expression` has been used, and only family VCFs are output.

!!!tip

Filtered variants are output alongside unfiltered variants as additional files.

| Path | Description |
| ---------------------------------------------------- | -------------------------------------------- |
| `svs/family/{family}/{family}_*_filtered.vcf.gz` | VCF file with filtered variants for a family |
| `svs/family/{family}/{family}_*_filtered.vcf.gz.tbi` | Index of the filtered VCF file |

## Visualization Tracks

[HiFiCNV](https://github.com/PacificBiosciences/HiFiCNV) is used to call CNVs, but it also produces copy number, depth, and MAF tracks that can be visualized in for example IGV.
7 changes: 5 additions & 2 deletions docs/parameters.md
@@ -51,7 +51,7 @@ Define where the pipeline should find input data and save output data.
| `genmod_score_config_snvs` | A SNV rank model config file for genmod. | `string` | | | |
| `genmod_score_config_svs` | A SV rank model config file for genmod. | `string` | | | |
| `somalier_sites` | A VCF of known polymorphic sites for somalier | `string` | | | |
| `pipelines_testdata_base_path` | Base URL or local path to location of pipeline test dataset files | `string` | https://raw.githubusercontent.com/genomic-medicine-sweden/test-datasets/22fb5b8a1a358df96e49f8d01a9c6e18770fbd6d/ | | True |
| `pipelines_testdata_base_path` | Base URL or local path to location of pipeline test dataset files | `string` | https://raw.githubusercontent.com/genomic-medicine-sweden/test-datasets/ba720cd29322036d966ab3e4bc4c3d03e1731af5/ | | True |

## Reference genome options

@@ -106,7 +106,10 @@ Workflow options specific to genomic-medicine-sweden/nallo
| `alignment_processes` | If alignment_processes is bigger than 1, input files will be split and aligned in parallel to reduce processing time. | `integer` | 8 | | |
| `snv_calling_processes` | If snv_calling_processes is bigger than 1, short variant calling will be done in parallel to reduce processing time. | `integer` | 13 | | |
| `vep_cache_version` | VEP cache version | `integer` | 110 | | |
| `vep_plugin_files` | A csv file with vep_plugin_files as header, and then paths to vep plugin files. Paths to pLI_values.txt and LoFtool_scores.txt are required. | `string` | | | |
| `vep_plugin_files` | A csv file with vep_files as header, and then paths to vep plugin files. Paths to pLI_values.txt and LoFtool_scores.txt are required. | `string` | | | |
| `filter_variants_hgnc_ids` | A tsv/csv file with a `#hgnc_id` column header, and then one numerical HGNC ID per row. E.g. `4281`, not `HGNC:4281`. | `string` | | | |
| `filter_snvs_expression` | An expression that is passed to bcftools view to filter SNVs, e.g. --filter_snvs_expression "-e 'INFO/AQ>60'" | `string` | | | |
| `filter_svs_expression` | An expression that is passed to bcftools view to filter SVs, e.g. --filter_svs_expression "-e 'INFO/AQ>60'" | `string` | | | |
| `deepvariant_model_type` | Sets the model type used for DeepVariant. This is set automatically using `--preset` by default. | `string` | PACBIO | | True |
| `minimap2_read_mapping_preset` | Sets the minimap2-preset (-x) for read alignment. This is set automatically using the pipeline `--preset` by default. | `string` | map-hifi | | True |
| `extra_modkit_options` | Extra options to modkit, used for test profile. | `string` | | | True |
20 changes: 20 additions & 0 deletions docs/usage.md
@@ -309,6 +309,26 @@ This subworkflow ranks SVs, and relies on the mapping, SV calling and SV annotat

`--skip_rank_variants`.

#### Filter variants

SNVs, INDELs, SVs and CNVs can be filtered using [filter_vep](https://www.ensembl.org/vep) and [bcftools view](https://samtools.github.io/bcftools/bcftools.html).

| Parameter | Description |
| --------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `filter_variants_hgnc_ids` <sup>1</sup> | Used by filter_vep to filter variants on HGNC IDs. Requires a tsv/bed file with a `#hgnc_id` column with one numerical HGNC ID per row. E.g. `4281`, not `HGNC:4281`.   |

<sup>1</sup> Example file for input with `--filter_variants_hgnc_ids`:

```
#hgnc_id
4865
14150
```

To pass filters to bcftools view, use `--filter_snvs_expression` and `--filter_svs_expression`, e.g. `--filter_snvs_expression "-e 'INFO/AQ>60'"`.

Variants are only filtered if at least one of these three parameters is set.
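
As a rough sketch of how these parameters might be prepared from a script, the Python below writes an HGNC ID file in the format of the example above and assembles only the filtering-related arguments. The helper names `write_hgnc_ids` and `filter_args` are hypothetical, and every other option the pipeline requires is deliberately left out:

```python
from pathlib import Path


def write_hgnc_ids(path, ids):
    """Write a `#hgnc_id` header followed by one numerical HGNC ID per line."""
    lines = ["#hgnc_id"] + [str(i) for i in ids]
    Path(path).write_text("\n".join(lines) + "\n")


def filter_args(hgnc_ids_file=None, snvs_expression=None, svs_expression=None):
    """Return the filtering arguments as a list; an empty list means no filtering is requested."""
    args = []
    if hgnc_ids_file:
        args += ["--filter_variants_hgnc_ids", str(hgnc_ids_file)]
    if snvs_expression:
        args += ["--filter_snvs_expression", snvs_expression]
    if svs_expression:
        args += ["--filter_svs_expression", svs_expression]
    return args
```

Keeping the arguments as a list, rather than one string, means that an expression such as `-e 'INFO/AQ>60'` stays a single argument if the list is later handed to something like `subprocess.run` without a shell.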

## Other highlighted parameters

- Limit SNV calling to regions in BED file (`--target_bed`).
5 changes: 5 additions & 0 deletions modules.json
@@ -87,6 +87,11 @@
"git_sha": "2f9a5431355897e299cb41928c45f51ea8410c42",
"installed_by": ["modules"]
},
"ensemblvep/filtervep": {
"branch": "master",
"git_sha": "6e3585d9ad20b41adc7d271009f8cb5e191ecab4",
"installed_by": ["modules"]
},
"ensemblvep/vep": {
"branch": "master",
"git_sha": "6e3585d9ad20b41adc7d271009f8cb5e191ecab4",
5 changes: 5 additions & 0 deletions modules/nf-core/ensemblvep/filtervep/environment.yml
49 changes: 49 additions & 0 deletions modules/nf-core/ensemblvep/filtervep/main.nf