Commit

Add longphase (#388)
* Add longphase

* Add tool to CHANGELOG

* Update parameters.md

* Update CHANGELOG.md

* Update main.nf.test.snap

* Revert back to using <TAG> for run example
fellen31 authored Sep 25, 2024
1 parent f0b6e73 commit c3c2ae2
Showing 34 changed files with 1,621 additions and 322 deletions.
13 changes: 9 additions & 4 deletions CHANGELOG.md
@@ -16,6 +16,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#366](https://github.com/genomic-medicine-sweden/nallo/pull/366) - Added sorting of samples when creating PED files, so the output is always the same
- [#367](https://github.com/genomic-medicine-sweden/nallo/pull/367) - Added Severus as the default SV caller, together with a `--sv_caller` parameter to choose caller
- [#371](https://github.com/genomic-medicine-sweden/nallo/pull/371) - Added `FOUND_IN=caller` tags to SV output
- [#388](https://github.com/genomic-medicine-sweden/nallo/pull/388) - Added longphase as the default phaser
- [#388](https://github.com/genomic-medicine-sweden/nallo/pull/388) - Added single-sample tbi output to the short variant calling subworkflow
- [#393](https://github.com/genomic-medicine-sweden/nallo/pull/393) - Added a new `--minimap2_read_mapping_preset` parameter

### `Changed`
@@ -32,6 +34,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#365](https://github.com/genomic-medicine-sweden/nallo/pull/365) - Changed CI to only use nf-test for pipeline tests
- [#381](https://github.com/genomic-medicine-sweden/nallo/pull/381) - Updated CI nf-test version to 0.9.0
- [#382](https://github.com/genomic-medicine-sweden/nallo/pull/382) - Changed vep_plugin_files description in schema and docs
- [#388](https://github.com/genomic-medicine-sweden/nallo/pull/388) - Changed phasing output structure and naming, and updated docs
- [#393](https://github.com/genomic-medicine-sweden/nallo/pull/393) - Changed the default minimap2 preset for PacBio data from `map-hifi` to `lr:hqae`
- [#397](https://github.com/genomic-medicine-sweden/nallo/pull/397) - Changed `pipelines_testdata_base_path` to pin a specific commit

@@ -40,6 +43,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#352](https://github.com/genomic-medicine-sweden/nallo/pull/352) - Removed the fqcrs module
- [#356](https://github.com/genomic-medicine-sweden/nallo/pull/356) - Removed filter_vep section from output documentation since it is not in the pipeline
- [#379](https://github.com/genomic-medicine-sweden/nallo/pull/379) - Removed VEP Plugins from testdata ([genomic-medicine-sweden/test-datasets#16](https://github.com/genomic-medicine-sweden/test-datasets/pull/16))
- [#388](https://github.com/genomic-medicine-sweden/nallo/pull/388) - Removed support for co-phasing SVs with HiPhase, as the officially supported caller (pbsv) is not in the pipeline

### `Fixed`

@@ -64,10 +68,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Module updates

| Tool | Old version | New version |
| ------- | ----------- | ----------- |
| fqcrs | 0.1.0 |
| severus | | 1.1 |
| Tool | Old version | New version |
| ---------- | ----------- | ----------- |
| fqcrs | 0.1.0 |
| severus | | 1.1 |
| longphase  |   | 1.7.3   |

> [!NOTE]
> Version has been updated if both old and new version information is present.
4 changes: 4 additions & 0 deletions CITATIONS.md
@@ -70,6 +70,10 @@
- [HiFiCNV](https://github.com/PacificBiosciences/HiFiCNV)

- [LongPhase](https://github.com/twolinin/longphase)

> Jyun-Hong Lin, Liang-Chi Chen, Shu-Chi Yu, Yao-Ting Huang, LongPhase: an ultra-fast chromosome-scale phasing algorithm for small and large variants, Bioinformatics, Volume 38, Issue 7, March 2022, Pages 1816–1822, https://doi.org/10.1093/bioinformatics/btac058
- [minimap2](https://academic.oup.com/bioinformatics/article/34/18/3094/4994778)

> Heng Li, Minimap2: pairwise alignment for nucleotide sequences, Bioinformatics, Volume 34, Issue 18, September 2018, Pages 3094–3100, https://doi.org/10.1093/bioinformatics/bty191
46 changes: 23 additions & 23 deletions README.md
@@ -3,54 +3,53 @@
[![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13748210.svg)](https://doi.org/10.5281/zenodo.13748210)
[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A523.04.0-23aa62.svg)](https://www.nextflow.io/)
[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
[![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
[![Launch on Seqera Platform](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Seqera%20Platform-%234256e7)](https://cloud.seqera.io/launch?pipeline=https://github.com/genomic-medicine-sweden/nallo)

## Introduction

**genomic-medicine-sweden/nallo** is a bioinformatics analysis pipeline for long-read rare disease SV/SNV identification using both PacBio and (targeted) ONT-data. Heavily influenced by best-practice pipelines such as [nf-core/nanoseq](https://github.com/nf-core/nanoseq), [nf-core/sarek](https://nf-co.re/sarek), [nf-core/raredisease](https://nf-co.re/raredisease), [PacBio Human WGS Workflow](https://github.com/PacificBiosciences/pb-human-wgs-workflow-snakemake), [epi2me-labs/wf-human-variation](https://github.com/epi2me-labs/wf-human-variation) and [brentp/rare-disease-wf](https://github.com/brentp/rare-disease-wf).
**genomic-medicine-sweden/nallo** is a bioinformatics analysis pipeline for long-reads from both PacBio and (targeted) ONT-data, focused on rare-disease. Heavily influenced by best-practice pipelines such as [nf-core/sarek](https://nf-co.re/sarek), [nf-core/raredisease](https://nf-co.re/raredisease), [nf-core/nanoseq](https://github.com/nf-core/nanoseq), [PacBio Human WGS Workflow](https://github.com/PacificBiosciences/pb-human-wgs-workflow-snakemake), [epi2me-labs/wf-human-variation](https://github.com/epi2me-labs/wf-human-variation) and [brentp/rare-disease-wf](https://github.com/brentp/rare-disease-wf).

## Pipeline summary
## Overview

<picture align="center">
<img alt="genomic-medicine-sweden/nallo workflow" src="docs/images/nallo_metromap.png">
</picture>

## Pipeline summary

##### QC

- FastQC ([`FastQC`](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
- Aligned read QC ([`cramino`](https://github.com/wdecoster/cramino))
- Depth information ([`mosdepth`](https://github.com/brentp/mosdepth))
- Read QC with [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), [cramino](https://github.com/wdecoster/cramino) and [mosdepth](https://github.com/brentp/mosdepth)

##### Alignment & assembly

- Align reads to reference ([`minimap2`](https://github.com/lh3/minimap2))
- Assemble (trio-binned) haploid genomes (HiFi only) ([`hifiasm`](https://github.com/chhylp123/hifiasm))
- Align reads to reference with [minimap2](https://github.com/lh3/minimap2)
- Assemble (trio-binned) haploid genomes with [hifiasm](https://github.com/chhylp123/hifiasm) (HiFi only)

##### Variant calling

- Short variant calling & joint genotyping of SNVs ([`deepvariant`](https://github.com/google/deepvariant) + [`GLNexus`](https://github.com/dnanexus-rnd/GLnexus))
- SV calling with [Severus](https://github.com/KolmogorovLab/Severus) or [Sniffles2](https://github.com/fritzsedlazeck/Sniffles)
- Tandem repeats (HiFi only) ([`TRGT`](https://github.com/PacificBiosciences/trgt/tree/main))
- Assembly based variant calls (HiFi only) ([`dipcall`](https://github.com/lh3/dipcall))
- CNV-calling ([`HiFiCNV`](https://github.com/PacificBiosciences/HiFiCNV))
- Call paralogous genes ([`Paraphase`](https://github.com/PacificBiosciences/paraphase))
- Call SNVs & joint genotyping with [deepvariant](https://github.com/google/deepvariant) and [GLNexus](https://github.com/dnanexus-rnd/GLnexus)
- Call SVs with [Severus](https://github.com/KolmogorovLab/Severus) or [Sniffles2](https://github.com/fritzsedlazeck/Sniffles)
- Call CNVs with [HiFiCNV](https://github.com/PacificBiosciences/HiFiCNV)
- Call tandem repeats with [TRGT](https://github.com/PacificBiosciences/trgt/tree/main) (HiFi only)
- Call paralogous genes with [Paraphase](https://github.com/PacificBiosciences/paraphase)
- Call variants from assembly with [dipcall](https://github.com/lh3/dipcall) (HiFi only)

##### Phasing and methylation

- Phase and haplotag reads ([`whatshap`](https://github.com/whatshap/whatshap) + [`hiphase`](https://github.com/PacificBiosciences/HiPhase))
- Methylation pileups ([`modkit`](https://github.com/nanoporetech/modkit))
- Phase and haplotag reads with [LongPhase](https://github.com/twolinin/longphase), [whatshap](https://github.com/whatshap/whatshap) or [HiPhase](https://github.com/PacificBiosciences/HiPhase)
- Create methylation pileups with [modkit](https://github.com/nanoporetech/modkit)

##### Annotation

- Annotate SNVs and INDELs with database(s) of choice, i.e. [gnomAD](https://gnomad.broadinstitute.org), [CADD](https://cadd.gs.washington.edu) etc. ([`echtvar`](https://github.com/brentp/echtvar) and [`VEP`](https://github.com/Ensembl/ensembl-vep))
- Annotate SNVs and INDELs with databases of choice, i.e. [gnomAD](https://gnomad.broadinstitute.org), [CADD](https://cadd.gs.washington.edu) etc. with [echtvar](https://github.com/brentp/echtvar) and [VEP](https://github.com/Ensembl/ensembl-vep)
- Annotate repeat expansions with [stranger](https://github.com/Clinical-Genomics/stranger)

##### Filtering and ranking
##### Ranking

- Rank variants ([`GENMOD`](https://github.com/Clinical-Genomics/genmod))
- Rank SNVs with [GENMOD](https://github.com/Clinical-Genomics/genmod)

## Usage

@@ -63,14 +62,15 @@ Prepare a samplesheet with input data:

```
project,sample,file,family_id,paternal_id,maternal_id,sex,phenotype
testrun,HG002,/path/to/HG002.fastq.gz,FAM1,HG003,HG004,1,2
testrun,HG005,/path/to/HG005.bam,FAM1,HG003,HG004,2,1
NIST,HG002,/path/to/HG002.fastq.gz,FAM1,HG003,HG004,1,2
NIST,HG005,/path/to/HG005.bam,FAM1,HG003,HG004,2,1
```
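
As a rough illustration (not part of this commit), the samplesheet above can be written to disk and sanity-checked before launching the pipeline. The column-count check with `awk` is an assumption of this sketch, not something the pipeline requires, and the file paths are the placeholders from the example.

```bash
set -euo pipefail

# Write the example samplesheet shown above (paths are placeholders).
cat > samplesheet.csv <<'EOF'
project,sample,file,family_id,paternal_id,maternal_id,sex,phenotype
NIST,HG002,/path/to/HG002.fastq.gz,FAM1,HG003,HG004,1,2
NIST,HG005,/path/to/HG005.bam,FAM1,HG003,HG004,2,1
EOF

# Hypothetical pre-flight check: count data rows whose number of
# comma-separated fields differs from the header row.
bad_rows=$(awk -F',' 'NR == 1 { n = NF; next } NF != n { bad++ } END { print bad + 0 }' samplesheet.csv)
echo "rows with wrong column count: ${bad_rows}"
```

A non-zero count would suggest a missing or extra comma in one of the sample rows.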

Now, you can run the pipeline using:
Supply a reference genome with `--fasta` and choose a matching `--preset` for your data (`revio`, `pacbio`, `ONT_R10`). Now, you can run the pipeline using:

```bash
nextflow run genomic-medicine-sweden/nallo -profile YOURPROFILE \
nextflow run genomic-medicine-sweden/nallo \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--preset <revio/pacbio/ONT_R10> \
--fasta <reference.fasta> \
2 changes: 1 addition & 1 deletion conf/base.config
@@ -61,7 +61,7 @@ process {
maxRetries = 2
}

withName: '.*:SAMTOOLS_MERGE' {
withName: 'SAMTOOLS_MERGE|SAMTOOLS_INDEX' {
label = 'process_medium'
}
}
49 changes: 32 additions & 17 deletions conf/modules/phasing.config
@@ -24,8 +24,7 @@ process {
]
}

withName: '.*:PHASING:HIPHASE_SNV' {
ext.prefix = { "$meta.id}_phased" }
withName: '.*:PHASING:HIPHASE' {
ext.args = { [
'--ignore-read-groups',
"--stats-file ${meta.id}_phased.stats.tsv",
@@ -35,22 +34,38 @@ process {
publishDir = [
path: { "${params.outdir}/" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : ((filename.endsWith('bam') || filename.endsWith('bai')) ? "aligned_reads/${meta.id}/${filename}" : "phasing/hiphase/snv/${meta.id}/${filename}" ) }
saveAs: { filename -> filename.equals('versions.yml') ? null : ((filename.endsWith('bam') || filename.endsWith('bai')) ? "aligned_reads/${meta.id}/${filename}" : "phased_variants/${meta.id}/${filename}" ) }
]
}

withName: '.*:PHASING:HIPHASE_SV' {
ext.prefix = { "$meta.id}_phased" }
ext.args = { [
'--ignore-read-groups',
"--stats-file ${meta.id}_phased.stats.tsv",
"--blocks-file ${meta.id}_phased.blocks.tsv",
"--summary-file ${meta.id}_phased.summary.tsv"
].join(' ') }
withName: '.*:PHASING:LONGPHASE_PHASE' {
ext.prefix = { "${meta.id}_phased" }
ext.args = [
params.preset.equals('ONT_R10') ? "--ont" : "--pb",
'--indels'
].join(' ')
publishDir = [
path: { "${params.outdir}/" },
path: { "${params.outdir}/phased_variants/${meta.id}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : ((filename.endsWith('bam') || filename.endsWith('bai')) ? "aligned_reads/${meta.id}/${filename}" : "phasing/hiphase/sv/${meta.id}/${filename}" ) }
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: '.*:PHASING:TABIX_LONGPHASE_PHASE' {
publishDir = [
path: { "${params.outdir}/phased_variants/${meta.id}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}


withName: '.*:PHASING:LONGPHASE_HAPLOTAG' {
ext.prefix = { "${meta.id}_haplotagged" }
publishDir = [
path: { "${params.outdir}/aligned_reads/${meta.id}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

@@ -61,7 +76,7 @@ process {
'--indels'
].join(' ')
publishDir = [
path: { "${params.outdir}/phasing/whatshap/phase/${meta.id}" },
path: { "${params.outdir}/phased_variants/${meta.id}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
@@ -70,14 +85,14 @@ process {
withName: '.*:PHASING:WHATSHAP_STATS' {
ext.prefix = { "${meta.id}_stats" }
publishDir = [
path: { "${params.outdir}/phasing/whatshap/stats/${meta.id}" },
path: { "${params.outdir}/qc/phasing_stats/${meta.id}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: '.*:PHASING:WHATSHAP_HAPLOTAG' {
ext.prefix = { "${meta.id}_phased" }
ext.prefix = { "${meta.id}_haplotagged" }
ext.args = [
'--ignore-read-groups',
'--tag-supplementary'
@@ -89,7 +104,7 @@ process {
]
}

withName: '.*:PHASING:SAMTOOLS_INDEX_WHATSHAP' {
withName: '.*:PHASING:SAMTOOLS_INDEX_WHATSHAP|.*:PHASING:SAMTOOLS_INDEX_LONGPHASE' {
publishDir = [
path: { "${params.outdir}/aligned_reads/${meta.id}" },
mode: params.publish_dir_mode,
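
The `LONGPHASE_PHASE` block in the `conf/modules/phasing.config` changes above picks the platform flag from `params.preset` (`--ont` for `ONT_R10`, `--pb` otherwise) and always appends `--indels`. A minimal shell sketch of that selection, for illustration only; the function name `longphase_phase_args` is hypothetical and does not exist in the pipeline:

```bash
# Hypothetical helper mirroring the ext.args logic of LONGPHASE_PHASE:
# choose the platform flag from the preset, then append --indels.
longphase_phase_args() {
    local preset="$1"
    local platform_flag="--pb"   # used for every preset except ONT_R10
    if [ "$preset" = "ONT_R10" ]; then
        platform_flag="--ont"
    fi
    printf '%s %s\n' "$platform_flag" "--indels"
}

longphase_phase_args ONT_R10   # prints: --ont --indels
longphase_phase_args revio     # prints: --pb --indels
```

Because the config tests only for `ONT_R10`, both `revio` and `pacbio` fall through to `--pb`.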
3 changes: 2 additions & 1 deletion conf/modules/short_variant_calling.config
@@ -47,7 +47,8 @@ process {
ext.args = [
'-m -',
'-w 10000',
'--output-type u',
'--output-type z',
'--write-index=tbi'
].join(' ')
}

30 changes: 6 additions & 24 deletions docs/output.md
@@ -157,40 +157,22 @@ Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQ

### Phasing

[WhatsHap](https://whatshap.readthedocs.io/en/latest/) or [HiPhase](https://github.com/PacificBiosciences/HiPhase) are used to phase variants and haplotag reads.
[LongPhase](https://github.com/twolinin/longphase), [WhatsHap](https://whatshap.readthedocs.io/en/latest/) or [HiPhase](https://github.com/PacificBiosciences/HiPhase) are used to phase variants and haplotag reads.

<details markdown="1">
<summary>Output files from WhatsHap</summary>
<summary>Output files from phasing</summary>

- `{outputdir}/aligned_reads/{sample}/`
- `{sample}_phased.bam`: BAM file with haplotags
- `{sample}_phased.bam.bai`: Index of the corresponding bam file
- `{outputdir}/phasing/whatshap/phase/{sample}/`
- `{sample}_haplotagged.bam`: BAM file with haplotags
- `{sample}_haplotagged.bam.bai`: Index of the corresponding bam file
- `{outputdir}/phased_variants/{sample}/`
- `*.vcf.gz`: VCF file with phased variants
- `*.vcf.gz.tbi`: Index of the corresponding VCF file
- `{outputdir}/phasing/whatshap/stats/{sample}/`
- `{outputdir}/qc/phasing_stats/{sample}/`
- `*.blocks.tsv`: File with phase blocks
- `*.stats.tsv`: File with phasing statistics
</details>

<details markdown="1">
<summary>Output files from HiPhase</summary>

- `{outputdir}/aligned_reads/{sample}/`

- `{sample}_phased.bam`: BAM file with haplotags
- `{sample}_phased.bam.bai`: Index of the corresponding bam file

- `{outputdir}/phasing/hiphase/{snv,sv}/{sample}/`

- `*.blocks.tsv`: File with phase blocks
- `*.stats.tsv.gz`: File with phasing statistics
- `*.vcf.gz`: VCF file with phased variants
- `*.vcf.gz.tbi`: Index of the corresponding VCF file
- `*.summary.tsv`: HiPhase summary file

</details>

### Pipeline information

[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
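
The new output layout follows from the `saveAs` closures in `conf/modules/phasing.config`. As a sketch, the routing used by the `HIPHASE` process in this commit can be mirrored in shell: `versions.yml` is not published, files ending in `bam` or `bai` go to `aligned_reads/{sample}/`, and everything else goes to `phased_variants/{sample}/`. The function name `hiphase_publish_path` is hypothetical and used only for illustration.

```bash
# Hypothetical helper mirroring the HIPHASE saveAs closure:
# prints the path relative to the output directory, or nothing
# when the file is not published.
hiphase_publish_path() {
    local sample="$1" filename="$2"
    case "$filename" in
        versions.yml) ;;   # closure returns null: file is skipped
        *bam|*bai) printf 'aligned_reads/%s/%s\n' "$sample" "$filename" ;;
        *) printf 'phased_variants/%s/%s\n' "$sample" "$filename" ;;
    esac
}

hiphase_publish_path HG002 HG002_haplotagged.bam
hiphase_publish_path HG002 HG002_phased.vcf.gz
```

The two calls print `aligned_reads/HG002/HG002_haplotagged.bam` and `phased_variants/HG002/HG002_phased.vcf.gz` respectively.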
2 changes: 1 addition & 1 deletion docs/parameters.md
@@ -97,7 +97,7 @@ Workflow options specific to genomic-medicine-sweden/nallo
| `preset` | Enable or disable certain parts of the pipeline by default, depending on data type (`revio`, `pacbio`, `ONT_R10`) | `string` | revio | True | |
| `variant_caller` | Which short variant software to use (`deepvariant`) | `string` | deepvariant | | |
| `sv_caller` | Which structural variant caller to use (`severus`, `sniffles`) | `string` | severus | | |
| `phaser` | Which phasing software to use (`whatshap`, `hiphase_snv`, `hiphase_sv`) | `string` | whatshap | | |
| `phaser` | Which phasing software to use (`longphase`, `whatshap`, `hiphase`) | `string` | longphase | | |
| `hifiasm_mode` | Run hifiasm in hifi-only or hifi-trio mode (`hifi-only`, `trio-binning`) | `string` | hifi-only | | |
| `parallel_alignments` | If parallel_alignments is bigger than 1, input files will be split and aligned in parallel to reduce processing time. | `integer` | 1 | | |
| `parallel_snv` | If parallel_snv is bigger than 1, short variant calling will be done in parallel to reduce processing time. | `integer` | 13 | | |
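
Per the updated `phaser` row above, the accepted values become `longphase`, `whatshap` and `hiphase`, with `longphase` as the default. A wrapper script could guard against the old values before launching a run. The sketch below assumes that `hiphase_snv` and `hiphase_sv` are no longer accepted, since the table no longer lists them; the function name `is_valid_phaser` is hypothetical.

```bash
# Hypothetical pre-flight check for the --phaser value.
# An empty argument falls back to the documented default (longphase).
is_valid_phaser() {
    case "${1:-longphase}" in
        longphase|whatshap|hiphase) return 0 ;;
        *) return 1 ;;
    esac
}

if ! is_valid_phaser "hiphase_sv"; then
    echo "unsupported phaser: hiphase_sv" >&2
fi
```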
10 changes: 10 additions & 0 deletions modules.json
@@ -141,6 +141,16 @@
"git_sha": "aecb06fcdb995ff3e3df7c7a1fd119367d6d1996",
"installed_by": ["modules"]
},
"longphase/haplotag": {
"branch": "master",
"git_sha": "06c8865e36741e05ad32ef70ab3fac127486af48",
"installed_by": ["modules"]
},
"longphase/phase": {
"branch": "master",
"git_sha": "06c8865e36741e05ad32ef70ab3fac127486af48",
"installed_by": ["modules"]
},
"minimap2/align": {
"branch": "master",
"git_sha": "a33ef9475558c6b8da08c5f522ddaca1ec810306",
4 changes: 2 additions & 2 deletions modules/local/hiphase/main.nf
@@ -42,7 +42,7 @@ process HIPHASE {
vcfInputs.add('--vcf')
vcfInputs.add(vcf)
vcfOutputs.add('--output-vcf')
vcfOutputs.add("${prefix}.vcf.gz")
vcfOutputs.add("${prefix}_phased.vcf.gz")

vcfNames.add(vcf.getName())
}
@@ -58,7 +58,7 @@ process HIPHASE {

if(output_bam) {
bamOutputs.add('--output-bam')
bamOutputs.add("${prefix}.bam")
bamOutputs.add("${prefix}_haplotagged.bam")
}
}

7 changes: 7 additions & 0 deletions modules/nf-core/longphase/haplotag/environment.yml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
