Add support for generating mag samplesheet #544

Open
wants to merge 27 commits into base: dev
Changes from 17 commits

Commits (27)
b0b37eb Add chaining samplesheet with mag (sofstam, Oct 10, 2024)
590f9b1 Filter out correctly (sofstam, Oct 14, 2024)
738837e Rename parameters (sofstam, Oct 14, 2024)
3d19176 Fix linting (sofstam, Oct 14, 2024)
80bed68 Fix linting (sofstam, Oct 14, 2024)
4a9fa88 Use same schema as createtaxdb (sofstam, Oct 14, 2024)
a3dea86 Use correct name of argument (sofstam, Oct 14, 2024)
b0ceef7 Add function (sofstam, Oct 14, 2024)
b5fe3db Update nextflow_schema.json (sofstam, Oct 14, 2024)
a1cab25 Apply review suggestions (sofstam, Oct 14, 2024)
c315cae Update docs/output.md (sofstam, Oct 14, 2024)
cd136d7 Review suggestions (sofstam, Oct 14, 2024)
7acbc4b Merge branch 'generate-samplesheet' of https://github.com/sofstam/tax… (sofstam, Oct 14, 2024)
263c7d3 [automated] Fix code linting (nf-core-bot, Oct 15, 2024)
e3fa0ee Add pattern to nextflow_schema.json (sofstam, Oct 15, 2024)
c6ac0cb Prettier (sofstam, Oct 15, 2024)
aff979e Review suggestions and new function (sofstam, Oct 15, 2024)
67f33e5 Update docs/output.md (sofstam, Oct 16, 2024)
94a28f4 Update docs/output.md (sofstam, Oct 16, 2024)
892b428 Remove tests folder (sofstam, Oct 16, 2024)
e0e83c3 Merge branch 'generate-samplesheet' of https://github.com/sofstam/tax… (sofstam, Oct 16, 2024)
0abfdf7 Use the same function as detaxizer (sofstam, Oct 17, 2024)
1e00c4c LintinG (sofstam, Oct 17, 2024)
f79fc7d [automated] Fix code linting (nf-core-bot, Oct 22, 2024)
7a25178 Apply suggestions from code review (jfy133, Oct 24, 2024)
b729483 Get the samplesheet generate to generate se reads (jfy133, Oct 24, 2024)
6eeb982 Fix run column (jfy133, Oct 24, 2024)
4 changes: 4 additions & 0 deletions conf/test.config
@@ -48,6 +48,10 @@ params {
kraken2_save_reads = true
centrifuge_save_reads = true
run_profile_standardisation = true

// Generate downstream samplesheets
generate_downstream_samplesheets = true
generate_pipeline_samplesheets = "differentialabundance,mag"
}

process {
4 changes: 4 additions & 0 deletions conf/test_nothing.config
@@ -41,6 +41,10 @@ params {
run_motus = false
run_kmcp = false
run_ganon = false

// Generate downstream samplesheets
generate_downstream_samplesheets = false
generate_pipeline_samplesheets = "differentialabundance,mag"
}

process {
45 changes: 34 additions & 11 deletions docs/output.md
@@ -42,6 +42,9 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution

The pipeline can also generate downstream pipeline input samplesheets.
These are stored in `<outdir>/downstream_samplesheets`.

![](images/taxprofiler_tube.png)

### untar
@@ -130,7 +133,7 @@ You can change the default value for low complexity filtering by using the argum

By default nf-core/taxprofiler will only provide the `.settings` file if AdapterRemoval is selected.

You will only find the `.fastq` files in the results directory if you provide ` --save_preprocessed_reads`. If this is selected, you may receive different combinations of `.fastq` files for each sample depending on the input types - e.g. whether you have merged or not, or if you're supplying both single- and paired-end reads. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_reads`, in which case the reads will be in the folder `analysis_ready_reads`.
You will only find the `.fastq` files in the results directory if you provide ` --save_preprocessed_reads`. If this is selected, you may receive different combinations of `.fastq` files for each sample depending on the input types - e.g. whether you have merged or not, or if you're supplying both single- and paired-end reads. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_fastqs`, in which case the reads will be in the folder `analysis_ready_reads`.

:::warning
The resulting `.fastq` files may _not_ always be the 'final' reads that go into taxprofiling, if you also run other steps such as complexity filtering, host removal, run merging etc..
@@ -174,7 +177,7 @@ The `.npo` files can be used for re-generating and customising the plots using t

The output logs are saved in the output folder and are part of the MultiQC report. You do not normally need to check these manually.

You will only find the `.fastq` files in the results directory if you provide ` --save_preprocessed_reads`. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_reads`, in which case the reads will be in the folder `analysis_ready_reads`.
You will only find the `.fastq` files in the results directory if you provide ` --save_preprocessed_reads`. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_fastqs`, in which case the reads will be in the folder `analysis_ready_reads`.

:::warning
We do **not** recommend using Porechop if you are already trimming the adapters with ONT's basecaller Guppy.
@@ -195,7 +198,7 @@ We do **not** recommend using Porechop if you are already trimming the adapters

The output logs are saved in the output folder and are part of the MultiQC report. You do not normally need to check these manually.

You will only find the `.fastq` files in the results directory if you provide ` --save_preprocessed_reads`. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_reads`, in which case the reads will be in the folder `analysis_ready_reads`.
You will only find the `.fastq` files in the results directory if you provide ` --save_preprocessed_reads`. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_fastqs`, in which case the reads will be in the folder `analysis_ready_reads`.

### BBDuk

@@ -212,7 +215,7 @@ It is used in nf-core/taxprofiler for complexity filtering using different algor

</details>

By default nf-core/taxprofiler will only provide the `.log` file if BBDuk is selected as the complexity filtering tool. You will only find the complexity filtered reads in your results directory if you provide ` --save_complexityfiltered_reads`. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_reads`, in which case the reads will be in the folder `analysis_ready_reads`.
By default nf-core/taxprofiler will only provide the `.log` file if BBDuk is selected as the complexity filtering tool. You will only find the complexity filtered reads in your results directory if you provide ` --save_complexityfiltered_reads`. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_fastqs`, in which case the reads will be in the folder `analysis_ready_reads`.

:::warning
The resulting `.fastq` files may _not_ always be the 'final' reads that go into taxprofiling, if you also run other steps such as host removal, run merging etc..
@@ -233,7 +236,7 @@ It is used in nf-core/taxprofiler for complexity filtering using different algor

</details>

By default nf-core/taxprofiler will only provide the `.log` file if PRINSEQ++ is selected as the complexity filtering tool. You will only find the complexity filtered `.fastq` files in your results directory if you supply ` --save_complexityfiltered_reads`. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_reads`, in which case the reads will be in the folder `analysis_ready_reads`.
By default nf-core/taxprofiler will only provide the `.log` file if PRINSEQ++ is selected as the complexity filtering tool. You will only find the complexity filtered `.fastq` files in your results directory if you supply ` --save_complexityfiltered_reads`. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_fastqs`, in which case the reads will be in the folder `analysis_ready_reads`.

:::warning
The resulting `.fastq` files may _not_ always be the 'final' reads that go into taxprofiling, if you also run other steps such as host removal, run merging etc..
@@ -252,7 +255,7 @@ The resulting `.fastq` files may _not_ always be the 'final' reads that go into

</details>

You will only find the `.fastq` files in the results directory if you provide ` --save_preprocessed_reads`. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_reads`, in which case the reads will be in the folder `analysis_ready_reads`.
You will only find the `.fastq` files in the results directory if you provide ` --save_preprocessed_reads`. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_fastqs`, in which case the reads will be in the folder `analysis_ready_reads`.

:::warning
We do _not_ recommend using Filtlong if you are performing filtering of low quality reads with ONT's basecaller Guppy.
@@ -271,7 +274,7 @@ We do _not_ recommend using Filtlong if you are performing filtering of low qual

</details>

You will only find the `.fastq` files in the results directory if you provide ` --save_preprocessed_reads`. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_reads`, in which case the reads will be in the folder `analysis_ready_reads`.
You will only find the `.fastq` files in the results directory if you provide ` --save_preprocessed_reads`. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_fastqs`, in which case the reads will be in the folder `analysis_ready_reads`.

### Bowtie2

@@ -292,7 +295,7 @@ It is used with nf-core/taxprofiler to allow removal of 'host' (e.g. human) and/

</details>

By default nf-core/taxprofiler will only provide the `.log` file if host removal is turned on. You will only have a `.bam` file if you specify `--save_hostremoval_bam`. This will contain _both_ mapped and unmapped reads. You will only get FASTQ files if you specify to save `--save_hostremoval_unmapped` - these contain only unmapped reads. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_reads`, in which case the reads will be in the folder `analysis_ready_reads`.
By default nf-core/taxprofiler will only provide the `.log` file if host removal is turned on. You will only have a `.bam` file if you specify `--save_hostremoval_bam`. This will contain _both_ mapped and unmapped reads. You will only get FASTQ files if you specify to save `--save_hostremoval_unmapped` - these contain only unmapped reads. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_fastqs`, in which case the reads will be in the folder `analysis_ready_reads`.

:::info
Unmapped reads in FASTQ are only found in this directory for short-reads, for long-reads see [`samtools/fastq/`](#samtools-fastq).
@@ -345,7 +348,7 @@ Unlike Bowtie2, minimap2 does not produce an unmapped FASTQ file by itself. See

</details>

This directory will be present and contain the unmapped reads from the `.fastq` format from long-read minimap2 host removal, if `--save_hostremoval_unmapped` is supplied. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_reads`, in which case the reads will be in the folder `analysis_ready_reads`.
This directory will be present and contain the unmapped reads from the `.fastq` format from long-read minimap2 host removal, if `--save_hostremoval_unmapped` is supplied. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_fastqs`, in which case the reads will be in the folder `analysis_ready_reads`.

:::info
For short-read unmapped reads, see [bowtie2](#bowtie2).
@@ -354,7 +357,7 @@ For short-read unmapped reads, see [bowtie2](#bowtie2).
### Analysis Ready Reads

:::info
This optional results directory will only be present in the pipeline results when supplying `--save_analysis_ready_reads`.
This optional results directory will only be present in the pipeline results when supplying `--save_analysis_ready_fastqs`.
:::

<details markdown="1">
@@ -401,7 +404,7 @@ This is the last possible preprocessing step, so if you have multiple runs or li

Note that you will only find samples that went through the run merging step in this directory. Samples that had a single run or library will not go through this step of the pipeline and thus will not be present in this directory.

This directory and its FASTQ files will only be present if you supply `--save_runmerged_reads`.Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_reads`, in which case the reads will be in the folder `analysis_ready_reads`.
This directory and its FASTQ files will only be present if you supply `--save_runmerged_reads`.Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_fastqs`, in which case the reads will be in the folder `analysis_ready_reads`.

### Bracken

@@ -744,3 +747,23 @@ For example, DIAMOND output does not have a dedicated section in the MultiQC HTM
</details>

[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

### Downstream samplesheets

The pipeline can also generate input files for the following downstream
pipelines:

- [nf-core/mag](https://nf-co.re/mag)

<details markdown="1">
<summary>Output files</summary>

- `downstream_samplesheets/`
  - `mag.csv`: input sheet for nf-core/mag with paths to nf-core/taxprofiler preprocessed reads (corresponding to what is saved with `--save_analysis_ready_fastqs`)

Member:
Missing differential abundance? @LilyAnderssonLee do you think you could merge your PR into this one at this stage?

</details>

:::warning
Any generated downstream samplesheet is provided as 'best effort' and is not guaranteed to work straight out of the box!
It may not be complete (e.g. some columns may need to be manually filled in).
:::
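
For illustration, a minimal configuration sketch that would request the nf-core/mag samplesheet. The parameter names are taken from this PR. Setting `save_analysis_ready_fastqs` alongside them is an assumption based on the condition checked in the `GENERATE_DOWNSTREAM_SAMPLESHEETS` subworkflow, and the file name `custom.config` is hypothetical.

```groovy
// custom.config (hypothetical file name), passed to Nextflow with `-c custom.config`
params {
    // Switch on samplesheet generation for downstream pipelines
    generate_downstream_samplesheets = true
    // Comma-separated list of target pipelines
    generate_pipeline_samplesheets   = 'mag'
    // Assumed prerequisite: the subworkflow in this PR only builds mag.csv
    // when the analysis-ready FASTQ files are saved
    save_analysis_ready_fastqs       = true
}
```

With settings like these, the sheet would be expected at `<outdir>/downstream_samplesheets/mag.csv`, with the columns `sample,run,group,short_reads_1,short_reads_2,long_reads` built by `SAMPLESHEET_MAG`.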
6 changes: 6 additions & 0 deletions nextflow.config
@@ -196,6 +196,12 @@ params {
taxpasta_add_ranklineage = false
taxpasta_ignore_errors = false
standardisation_motus_generatebiom = false

// Generate downstream samplesheets

// Generate downstream samplesheets
generate_downstream_samplesheets = false
generate_pipeline_samplesheets = null
}

// Load base.config by default for all pipelines
23 changes: 22 additions & 1 deletion nextflow_schema.json
@@ -1,5 +1,5 @@
{
"$schema": "http://json-schema.org/draft-07/schema",
"$schema": "https://json-schema.org/draft-07/schema",
"$id": "https://raw.githubusercontent.com/nf-core/taxprofiler/master/nextflow_schema.json",
"title": "nf-core/taxprofiler pipeline parameters",
"description": "Taxonomic classification and profiling of shotgun short- and long-read metagenomic data",
@@ -712,6 +712,24 @@
},
"fa_icon": "fas fa-chart-line"
},
"generate_samplesheet_options": {
"title": "Downstream pipeline samplesheet generation options",
"type": "object",
"fa_icon": "fas fa-university",
"description": "Options for generating input samplesheets for complementary downstream pipelines.",
"properties": {
"generate_pipeline_samplesheets": {
"type": "string",
"description": "Comma-separated string (in quotes) naming the downstream pipelines to generate samplesheets for.",
"pattern": "^(differentialabundance|mag)(?:,(differentialabundance|mag)){0,1}$"
},
"generate_downstream_samplesheets": {
"type": "boolean",
"description": "Turn on generation of samplesheets for downstream pipelines.",
"fa_icon": "fas fa-toggle-on"
}
}
},
"institutional_config_options": {
"title": "Institutional config options",
"type": "object",
@@ -945,6 +963,9 @@
}
},
"allOf": [
{
"$ref": "#/definitions/generate_samplesheet_options"
},
{
"$ref": "#/definitions/input_output_options"
},
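
The `pattern` added for `generate_pipeline_samplesheets` can be sanity-checked outside the pipeline. The sketch below is plain Groovy and compares two variants of the expression, one without and one with a trailing `$`. It assumes the validator applies `pattern` as a search rather than a full match, which is how JSON Schema defines the keyword; the closure name `accepts` and the candidate values are illustrative only.

```groovy
// Two variants of the schema pattern: without and with a trailing end anchor
def withoutEndAnchor = ~/^(differentialabundance|mag)(?:,(differentialabundance|mag)){0,1}/
def withEndAnchor = ~/^(differentialabundance|mag)(?:,(differentialabundance|mag)){0,1}$/

// JSON Schema `pattern` is a search rather than a full match, so find() is used here
def accepts = { pattern, value -> pattern.matcher(value).find() }

// Illustrative inputs, including one with an unsupported pipeline name
def candidates = ['mag', 'differentialabundance,mag', 'mag,unknown', 'unknown']
candidates.each { value ->
    println "${value}: unanchored=${accepts(withoutEndAnchor, value)}, anchored=${accepts(withEndAnchor, value)}"
}
```

Under search semantics only the anchored variant rejects a value such as `mag,unknown`, because the unanchored one is satisfied by the leading `mag`.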
60 changes: 60 additions & 0 deletions subworkflows/local/generate_downstream_samplesheets/main.nf
@@ -0,0 +1,60 @@
//
// Subworkflow with functionality specific to the nf-core/mag pipeline
//

workflow SAMPLESHEET_MAG {
    take:
    ch_processed_reads

    main:
    format     = 'csv' // most common format in nf-core
    format_sep = ','


    ch_list_for_samplesheet = ch_processed_reads
        // Keep only paired-end FASTQ input; single-end and FASTA entries are filtered out
        .filter { meta, sample_id, instrument_platform, fastq_1, fastq_2, fasta -> (fastq_1 && fastq_2) && !fasta }
        .map {
            meta, sample_id, instrument_platform, fastq_1, fastq_2, fasta ->
                def sample = meta.id
                def run    = meta.run_accession // this should be optional
                def group  = ""
                def short_reads_1 = file(params.outdir).toString() + '/' + meta.id + '/' + fastq_1.getName()
                def short_reads_2 = file(params.outdir).toString() + '/' + meta.id + '/' + fastq_2.getName()
Member:
Can you not also make this support single_end using the meta.single_end information?

Or have we lost that information at this stage?

Suggested change
- def short_reads_1 = file(params.outdir).toString() + '/' + meta.id + '/' + fastq_1.getName()
- def short_reads_2 = file(params.outdir).toString() + '/' + meta.id + '/' + fastq_2.getName()
+ def short_reads_1 = file(params.outdir).toString() + '/' + meta.id + '/' + fastq_1.getName()
+ def short_reads_2 = !meta.single_end ? file(params.outdir).toString() + '/' + meta.id + '/' + fastq_2.getName() : ""

Collaborator Author:
I do not follow here. We said to filter out the single ends and fasta. Or do I misunderstand something?

Member:
Ah yeah, good point, I forgot and there wasn't a comment explaining why 😅

Please leave a comment next to the line saying only PE is supported for now, and I'll look into fixing that later :)

Member:
@sofstam did you see what @Joon-Klaps did in detaxizer for SE/PE reads? You can copy the same concept here too (basically to branch SE/PE separately, and call the channelToSamplesheet function twice ;)
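
A rough sketch of the branching idea discussed in this thread, written as plain Groovy on lists so it runs without Nextflow. It is not the detaxizer implementation. The sample items, the `outdir` value and the closure name `toRow` are hypothetical, and in the pipeline itself the split would presumably be done on channels instead.

```groovy
// Hypothetical stand-ins for channel items: [meta, fastq_1, fastq_2]
def items = [
    [[id: 'sampleA', run_accession: 'run1', single_end: false], 'sampleA_1.fastq.gz', 'sampleA_2.fastq.gz'],
    [[id: 'sampleB', run_accession: 'run1', single_end: true], 'sampleB.fastq.gz', null],
]

// Assumed output location; the real value would come from params.outdir
def outdir = 'results'

// Branch the items: the first list holds paired-end entries, the second single-end ones
def (pairedEnd, singleEnd) = items.split { !it[0].single_end }

// Build one samplesheet row per item, using the column names from SAMPLESHEET_MAG
def toRow = { item ->
    def (meta, fastq1, fastq2) = item
    return [
        sample       : meta.id,
        run          : meta.run_accession,
        group        : '',
        short_reads_1: "${outdir}/${meta.id}/${fastq1}".toString(),
        // Single-end entries get an empty second-read column
        short_reads_2: meta.single_end ? '' : "${outdir}/${meta.id}/${fastq2}".toString(),
        long_reads   : '',
    ]
}

def pairedRows = pairedEnd.collect(toRow)
def singleRows = singleEnd.collect(toRow)

def allRows = pairedRows + singleRows
allRows.each { println it.values().join(',') }
```

Keeping the two branches separate until the rows are built makes it possible either to write them to one sheet, as here, or to write two sheets with two calls to the samplesheet function.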

                def long_reads = ""
                [sample: sample, run: run, group: group, short_reads_1: short_reads_1, short_reads_2: short_reads_2, long_reads: long_reads]
        }
        .tap { ch_colnames }

    channelToSamplesheet(ch_list_for_samplesheet, "${params.outdir}/downstream_samplesheets/mag", format)
}

workflow GENERATE_DOWNSTREAM_SAMPLESHEETS {

    take:
    ch_processed_reads

    main:
    def downstreampipeline_names = params.generate_pipeline_samplesheets.split(",")

    if (downstreampipeline_names.contains('mag') && params.save_analysis_ready_fastqs) {
        SAMPLESHEET_MAG(ch_processed_reads)
    }

}
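
The default for `generate_pipeline_samplesheets` in `nextflow.config` is `null`, and calling `.split(",")` on `null` would raise a `NullPointerException` if this workflow were reached with the parameter unset. Whether that can happen depends on how the workflow is invoked, which this diff does not show. A possible null-tolerant way to parse the value, as a plain Groovy sketch with a hypothetical helper name:

```groovy
// Hypothetical helper: turn the raw parameter value into a list of pipeline names.
// A null or empty value yields an empty list instead of an exception.
def parsePipelineNames = { rawValue ->
    (rawValue ?: '').tokenize(',')*.trim().findAll { it }
}

println parsePipelineNames('differentialabundance,mag')
println parsePipelineNames(null)
println parsePipelineNames('mag').contains('mag')
```

`tokenize` returns a `List`, so `contains('mag')` can be used on the result in the same way as in the workflow above.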

// Constructs the header string and then the strings of each row, and
// collects them into a single samplesheet file named "<path>.<format>"
def channelToSamplesheet(ch_list_for_samplesheet, path, format) {
    // Column separator for the requested output format
    def format_sep = ["csv": ",", "tsv": "\t", "txt": "\t"][format]

    def ch_header = ch_list_for_samplesheet

    ch_header
        .first()
        .map { it.keySet().join(format_sep) }
        .concat(ch_list_for_samplesheet.map { it.values().join(format_sep) })
        .collectFile(
            name: "${path}.${format}",
            newLine: true,
            sort: false
        )
}
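
To make the header and row logic of `channelToSamplesheet` easier to follow, here is the same idea on an ordinary list of maps in plain Groovy. It is a sketch, not the pipeline code: the closure name `toDelimitedText` and the example rows are hypothetical, and the result is returned as a string instead of being written with `collectFile`.

```groovy
// Example rows, shaped like the maps emitted by SAMPLESHEET_MAG (values are made up)
def rows = [
    [sample: 'sampleA', run: 'run1', group: '', short_reads_1: 'a_1.fastq.gz', short_reads_2: 'a_2.fastq.gz', long_reads: ''],
    [sample: 'sampleB', run: 'run1', group: '', short_reads_1: 'b_1.fastq.gz', short_reads_2: 'b_2.fastq.gz', long_reads: ''],
]

// Header comes from the keys of the first row, each data line from a row's values
def toDelimitedText = { List<Map> rowMaps, String format ->
    def separator = [csv: ',', tsv: '\t', txt: '\t'][format]
    def header = rowMaps.first().keySet().join(separator)
    def body = rowMaps.collect { it.values().join(separator) }
    return ([header] + body).join('\n')
}

println toDelimitedText(rows, 'csv')
```

Groovy map literals keep insertion order, so the columns line up only if every row map is built with its keys in the same order, as the `.map` closure in `SAMPLESHEET_MAG` does.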