Merge pull request #519 from nf-core/bouncy-basenji

Bouncy basenji pre-release PR
nf-core · Sep 11, 2024 · b63da73
2 parents 5e0d556 + cc34d41
commit b63da73

Showing 329 changed files with 15,411 additions and 1,131 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -65,8 +65,10 @@ jobs:
           if [[ "${{ matrix.tags }}" == "test_motus" ]]; then
             wget https://raw.githubusercontent.com/motu-tool/mOTUs/master/motus/downloadDB.py
             python downloadDB.py --no-download-progress
-            echo 'tool,db_name,db_params,db_path' > 'database_motus.csv'
-            echo "motus,db_mOTU,,db_mOTU" >> 'database_motus.csv'
+            echo 'tool,db_name,db_params,db_type,db_path' > 'database_motus.csv'
+            echo "motus,db1_mOTU,,short,db_mOTU" >> 'database_motus.csv'
+            echo "motus,db2_mOTU,,long,db_mOTU" >> 'database_motus.csv'
+            echo "motus,db3_mOTU,,short;long,db_mOTU" >> 'database_motus.csv'
             nextflow run ${GITHUB_WORKSPACE} -profile docker,${{ matrix.tags }} --databases ./database_motus.csv --outdir ./results_${{ matrix.tags }};
           else
             nextflow run ${GITHUB_WORKSPACE} -profile docker,${{ matrix.tags }} --outdir ./results_${{ matrix.tags }};

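The CI step above builds a database sheet with the new `db_type` column using `echo`. As a rough illustration only, the same sheet could be generated programmatically; the helper name `render_database_sheet` is hypothetical and not part of the pipeline, and the rows simply mirror the three mOTUs entries in the hunk.

```python
import csv
import io

# Column order taken from the CI step above; rows mirror its three mOTUs entries.
HEADER = ["tool", "db_name", "db_params", "db_type", "db_path"]
ROWS = [
    ["motus", "db1_mOTU", "", "short", "db_mOTU"],
    ["motus", "db2_mOTU", "", "long", "db_mOTU"],
    ["motus", "db3_mOTU", "", "short;long", "db_mOTU"],
]


def render_database_sheet(rows):
    """Return the database sheet as CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


sheet_text = render_database_sheet(ROWS)
print(sheet_text)
```

Writing `sheet_text` to `database_motus.csv` would give a file equivalent to the one the workflow passes to `--databases`.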
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,16 +3,42 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## dev - [unreleased]
+## v1.2dev - Bouncy Basenji [unreleased]
 
 ### `Added`
 
+- [#417](https://github.com/nf-core/taxprofiler/pull/417) - Added reference-free metagenome coverage estimation with Nonpareil (added by @jfy133)
+- [#466](https://github.com/nf-core/taxprofiler/pull/466) - Input database sheets now require a `db_type` column to distinguish between short- and long-read databases (added by @LilyAnderssonLee)
+- [#505](https://github.com/nf-core/taxprofiler/pull/505) - Add small files to the file `tower.yml` (added by @LilyAnderssonLee)
+- [#508](https://github.com/nf-core/taxprofiler/pull/508) - Add `nanoq` as a filtering tool for nanopore reads (added by @LilyAnderssonLee)
+- [#511](https://github.com/nf-core/taxprofiler/pull/511) - Add `porechop_abi` as an alternative adapter removal tool for long-read nanopore data (added by @LilyAnderssonLee)
+- [#512](https://github.com/nf-core/taxprofiler/pull/512) - Update all tools to the latest version and include nf-test (Updated by @LilyAnderssonLee & @jfy133)
+
 ### `Fixed`
 
 - [#518](https://github.com/nf-core/taxprofiler/pull/518) Fixed a bug where Oxford Nanopore FASTA input files would not be processed (❤️ to @ikarls for reporting, fixed by @jfy133)
 
 ### `Dependencies`
 
+| Tool          | Previous version | New version |
+| ------------- | ---------------- | ----------- |
+| bbmap         | 39.01            | 39.06       |
+| bowtie2       | 2.4.4            | 2.5.2       |
+| bracken       | 2.7              | 2.9         |
+| cat/fastq     | 8.30             |             |
+| diamond       | 2.0.15           | 2.1.8       |
+| ganon         | 1.5.1            | 2.0.0       |
+| kraken2       | 2.1.2            | 2.1.3       |
+| krona         | 2.8              | 2.8.1       |
+| megan         | 6.24.20          | 6.25.9      |
+| metaphlan     | 4.0.6            | 4.1.1       |
+| minimap2      | 2.24             | 2.28        |
+| motus/profile | 3.0.3            | 3.1.0       |
+| multiqc       | 1.21             | 1.24.1      |
+| nanoq         |                  | 0.10.0      |
+| samtools      | 1.17             | 1.20        |
+| untar         | 4.7              | 4.8         |
+
 ### `Deprecated`
 
 ## v1.1.8 - Augmented Akita Patch [2024-06-20]

diff --git a/CITATIONS.md b/CITATIONS.md
@@ -30,14 +30,26 @@
 
   > Schubert, M., Lindgreen, S., & Orlando, L. (2016). AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Research Notes, 9, 88. https://doi.org/10.1186/s13104-016-1900-2
 
+- [Nonpareil](https://doi.org/10.1128/mSystems.00039-18)
+
+  > Rodriguez-R, L. M., Gunturu, S., Tiedje, J. M., Cole, J. R., & Konstantinidis, K. T. (2018). Nonpareil 3: Fast Estimation of Metagenomic Coverage and Sequence Diversity. mSystems, 3(3). https://doi.org/10.1128/mSystems.00039-18
+
 - [Porechop](https://github.com/rrwick/Porechop)
 
   > Wick, R. R., Judd, L. M., Gorrie, C. L., & Holt, K. E. (2017). Completing bacterial genome assemblies with multiplex MinION sequencing. Microbial Genomics, 3(10), e000132. https://doi.org/10.1099/mgen.0.000132
 
+- [Porechop_ABI](https://github.com/bonsai-team/Porechop_ABI)
+
+  > Bonenfant, Q., Noé, L., & Touzet, H. (2023). Porechop_ABI: discovering unknown adapters in Oxford Nanopore Technology sequencing reads for downstream trimming. Bioinformatics Advances, 3(1):vbac085. https://doi.org/10.1093/bioadv/vbac085
+
 - [Filtlong](https://github.com/rrwick/Filtlong)
 
   > Wick R (2021) Filtlong, URL: https://github.com/rrwick/Filtlong
 
+- [nanoq](https://github.com/esteinig/nanoq)
+
+  > Steinig, E., & Coin, L. (2022). Nanoq: ultra-fast quality control for nanopore reads. Journal of Open Source Software, 7(69). https://doi.org/10.21105/joss.02991
+
 - [BBTools](http://sourceforge.net/projects/bbmap/)
 
   > Bushnell B. (2022) BBMap, URL: http://sourceforge.net/projects/bbmap/

diff --git a/README.md b/README.md
@@ -23,17 +23,21 @@
 
 **nf-core/taxprofiler** is a bioinformatics best-practice analysis pipeline for taxonomic classification and profiling of shotgun short- and long-read metagenomic data. It allows for in-parallel taxonomic identification of reads or taxonomic abundance estimation with multiple classification and profiling tools against multiple databases, and produces standardised output tables for facilitating results comparison between different tools and databases.
 
+The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from [nf-core/modules](https://github.com/nf-core/modules) in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!
+
+On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/taxprofiler/results).
+
 ## Pipeline summary
 
 ![](docs/images/taxprofiler_tube.png)
 
 1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) or [`falco`](https://github.com/smithlabcode/falco) as an alternative option)
 2. Performs optional read pre-processing
-   - Adapter clipping and merging (short-read: [fastp](https://github.com/OpenGene/fastp), [AdapterRemoval2](https://github.com/MikkelSchubert/adapterremoval); long-read: [porechop](https://github.com/rrwick/Porechop))
-   - Low complexity and quality filtering (short-read: [bbduk](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/), [PRINSEQ++](https://github.com/Adrian-Cantu/PRINSEQ-plus-plus); long-read: [Filtlong](https://github.com/rrwick/Filtlong))
+   - Adapter clipping and merging (short-read: [fastp](https://github.com/OpenGene/fastp), [AdapterRemoval2](https://github.com/MikkelSchubert/adapterremoval); long-read: [porechop](https://github.com/rrwick/Porechop), [Porechop_ABI](https://github.com/bonsai-team/Porechop_ABI))
+   - Low complexity and quality filtering (short-read: [bbduk](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/), [PRINSEQ++](https://github.com/Adrian-Cantu/PRINSEQ-plus-plus); long-read: [Filtlong](https://github.com/rrwick/Filtlong), [Nanoq](https://github.com/esteinig/nanoq))
    - Host-read removal (short-read: [BowTie2](http://bowtie-bio.sourceforge.net/bowtie2/); long-read: [Minimap2](https://github.com/lh3/minimap2))
    - Run merging
-3. Supports statistics for host-read removal ([Samtools](http://www.htslib.org/))
+3. Supports statistics for metagenome coverage estimation ([Nonpareil](https://nonpareil.readthedocs.io/en/latest/)) and for host-read removal ([Samtools](http://www.htslib.org/))
 4. Performs taxonomic classification and/or profiling using one or more of:
    - [Kraken2](https://ccb.jhu.edu/software/kraken2/)
    - [MetaPhlAn](https://huttenhower.sph.harvard.edu/metaphlan/)

diff --git a/assets/multiqc_config.yml b/assets/multiqc_config.yml
@@ -11,6 +11,48 @@ report_section_order:
     order: -1001
   "nf-core-taxprofiler-summary":
     order: -1002
+  general_stats:
+    order: 1000
+  fastqc:
+    order: 900
+  fastqc-1:
+    order: 800
+  fastp:
+    order: 700
+  adapterRemoval:
+    order: 600
+  nonpareil:
+    order: 500
+  porechop:
+    order: 400
+  porechop_abi:
+    order: 450
+  bbduk:
+    order: 300
+  prinseqplusplus:
+    order: 200
+  filtlong:
+    order: 100
+  nanoq:
+    order: 95
+  bowtie2:
+    order: 90
+  samtools:
+    order: 80
+  kraken:
+    order: 70
+  bracken:
+    order: 60
+  centrifuge:
+    order: 50
+  malt:
+    order: 40
+  diamond:
+    order: 30
+  kaiju:
+    order: 20
+  motus:
+    order: 10
 
 export_plots: true
 
@@ -22,11 +64,13 @@ custom_logo_title: "nf-core/taxprofiler"
 run_modules:
   - fastqc
   - adapterRemoval
-  - fastp
+  - fastp
+  - nonpareil
   - bbduk
   - prinseqplusplus
   - porechop
   - filtlong
+  - nanoq
   - bowtie2
   - minimap2
   - samtools
@@ -44,6 +88,8 @@ sp:
     fn_re: ".*(fastqc|falco)_data.txt$"
   fastqc/zip:
     fn: "*_fastqc.zip"
+  nonpareil:
+    fn: "nonpareil_all_samples.json"
 
 top_modules:
   - "fastqc":
@@ -60,13 +106,23 @@ top_modules:
       path_filters_exclude:
         - "*raw*"
       extra: "If used in this run, Falco is a drop-in replacement for FastQC producing the same output, written by Guilherme de Sena Brandine and Andrew D. Smith."
-  - "fastp"
-  - "adapterRemoval"
+  - nonpareil
   - "porechop":
+      name: "Porechop"
+      anchor: "porechop"
+      target: "Porechop"
+      path_filters:
+        - "*porechop.log"
       extra: "ℹ️: if you get the error message 'Error - was not able to plot data.' this means that porechop did not detect any adapters and therefore no statistics generated."
-  - "bbduk"
-  - "prinseqplusplus"
-  - "filtlong"
+  - "porechop":
+      name: "Porechop_ABI"
+      anchor: "porechop_abi"
+      target: "Porechop_ABI"
+      doi: "10.1093/bioadv/vbac085"
+      info: "find and remove adapters from Oxford Nanopore reads."
+      path_filters:
+        - "*porechop_abi.log"
+      extra: "ℹ️: if you get the error message 'Error - was not able to plot data.' this means that porechop_abi did not detect any adapters and therefore no statistics generated."
   - "bowtie2":
       name: "bowtie2"
   - "samtools":
@@ -95,12 +151,11 @@ top_modules:
         - "*.centrifuge.txt"
   - "malt":
       name: "MALT"
-  - "diamond"
   - "kaiju":
       name: "Kaiju"
-  - "motus"
 
-#It is not possible to set placement for custom kraken and centrifuge columns.
+# It is not possible to set placement for custom kraken
+# and centrifuge columns.
 
 table_columns_placement:
   FastQC / Falco (pre-Trimming):
@@ -130,16 +185,32 @@ table_columns_placement:
     percent_aligned: 370
     percent_collapsed: 380
     percent_discarded: 390
+  nonpareil:
+    nonpareil_R: 400
+    nonpareil_LR: 410
+    nonpareil_kappa: 420
+    nonpareil_C: 430
+    nonpareil_diversity: 440
   Porechop:
-    Input Reads: 400
-    Start Trimmed: 410
-    Start Trimmed Percent: 420
-    End Trimmed: 430
-    End Trimmed Percent: 440
-    Middle Split: 450
-    Middle Split Percent: 460
+    Input Reads: 500
+    Start Trimmed: 510
+    Start Trimmed Percent: 520
+    End Trimmed: 530
+    End Trimmed Percent: 540
+    Middle Split: 550
+    Middle Split Percent: 560
+  Porechop_ABI:
+    Input Reads: 500
+    Start Trimmed: 510
+    Start Trimmed Percent: 520
+    End Trimmed: 530
+    End Trimmed Percent: 540
+    Middle Split: 550
+    Middle Split Percent: 560
   Filtlong:
-    Target bases: 500
+    Target bases: 600
+  nanoq:
+    Read N50: 700
   BBDuk:
     Input reads: 800
     Total Removed bases percent: 810
@@ -203,6 +274,24 @@ table_columns_visible:
     percent_duplicates: False
     percent_gc: False
     percent_fails: False
+  Adapter Removal:
+    aligned_total: True
+    percent_aligned: True
+    percent_collapsed: True
+    percent_discarded: False
+  fastp:
+    pct_adapter: True
+    pct_surviving: True
+    pct_duplication: False
+    after_filtering_gc_content: False
+    after_filtering_q30_rate: False
+    after_filtering_q30_bases: False
+  nonpareil:
+    nonpareil_R: false
+    nonpareil_LR: false
+    nonpareil_kappa: true
+    nonpareil_C: true
+    nonpareil_diversity: true
   porechop:
     Input reads: False
     Start Trimmed:
@@ -211,20 +300,18 @@ table_columns_visible:
     End Trimmed Percent: True
     Middle Split: False
     Middle Split Percent: True
-  fastp:
-    pct_adapter: True
-    pct_surviving: True
-    pct_duplication: False
-    after_filtering_gc_content: False
-    after_filtering_q30_rate: False
-    after_filtering_q30_bases: False
+  porechop_abi:
+    Input reads: False
+    Start Trimmed:
+    Start Trimmed Percent: True
+    End Trimmed: False
+    End Trimmed Percent: True
+    Middle Split: False
+    Middle Split Percent: True
   Filtlong:
     Target bases: True
-  Adapter Removal:
-    aligned_total: True
-    percent_aligned: True
-    percent_collapsed: True
-    percent_discarded: False
+  nanoq:
+    ReadN50: True
   BBDuk:
     Input reads: False
     Total Removed bases Percent: False
@@ -276,6 +363,9 @@ extra_fn_clean_exts:
   - ".bbduk"
   - ".unmapped"
   - "_filtered"
+  - "porechop"
+  - "porechop_abi"
+  - "_processed"
   - type: remove
     pattern: "_falco"
 

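The `report_section_order` block added in the first hunk of this file assigns a numeric `order` to each module section. The sketch below assumes that MultiQC places sections with larger `order` values nearer the top of the report (consistent with the negative values already used for the summary sections); the function name `sections_top_to_bottom` is hypothetical, and only a subset of the values from the diff is shown.

```python
# Subset of the report_section_order values added in the diff above.
section_order = {
    "general_stats": 1000,
    "fastqc": 900,
    "nonpareil": 500,
    "porechop": 400,
    "porechop_abi": 450,
    "filtlong": 100,
    "nanoq": 95,
}


def sections_top_to_bottom(order_map):
    """Sort section names so larger order values come first (assumed behaviour)."""
    return sorted(order_map, key=lambda name: order_map[name], reverse=True)


ordered_sections = sections_top_to_bottom(section_order)
print(ordered_sections)
```

Under that assumption, `porechop_abi` (450) would be listed before `porechop` (400), which may or may not be the intended placement.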
diff --git a/assets/schema_database.json b/assets/schema_database.json
@@ -39,6 +39,12 @@
                 "errorMessage": "Invalid database db_params entry. No quotes allowed.",
                 "meta": ["db_params"]
             },
+            "db_type": {
+                "type": "string",
+                "enum": ["short", "long", "short;long"],
+                "default": "short;long",
+                "meta": ["db_type"]
+            },
             "db_path": {
                 "type": "string",
                 "exists": true,

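The `db_type` entry above restricts the column to three values and declares `short;long` as the default. A minimal sketch of that rule follows; the helper names are hypothetical, and treating an empty cell as "use the default" is an assumption about how the validator applies `default`, not something this diff confirms.

```python
# Allowed values and default copied from the db_type schema entry above.
ALLOWED_DB_TYPES = ("short", "long", "short;long")
DEFAULT_DB_TYPE = "short;long"


def resolve_db_type(value):
    """Hypothetical helper: apply the default, then enforce the enum."""
    if value is None or value == "":
        # Assumption: an empty cell falls back to the schema default.
        return DEFAULT_DB_TYPE
    if value not in ALLOWED_DB_TYPES:
        raise ValueError(
            f"Invalid db_type {value!r}; expected one of {ALLOWED_DB_TYPES}"
        )
    return value


def read_types(db_type):
    """Split a resolved db_type into the read types it covers."""
    return resolve_db_type(db_type).split(";")


print(read_types("long"))
print(read_types(""))
```

Splitting on `;` is one plausible way to decide whether a database should receive short reads, long reads, or both.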
diff --git a/assets/schema_input.json b/assets/schema_input.json
@@ -38,18 +38,21 @@
                 "type": "string",
                 "format": "file-path",
                 "pattern": "^\\S+\\.f(ast)?q\\.gz$",
+                "unique": true,
                 "errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
             },
             "fastq_2": {
                 "type": "string",
                 "format": "file-path",
                 "pattern": "^\\S+\\.f(ast)?q\\.gz$",
+                "unique": true,
                 "errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'. If not applicable, leave it empty."
             },
             "fasta": {
                 "type": "string",
                 "format": "file-path",
                 "pattern": "^\\S+\\.(f(ast)?q|fa(sta)?)\\.gz$",
+                "unique": true,
                 "errorMessage": "FastA file must be provided, cannot contain spaces and must have extension '.fa.gz' or '.fasta.gz'. If not applicable, leave it empty."
             }
         },
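
The hunk above adds `"unique": true` to the `fastq_1`, `fastq_2`, and `fasta` columns, alongside the existing filename patterns. The sketch below approximates both checks for one FASTQ column; `check_fastq_column` is a hypothetical helper, and skipping empty cells before the uniqueness check is an assumption rather than documented validator behaviour.

```python
import re
from collections import Counter

# Pattern copied from the fastq_1 / fastq_2 schema entries (JSON escaping removed).
FASTQ_PATTERN = re.compile(r"^\S+\.f(ast)?q\.gz$")


def check_fastq_column(paths):
    """Hypothetical helper: list the problems found in one samplesheet column."""
    problems = []
    # Assumption: empty cells are ignored for both checks.
    non_empty = [path for path in paths if path]
    for path in non_empty:
        if not FASTQ_PATTERN.match(path):
            problems.append(f"bad name: {path}")
    for path, count in Counter(non_empty).items():
        if count > 1:
            problems.append(f"duplicate: {path}")
    return problems


print(check_fastq_column(["s1_R1.fastq.gz", "s2_R1.fq.gz", "s1_R1.fastq.gz"]))
```

A sheet that lists the same file for two samples would be rejected under this rule, which is the practical effect of the new `unique` keys.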