This repository contains the beta version of the v6.0 MGnify amplicon analysis pipeline. It is, first and foremost, a refactor of the existing v5.0 amplicon analysis pipeline, replacing CWL with Nextflow as its workflow management system. It re-implements all existing closed-reference features of v5.0 and introduces several significant changes and additions.
The amplicon analysis pipeline v6.0 re-implements all of the existing features from v5.0:
- Read quality control
- rRNA sequence extraction using Infernal/cmsearch
- Closed-reference-based taxonomic classification and visualisation of rRNA using MAPseq and Krona
The amplicon analysis pipeline v6.0 also contains multiple significant changes:
- Refactoring from CWL to Nextflow for pipeline definition
- Simplification of read quality control using fastp
- Automatic amplified region inference for 16S and 18S rRNA
- Automatic primer identification, trimming, and validation
- Addition of Amplicon Sequence Variant (ASV) calling using DADA2
- Taxonomic classification and visualisation of ASVs using MAPseq and Krona to complement the existing closed-reference analysis
- Addition of PR2 as a reference database
- Updating of existing reference databases (SILVA, UNITE, ITSoneDB, Rfam)
At this stage, the pipeline supports only the following amplicons:
Amplicon | Closed-reference analysis | ASV analysis |
---|---|---|
16S | ✅ | ✅ |
18S | ✅ | ✅ |
LSU | ✅ | ✅ |
ITS | ✅ | ✅ |
Tool | Version | Purpose |
---|---|---|
fastp | 0.23.4 | Read quality control |
seqtk | 1.3-r106 | FASTQ file manipulation |
easel | 0.49 | FASTA file manipulation |
bedtools | 2.30.0 | FASTA sequence masking |
Infernal/cmsearch | 1.1.5 | rRNA sequence searching |
cmsearch_tblout_deoverlap | 0.09 | Deoverlapping of cmsearch results |
MAPseq | 2.1.1b | Reference-based taxonomic classification of rRNA |
Krona | 2.8.1 | Krona chart visualisation |
cutadapt | 4.6 | Primer trimming |
R | 4.3.3 | R programming language (runs DADA2) |
DADA2 | 1.30.0 | ASV calling |
mgnify-pipelines-toolkit | 0.1.8 | Toolkit containing various in-house processing scripts |
This pipeline uses five reference databases. The raw files available on each database's website are processed into the formats required by MAPseq and cmsearch. We provide ready-made versions of these processed files on our FTP server at the following paths:
Reference database | Version | Purpose | Processed file paths |
---|---|---|---|
SILVA | 138.1 | 16S+18S+LSU rRNA database | https://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/silva-ssu/ https://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/silva-lsu/ |
PR2 | 5.0 | Protist-focused 18S+16S rRNA database | https://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/pr2/ |
UNITE | 9.0 | ITS database | https://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/unite/ |
ITSoneDB | 1.141 | ITS database | https://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/itsonedb/ |
Rfam | 14.10 | RNA family profile database | https://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/rfam/ |
At the moment, the only prerequisites for running the pipeline are Nextflow and either Docker or Singularity, since all of the Nextflow processes use pre-built containers.
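As a quick pre-flight check, the prerequisites above can be verified from Python before launching a run. This is a minimal sketch, not part of the pipeline: the function name `missing_prerequisites` is hypothetical, and it assumes the executables are named `nextflow`, `docker`, and `singularity` on your `PATH`.

```python
import shutil


def missing_prerequisites(which=shutil.which):
    """Return a list of problems; an empty list means the tools were found.

    `which` is injectable so the check can be exercised without the tools
    installed. Executable names are assumptions about a typical install.
    """
    problems = []
    if which("nextflow") is None:
        problems.append("nextflow not found on PATH")
    # One container engine is enough: Docker or Singularity.
    if not any(which(tool) for tool in ("docker", "singularity")):
        problems.append("neither docker nor singularity found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

The check only confirms that the executables can be found. It does not verify versions or that the container engine is actually usable by your user.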
The input data for the pipeline is amplicon sequencing reads (either paired-end or single-end) in the form of FASTQ files. These files should be specified using a .csv samplesheet file with this format:
sample,fastq_1,fastq_2,single_end
SRR9674618,/path/to/reads/SRR9674618.fastq.gz,,true
SRR17062740,/path/to/reads/SRR17062740_1.fastq.gz,/path/to/reads/SRR17062740_2.fastq.gz,false
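It can be useful to sanity-check a samplesheet before submitting a run. The sketch below is not the pipeline's own validation: the function name `validate_samplesheet` is hypothetical, and the rules it enforces (a `true`/`false` value in `single_end` that agrees with whether `fastq_2` is filled in) are inferred from the two example rows above rather than taken from the pipeline's schema.

```python
import csv
import io

EXPECTED_HEADER = ["sample", "fastq_1", "fastq_2", "single_end"]

# The example samplesheet shown above, used here as a smoke test.
EXAMPLE = """sample,fastq_1,fastq_2,single_end
SRR9674618,/path/to/reads/SRR9674618.fastq.gz,,true
SRR17062740,/path/to/reads/SRR17062740_1.fastq.gz,/path/to/reads/SRR17062740_2.fastq.gz,false
"""


def validate_samplesheet(text):
    """Return a list of error strings for samplesheet CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_HEADER:
        return [f"unexpected header: {reader.fieldnames}"]
    errors = []
    for line_no, row in enumerate(reader, start=2):
        sample = row["sample"] or f"line {line_no}"
        fastq_1 = row["fastq_1"] or ""
        fastq_2 = row["fastq_2"] or ""
        flag = (row["single_end"] or "").strip().lower()
        if not fastq_1:
            errors.append(f"{sample}: fastq_1 is empty")
        # Assumed rule: single_end must agree with the presence of fastq_2.
        if flag not in ("true", "false"):
            errors.append(f"{sample}: single_end must be true or false")
        elif flag == "true" and fastq_2:
            errors.append(f"{sample}: single_end is true but fastq_2 is set")
        elif flag == "false" and not fastq_2:
            errors.append(f"{sample}: single_end is false but fastq_2 is empty")
    return errors


if __name__ == "__main__":
    print(validate_samplesheet(EXAMPLE))
```

The function works on text rather than a file path so it can be tested without touching the filesystem. To check a real file, pass it the result of `open(path).read()`.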
You can run the current version of the pipeline on SLURM like this:
nextflow run ebi-metagenomics/amplicon-pipeline \
-r main \
-profile codon_slurm \
--input /path/to/samplesheet.csv \
--outdir /path/to/outputdir
Example output directory structure for one run (ERR4334351):
├── pipeline_info
│   └── software_versions.yml
├── ERR4334351
│   ├── taxonomy-summary
│   │   ├── UNITE
│   │   │   ├── ERR4334351_UNITE.txt
│   │   │   ├── ERR4334351_UNITE.tsv
│   │   │   ├── ERR4334351_UNITE.mseq
│   │   │   └── ERR4334351.html
│   │   ├── SILVA-SSU
│   │   │   ├── ERR4334351_SILVA-SSU.txt
│   │   │   ├── ERR4334351_SILVA-SSU.tsv
│   │   │   ├── ERR4334351_SILVA-SSU.mseq
│   │   │   └── ERR4334351.html
│   │   ├── PR2
│   │   │   ├── ERR4334351_PR2.txt
│   │   │   ├── ERR4334351_PR2.tsv
│   │   │   ├── ERR4334351_PR2.mseq
│   │   │   └── ERR4334351.html
│   │   ├── ITSoneDB
│   │   │   ├── ERR4334351_ITSoneDB.txt
│   │   │   ├── ERR4334351_ITSoneDB.tsv
│   │   │   ├── ERR4334351_ITSoneDB.mseq
│   │   │   └── ERR4334351.html
│   │   ├── DADA2-SILVA
│   │   │   ├── ERR4334351_DADA2-SILVA.mseq
│   │   │   ├── ERR4334351_16S-V3-V4.html
│   │   │   └── ERR4334351_16S-V3-V4_DADA2-SILVA_asv_krona_counts.txt
│   │   └── DADA2-PR2
│   │       ├── ERR4334351_DADA2-PR2.mseq
│   │       ├── ERR4334351_16S-V3-V4.html
│   │       └── ERR4334351_16S-V3-V4_DADA2-PR2_asv_krona_counts.txt
│   ├── sequence-categorisation
│   │   ├── ERR4334351.tblout.deoverlapped
│   │   ├── ERR4334351_SSU_rRNA_bacteria.RF00177.fa
│   │   ├── ERR4334351_SSU_rRNA_archaea.RF01959.fa
│   │   └── ERR4334351_SSU.fasta
│   ├── qc
│   │   ├── ERR4334351_suffix_header_err.json
│   │   ├── ERR4334351_seqfu.tsv
│   │   ├── ERR4334351_multiqc_report.html
│   │   ├── ERR4334351.merged.fastq.gz
│   │   └── ERR4334351.fastp.json
│   ├── primer-identification
│   │   ├── ERR4334351_primer_validation.tsv
│   │   ├── ERR4334351_primers.fasta
│   │   └── ERR4334351.cutadapt.json
│   ├── asv
│   │   ├── 16S-V3-V4
│   │   │   └── ERR4334351_16S-V3-V4_asv_read_counts.tsv
│   │   ├── ERR4334351_dada2_stats.tsv
│   │   ├── ERR4334351_DADA2-SILVA_asv_tax.tsv
│   │   ├── ERR4334351_DADA2-PR2_asv_tax.tsv
│   │   └── ERR4334351_asv_seqs.fasta
│   └── amplified-region-inference
│       ├── ERR4334351.tsv
│       └── ERR4334351.16S.V3-V4.txt
├── study_multiqc_report.html
├── qc_passed_runs.csv
├── qc_failed_runs.csv
├── primer_validation_summary.json
└── manifest.json
For a more detailed description of the different output files, see the OUTPUTS_DESCRIPTION.md file.
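After a run finishes, a short script can confirm that the study-level files from the example listing are present in the output directory. This is a sketch under one loud assumption: that every run writes the same top-level files as the ERR4334351 example. The function name `missing_outputs` and the `EXPECTED_TOP_LEVEL` list are hypothetical and not part of the pipeline.

```python
from pathlib import Path

# Study-level files taken from the example output listing. Whether all of
# them are produced for every run is an assumption.
EXPECTED_TOP_LEVEL = [
    "pipeline_info/software_versions.yml",
    "study_multiqc_report.html",
    "qc_passed_runs.csv",
    "qc_failed_runs.csv",
    "primer_validation_summary.json",
    "manifest.json",
]


def missing_outputs(outdir):
    """Return the expected relative paths that do not exist under outdir."""
    root = Path(outdir)
    return [rel for rel in EXPECTED_TOP_LEVEL if not (root / rel).exists()]


if __name__ == "__main__":
    # Replace with the directory passed to --outdir.
    for rel in missing_outputs("/path/to/outputdir"):
        print(f"missing: {rel}")
```

Per-run files are left out because several of their names depend on the inferred amplified region (for example `16S-V3-V4`), so they cannot be listed ahead of time.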