The MGnify genomes generation pipeline generates prokaryotic and eukaryotic MAGs from reads and assemblies.
This pipeline does not support co-binning.
The pipeline performs the following tasks:
- Supports short reads.
- Changes read headers to their corresponding assembly accessions (in the ERZ namespace).
- Quality-trims the reads and removes adapters with fastp.
Afterward, the pipeline:
- Runs a decontamination step using BWA to remove any host reads. By default, it uses the hg38.fna human reference genome.
- Bins the contigs using CONCOCT, MetaBAT2, and MaxBin2.
- Refines the bins using a separately maintained subworkflow compatible with metaWRAP's bin_refinement module.
For prokaryotes:
- Conducts bin quality control with CAT, GUNC, and CheckM.
- Performs dereplication with dRep.
- Calculates coverage using MetaBAT2-calculated depths.
- Detects rRNA and tRNA using cmsearch.
- Assigns taxonomy with GTDB-Tk.
For eukaryotes:
- Estimates quality and merges bins using EukCC.
- Dereplicates MAGs using dRep.
- Calculates coverage using MetaBAT2-calculated depths.
- Assesses quality with BUSCO and EukCC.
- Assigns taxonomy with BAT.
Final steps:
- Tool versions are available in software_versions.yml.
- Generates a TSV table for the public MAG uploader.
- TODO: finish MultiQC
If this is the first time you are running Nextflow, please refer to this page.
You need to download the databases listed below and add them to config/dbs.config.
Don't forget to add this configuration to the main .nextflow.config.
- BUSCO
- CAT
- CheckM
- EukCC
- GUNC
- GTDB-Tk + ar53_metadata_r214.tsv, bac120_metadata_r214.tsv from here
- Rfam
- The reference genome of your choice for decontamination, as a .fasta.
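A minimal sketch of what config/dbs.config could look like for the databases listed above. The parameter names below (busco_db, cat_db, etc.) are assumptions for illustration, not the pipeline's actual schema — check the pipeline's parameter documentation for the exact names.

```groovy
// config/dbs.config — hypothetical parameter names; adjust to match the pipeline's schema
params {
    busco_db   = "/path/to/dbs/busco"
    cat_db     = "/path/to/dbs/CAT"
    checkm_db  = "/path/to/dbs/checkm"
    eukcc_db   = "/path/to/dbs/eukcc"
    gunc_db    = "/path/to/dbs/gunc"
    gtdbtk_db  = "/path/to/dbs/gtdbtk"   // plus ar53_metadata_r214.tsv and bac120_metadata_r214.tsv
    rfam_db    = "/path/to/dbs/rfam"
    ref_genome = "/path/to/dbs/hg38.fna" // decontamination reference of your choice
}
```

Then include it from the main .nextflow.config with `includeConfig 'config/dbs.config'`.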
If you use the EBI cluster:
- Get your raw reads and assembly study accessions;
- Download data from ENA, get the assembly and run accessions, and generate the input samplesheet:
bash download_data/fetch_data.sh \
-a assembly_study_accession \
-r reads_study_accession \
-c `pwd`/assembly_study_accession \
-f "false"
Otherwise, download your data and keep the format as recommended in the Sample sheet example section below.
nextflow run ebi-metagenomics/genomes-generation \
-profile <complete_with_profile> \
--input samplesheet.csv \
--assembly_software_file software.tsv \
--metagenome "metagenome" \
--biomes "biome,feature,material" \
--outdir <FULL_PATH_TO_OUTDIR>
--skip_preprocessing_input (default=false): skip the input data pre-processing step that renames ERZ FASTA files to ERR run accessions. Useful if you process data that is not from ENA.
--skip_prok (default=false): do not generate prokaryotic MAGs.
--skip_euk (default=false): do not generate eukaryotic MAGs.
--skip_concoct (default=false): skip the CONCOCT binner in the binning process.
--skip_maxbin2 (default=false): skip the MaxBin2 binner in the binning process.
--skip_metabat2 (default=false): skip the MetaBAT2 binner in the binning process.
--merge_pairs (default=false): merge paired-end reads in the QC step with fastp.
Each row corresponds to a specific dataset with information such as an identifier for the row, the file path to the assembly, and paths to the raw reads files (fastq_1 and fastq_2). Additionally, the assembly_accession column contains ERZ-specific accessions associated with the assembly.
id | assembly | fastq_1 | fastq_2 | assembly_accession |
---|---|---|---|---|
SRR1631112 | /path/to/ERZ1031893.fasta | /path/to/SRR1631112_1.fastq.gz | /path/to/SRR1631112_2.fastq.gz | ERZ1031893 |
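The samplesheet is a plain CSV. As a sketch, the example row from the table above can be written out like this (the paths and accessions are the illustrative values from the table — substitute your own):

```shell
# Write a one-row samplesheet.csv matching the example above.
# Paths and accessions are illustrative placeholders.
cat > samplesheet.csv <<'EOF'
id,assembly,fastq_1,fastq_2,assembly_accession
SRR1631112,/path/to/ERZ1031893.fasta,/path/to/SRR1631112_1.fastq.gz,/path/to/SRR1631112_2.fastq.gz,ERZ1031893
EOF
```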
There is an example here.
The id column is the run accession.
The assembly_software column is the tool that was used to assemble the run into the assembly (ERZ).
If you ran download_data/fetch_data.sh, that file already exists in the catalogue folder under the name per_run_assembly.tsv.
Otherwise, this script can be helpful to collect that information from ENA.
id | assembly_software |
---|---|
SRR1631112 | Assembler_vVersion |
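As a sketch, the assembly software file is a two-column TSV; the assembler name below is an illustrative placeholder, not a value the pipeline requires:

```shell
# Write a one-row software.tsv matching the example above.
# "metaSPAdes_v3.15.5" is a placeholder assembler name and version.
printf 'id\tassembly_software\n' > software.tsv
printf 'SRR1631112\tmetaSPAdes_v3.15.5\n' >> software.tsv
```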
Manually choose the most appropriate metagenome from https://www.ebi.ac.uk/ena/browser/view/408169?show=tax-tree.
Comma-separated environment parameters in the format:
"environment_biome,environment_feature,environment_material"
Use final_table_for_uploader.tsv
to upload your MAGs with the uploader.
There is an example here.
! Do not modify the existing output structure, because that TSV file contains full paths to your genomes.
final_table_for_uploader.tsv
unclassified_genomes.txt
bins
--- eukaryotes
------- run_accession
----------- bins.fa
--- prokaryotes
------- run_accession
----------- bins.fa
coverage
--- eukaryotes
------- coverage
----------- aggregated_contigs2bins.txt
------- run_accession_***_coverage
----------- coverage.tab
----------- ***_MAGcoverage.txt
--- prokaryotes
------- coverage
----------- aggregated_contigs2bins.txt
------- run_accession_***_coverage
----------- coverage.tab
----------- ***_MAGcoverage.txt
genomes_drep
--- eukaryotes
------- dereplicated_genomes.txt
------- genomes
----------- genomes.fa
--- prokaryotes
------- dereplicated_genomes.txt
------- genomes
----------- genomes.fa
intermediate_steps
--- binning
--- eukaryotes
------- eukcc
------- qs50
--- fastp
--- prokaryotes
------- gunc
------- refinement
rna
--- cluster_name
------- cluster_name_fasta
----------- ***_rRNAs.fasta
------- cluster_name_out
----------- ***_rRNAs.out
----------- ***_tRNA_20aa.out
stats
--- eukaryotes
------- busco_final_qc.csv
------- combined_busco_eukcc.qc.csv
------- eukcc_final_qc.csv
--- prokaryotes
------- checkm2
----------- aggregated_all_stats.csv
----------- aggregated_filtered_genomes.tsv
------- checkm_results_mags.tab
taxonomy
--- eukaryotes
------- all_bin2classification.txt
------- human_readable.taxonomy.csv
--- prokaryotes
------- gtdbtk_results.tar.gz
pipeline_info
--- software_versions.yml
If you use this pipeline, please make sure to cite all the software it uses.