The MGnify genomes generation pipeline generates prokaryotic and eukaryotic MAGs from reads and assemblies.
This pipeline does not support co-binning.
The pipeline performs the following tasks:
- Supports short reads.
- Changes read headers to their corresponding assembly accessions (in the ERZ namespace).
- Quality-trims the reads and removes adapters with fastp.
Afterward, the pipeline:
- Runs a decontamination step using BWA to remove any host reads. By default, it uses the hg38.fna human reference genome.
- Bins the contigs using CONCOCT, MetaBAT2, and MaxBin2.
- Refines the bins using a separately maintained subworkflow compatible with metaWRAP's bin_refinement module.
For prokaryotes:
- Conducts bin quality control with CAT, GUNC, and CheckM.
- Performs dereplication with dRep.
- Calculates coverage using MetaBAT2-calculated depths.
- Detects rRNA and tRNA using cmsearch.
- Assigns taxonomy with GTDB-Tk.
For eukaryotes:
- Estimates quality and merges bins using EukCC.
- Dereplicates MAGs using dRep.
- Calculates coverage using MetaBAT2-calculated depths.
- Assesses quality with BUSCO and EukCC.
- Assigns taxonomy with BAT.
Final steps:
- Tool versions are available in software_versions.yml.
- Generates a TSV table for the public MAG uploader.
- TODO: finish MultiQC
If this is the first time you are running Nextflow, please refer to this page.
You need to download the databases listed below and add them to config/dbs.config.
Don't forget to add this configuration to the main .nextflow.config.
- BUSCO
- CAT
- CheckM
- EukCC
- GUNC
- GTDB-Tk + ar53_metadata_r214.tsv, bac120_metadata_r214.tsv from here
- Rfam
- The reference genome of your choice for decontamination, as a .fasta.
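A minimal sketch of what config/dbs.config could look like for the databases listed above. The parameter names below (busco_db, cat_db, etc.) are assumptions for illustration, not the pipeline's actual schema — check the pipeline's parameter documentation for the exact names.

```groovy
// config/dbs.config — hypothetical parameter names; adjust to match the pipeline's schema
params {
    busco_db   = "/path/to/dbs/busco"
    cat_db     = "/path/to/dbs/CAT"
    checkm_db  = "/path/to/dbs/checkm"
    eukcc_db   = "/path/to/dbs/eukcc"
    gunc_db    = "/path/to/dbs/gunc"
    gtdbtk_db  = "/path/to/dbs/gtdbtk"   // plus ar53_metadata_r214.tsv and bac120_metadata_r214.tsv
    rfam_db    = "/path/to/dbs/rfam"
    ref_genome = "/path/to/dbs/hg38.fna" // decontamination reference of your choice
}
```

Then include it from the main .nextflow.config with `includeConfig 'config/dbs.config'`.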
If you use the EBI cluster:
- Get your raw reads and assembly study accessions;
- Download data from ENA, get the assembly and run accessions, and generate the input samplesheet:
bash download_data/fetch_data.sh \
-a assembly_study_accession \
-r reads_study_accession \
-c `pwd`/assembly_study_accession \
-f "false"
Otherwise, download your data and keep the format as recommended in the Sample sheet example section below.
nextflow run ebi-metagenomics/genomes-generation \
-profile <complete_with_profile> \
--input samplesheet.csv \
--assembly_software_file software.tsv \
--metagenome "metagenome" \
--biomes "biome,feature,material" \
--outdir <FULL_PATH_TO_OUTDIR>
--skip_preprocessing_input (default=false): skip the input data pre-processing step that renames ERZ FASTA files to ERR run accessions. Useful if you process data that is not from ENA.
--skip_prok (default=false): do not generate prokaryotic MAGs.
--skip_euk (default=false): do not generate eukaryotic MAGs.
--skip_concoct (default=false): skip the CONCOCT binner in the binning process.
--skip_maxbin2 (default=false): skip the MaxBin2 binner in the binning process.
--skip_metabat2 (default=false): skip the MetaBAT2 binner in the binning process.
--merge_pairs (default=false): merge paired-end reads in the QC step with fastp.
Each row corresponds to a specific dataset with information such as an identifier for the row, the file path to the assembly, and paths to the raw reads files (fastq_1 and fastq_2). Additionally, the assembly_accession column contains ERZ-specific accessions associated with the assembly.
id | assembly | fastq_1 | fastq_2 | assembly_accession |
---|---|---|---|---|
SRR1631112 | /path/to/ERZ1031893.fasta | /path/to/SRR1631112_1.fastq.gz | /path/to/SRR1631112_2.fastq.gz | ERZ1031893 |
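The samplesheet is a plain CSV. As a sketch, the example row from the table above can be written out like this (the paths and accessions are the illustrative values from the table — substitute your own):

```shell
# Write a one-row samplesheet.csv matching the example above.
# Paths and accessions are illustrative placeholders.
cat > samplesheet.csv <<'EOF'
id,assembly,fastq_1,fastq_2,assembly_accession
SRR1631112,/path/to/ERZ1031893.fasta,/path/to/SRR1631112_1.fastq.gz,/path/to/SRR1631112_2.fastq.gz,ERZ1031893
EOF
```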
There is an example here.
The id column is the run accession.
The assembly_software column is the tool that was used to assemble the run into the assembly (ERZ).
If you ran download_data/fetch_data.sh, that file already exists in the catalogue folder under the name per_run_assembly.tsv.
Otherwise, this script can be helpful to collect that information from ENA.
id | assembly_software |
---|---|
SRR1631112 | Assembler_vVersion |
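As a sketch, the assembly software file is a two-column TSV; the assembler name below is an illustrative placeholder, not a value the pipeline requires:

```shell
# Write a one-row software.tsv matching the example above.
# "metaSPAdes_v3.15.5" is a placeholder assembler name and version.
printf 'id\tassembly_software\n' > software.tsv
printf 'SRR1631112\tmetaSPAdes_v3.15.5\n' >> software.tsv
```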
Manually choose the most appropriate metagenome from https://www.ebi.ac.uk/ena/browser/view/408169?show=tax-tree.
Comma-separated environment parameters in the format:
"environment_biome,environment_feature,environment_material"
Use final_table_for_uploader.tsv
to upload your MAGs with the uploader.
There is an example here.
! Do not modify the existing output structure, because that TSV file contains full paths to your genomes.
final_table_for_uploader.tsv
unclassified_genomes.txt
bins
--- eukaryotes
------- run_accession
----------- bins.fa
--- prokaryotes
------- run_accession
----------- bins.fa
coverage
--- eukaryotes
------- coverage
----------- aggregated_contigs2bins.txt
------- run_accession_***_coverage
----------- coverage.tab
----------- ***_MAGcoverage.txt
--- prokaryotes
------- coverage
----------- aggregated_contigs2bins.txt
------- run_accession_***_coverage
----------- coverage.tab
----------- ***_MAGcoverage.txt
genomes_drep
--- eukaryotes
------- dereplicated_genomes.txt
------- genomes
----------- genomes.fa
--- prokaryotes
------- dereplicated_genomes.txt
------- genomes
----------- genomes.fa
intermediate_steps
--- binning
--- eukaryotes
------- eukcc
------- qs50
--- fastp
--- prokaryotes
------- gunc
------- refinement
rna
--- cluster_name
------- cluster_name_fasta
----------- ***_rRNAs.fasta
------- cluster_name_out
----------- ***_rRNAs.out
----------- ***_tRNA_20aa.out
stats
--- eukaryotes
------- busco_final_qc.csv
------- combined_busco_eukcc.qc.csv
------- eukcc_final_qc.csv
--- prokaryotes
------- checkm2
----------- aggregated_all_stats.csv
----------- aggregated_filtered_genomes.tsv
------- checkm_results_mags.tab
taxonomy
--- eukaryotes
------- all_bin2classification.txt
------- human_readable.taxonomy.csv
--- prokaryotes
------- gtdbtk_results.tar.gz
pipeline_info
--- software_versions.yml
If you use this pipeline, please make sure to cite all the software it uses.