title | excerpt
---|---
A tutorial on genome-resolved metagenomics | Extracting population genomes from millions of short metagenomic reads
- Tom Delmont, Antti Karkman, Jenni Hultman *
This tutorial describes a way to extract and curate population genomes from millions of short metagenomic reads using a methodology called "genome-resolved metagenomics", or simply "binning".
Here are a few definitions we came up with, so we can try to speak the same language today:
-A microbial species: No clear definition as of today.
-A microbial population: pool of microbial cells sharing most of their genomic content due to a very close evolutionary history (close ancestor in the tree of life).
-A metagenome: sequencing data corresponding to more than one genome.
-A metagenomic assembly: set of DNA sequences called contigs that were reconstructed from metagenomic short reads.
-A population genome: consensus genomic content of a microbial population acquired using a metagenomic assembly.
-Metagenomic binning: the act of clustering contigs from a metagenomic assembly into "bins". Note that not all bins represent population genomes (e.g., phages, plasmids, and all the things we have very little clue about).
-A CONTIGS.db: anvi'o database storing various information regarding the metagenomic assembly output (the FASTA file)
-A PROFILE.db: anvi'o database storing various information regarding the metagenomic reads the FASTA file recruited (note: this is the same FASTA file used to generate the CONTIGS.db)
-A split: a section of a contig defined by length (for improved binning experience, we display multiple splits for very long contigs in the anvi'o interface).
-A collection: collection of bins stored in the PROFILE.db (this simply links splits in the CONTIGS.db to bins)
-The working environment: Path on the server where you have all anvi'o files related to this project (it should contain the file MEGAHIT_co-assembly_2500nt_CONTIGS.db and the folder SAMPLES-MERGED)
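To make the notion of a split more concrete, here is a minimal Python sketch that cuts a contig into consecutive pieces. The function name `split_contig` and the 20,000 nt target length are assumptions made for illustration; this is not anvi'o's own code. In this sketch a short leftover is merged into the last piece, so a piece can be longer than the target length but always stays below twice that length.

```python
def split_contig(contig_length, split_length=20000):
    """Return (start, stop) coordinates of the splits for one contig.

    Hypothetical helper: a leftover shorter than split_length is merged
    into the last split instead of forming a tiny split of its own.
    """
    if contig_length < 2 * split_length:
        # Short contigs are kept as a single split.
        return [(0, contig_length)]
    n_splits = contig_length // split_length
    splits = [(i * split_length, (i + 1) * split_length) for i in range(n_splits)]
    # Extend the last split so that it reaches the end of the contig.
    splits[-1] = (splits[-1][0], contig_length)
    return splits


print(split_contig(15000))
print(split_contig(65000))
```

Short contigs are displayed as a single item, while a long contig becomes several items that can be inspected one by one in the interface.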
So, we hope you are in a good mood to discover and characterize new genomes!
Today, we are going to use these programs within the platform anvi'o:
anvi-interactive
anvi-import-collection
anvi-refine
anvi-summarize
anvi-rename-bins
You can learn details about each anvi'o program using the -h flag. For example: anvi-interactive -h
Ok.
As you may remember, we have already done all of this:
- Co-assembly of ten infant gut metagenomes with MEGAHIT software
- Creating a CONTIGS database to make sense of the assembly output (find genes, get GC-content and tetra-nucleotide frequency of contigs)
- Searching for single-copy core genes corresponding to Bacteria, Archaea and Eukarya
- Exporting genes stored in the CONTIGS database, determining their taxonomy and importing the results into the CONTIGS database
- Recruiting short reads from each metagenome using the co-assembly output and Bowtie2 software
- Creating PROFILE databases from the recruited reads (n=10), and merging them into a single PROFILE database (n=1)
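Two of the contig properties mentioned in the list above, GC-content and tetra-nucleotide frequency, are easy to compute by hand. The following Python sketch shows one simple way to do it; the function names are made up for this example, and anvi'o's own implementation may differ in its details (for instance in how it treats reverse-complement k-mers).

```python
from collections import Counter
from itertools import product


def gc_content(sequence):
    """Fraction of G and C nucleotides in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def tetranucleotide_frequency(sequence):
    """Relative frequency of each of the 256 possible 4-mers (overlapping windows)."""
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = Counter(sequence[i:i + 4] for i in range(len(sequence) - 3))
    # Windows containing N or other symbols are ignored in the total.
    total = sum(counts[k] for k in kmers)
    return {k: (counts[k] / total if total else 0.0) for k in kmers}


contig = "ATGCGCGATATATGCGC"  # hypothetical toy sequence
print(round(gc_content(contig), 3))
tnf = tetranucleotide_frequency(contig)
print(len(tnf), round(sum(tnf.values()), 3))
```

Each contig ends up with a single GC value and a vector of 256 frequencies, which is the kind of "sequence composition" signal used later for clustering.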
In case someone is missing some of the steps, this command will allow you to download the entire directory of anvi'o files for this project (this is a back-up for the workshop, but also a good starting point for someone that just wants to practice binning with anvi'o):
wget https://www.dropbox.com/sh/62yzaohdn0qb9yk/AAAWD4GgM8cFCd06CE5meSHra?dl=0
mv AAAWD4GgM8cFCd06CE5meSHra?dl=0 Anvio_Processed_Data.zip
mkdir Anvio_Processed_Data
mv Anvio_Processed_Data.zip Anvio_Processed_Data
cd Anvio_Processed_Data
unzip Anvio_Processed_Data.zip
There are currently two programs to invoke the interactive interface. We will first use this one: anvi-interactive. It basically lets you combine and visualize all the data stored in your CONTIGS and PROFILE databases.
You don't need to specify any port when running Anvi'o on your own laptop.
But when running the interactive interface from Taito, you will need your own port, because it is not possible to run two interfaces through the same port.
The available ports will be assigned to each student on the course.
Remember to change the XXXX to the port you were given.
Open a new ssh window. On a Mac:
ssh -L XXXX:localhost:XXXX [email protected]
# load bioconda and activate Anvi'o environment
module load bioconda/3
source activate anvio5
In Windows with PuTTY: in the SSH category [+], select "Tunnels". Add
Source port: XXXX Destination: localhost:XXXX
Click Add and log in to Taito as usual.
Activate the anvi'o virtual environment (source activate anvio3) and go to the anvi'o folder in the course folder in your $WRKDIR.
anvi-interactive -c MEGAHIT_co-assembly_2500nt_CONTIGS.db -p SAMPLES-MERGED/PROFILE.db --server-only -P XXXX
Then open Google Chrome and go to the address http://localhost:XXXX (the port you forwarded through the SSH tunnel).
So far, so good?
Please note that if you work from your own laptop, you should not add --server-only -P XXXX, as it prevents the window from opening in the browser.
Basically, the interface allows you to manipulate various parameters, zoom in and out in the display and learn names/values using the mouse, save/load your work, and summarize your binning results.
Overwhelmed by the interface? Here is a brief explanation to help digest this new environment:
We hope that by the end of the day all of you will be familiar with the interface.
Clicking on the "Draw" button will show the raw display. The display describes contigs organized into 11,577 splits of 40 kbp or less, along with their mean coverage values across the ten metagenomes and other relevant metadata (GC-content and taxonomy of splits especially).
Here is what you should see:
Here is the key part to remember: the ten grey layers correspond to the mean coverage values in the ten metagenomes. For each split, a black color means there is environmental signal. No black color means the split did not recruit any reads.
In addition to the main display, there is a second layer of information describing high-resolution coverage values of contigs across metagenomes. To see it, please put your mouse on a split in the display, and right click to select the inspection mode, which should open a new window. Let's do it for a few splits, and discuss the observations.
Nice.
Let's play with the different contig clustering options:
It is important to understand what they are based on. This knowledge will be key for the manual binning, and curation of population genomes in downstream analyses.
-Differential coverage: clustering solely based on the differential coverage of contigs across the samples. This metric is often stable across a genome, and will be different between genomes that do not have the same distribution patterns in the dataset.
-Sequence composition: clustering solely based on sequence composition (the tetra-nucleotide frequency) of contigs. This metric is often stable across a genome, and different between genomes from different lineages.
-Differential coverage and sequence composition: clustering using the two metrics for optimal binning resolution
Note: as a strategy, the anvi'o developers decided to trust the assembly, so splits from the same contig will remain together.
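To illustrate why differential coverage separates genomes, the Python sketch below compares the coverage profiles of three hypothetical splits across four metagenomes. The log transformation and the Euclidean distance are choices made for this example only; anvi'o's clustering configurations use their own distance and linkage settings.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def log_normalize(coverages):
    """Log-transform mean coverages so that very abundant samples do not dominate."""
    return [math.log10(c + 1) for c in coverages]


# Hypothetical mean coverages of three splits across four metagenomes.
split_a = [50.0, 0.0, 12.0, 0.0]
split_b = [48.0, 0.0, 15.0, 1.0]  # same distribution pattern as split_a
split_c = [0.0, 30.0, 0.0, 25.0]  # opposite distribution pattern

d_ab = euclidean_distance(log_normalize(split_a), log_normalize(split_b))
d_ac = euclidean_distance(log_normalize(split_a), log_normalize(split_c))
print(d_ab < d_ac)
```

Splits that rise and fall together across samples end up close to each other, which is why they tend to sit side by side in the tree at the center of the display.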
We are going to zoom in and out, and use the mouse to make selections of split clusters, using the clustering based on differential coverage and sequence composition.
The game is to find as many bins as possible with a high completion value and a low redundancy value.
Bins will exhibit different sizes and completion/redundancy values, but how do we assess the completion and redundancy values again? Good question folks.
This is thanks to the program called anvi-run-hmms, which searched for single-copy core genes that should occur once in each microbial genome. Let's say there are 100 of these genes. If all of them are detected once in the selected cluster (i.e., the bin), then the completion is 100% and the redundancy is 0%. If a few genes are detected multiple times, the redundancy value will increase. If a few genes are missing, then it is the completion value that will drop.
Anvi'o has collections dedicated to Bacteria, Archaea and Eukarya, and uses a random forest classifier to identify the domain each bin belongs to.
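The arithmetic behind these two values can be sketched in a few lines of Python. The function below is a simplified illustration with invented gene names; counting every extra copy of a gene towards redundancy is an assumption of this sketch, and the numbers anvi'o reports may be computed slightly differently.

```python
from collections import Counter


def completion_and_redundancy(expected_genes, detected_genes):
    """Estimate completion and redundancy (both in %) for one bin.

    expected_genes: names of the single-copy core genes in the collection.
    detected_genes: gene names found in the bin, one entry per hit.
    Redundancy counts every extra copy of an expected gene (assumption).
    """
    expected = set(expected_genes)
    if not expected:
        raise ValueError("expected_genes must not be empty")
    counts = Counter(g for g in detected_genes if g in expected)
    completion = 100.0 * len(counts) / len(expected)
    extra_copies = sum(n - 1 for n in counts.values())
    redundancy = 100.0 * extra_copies / len(expected)
    return completion, redundancy


genes = [f"gene_{i}" for i in range(100)]  # 100 hypothetical core genes
hits = genes[:95] + genes[:3]  # 5 genes missing, 3 genes detected twice
print(completion_and_redundancy(genes, hits))
```

In this toy bin, the five missing genes lower the completion and the three duplicated genes raise the redundancy, exactly as described above.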
Ok. You have ~10 minutes to identify what you consider are legit bins from the dataset. If you do it faster, feel free to get a coffee or help your friends.
Please do not try just to replicate this, but here is Tom's binning example:
Now that we have binned the data and saved the binning collection, the next step is to summarize the results using the anvi-summarize command line:
anvi-summarize -c MEGAHIT_co-assembly_2500nt_CONTIGS.db -p SAMPLES-MERGED/PROFILE.db -C Bin_Name -o SUMMARY_BINNING --init-gene-coverages
(Note: replace "Bin_Name" with the name you elected to use)
This step creates a folder called SUMMARY_BINNING. Please download this folder to your laptop using scp, open it and double click on the file called index.html. This should open a window in your browser.
Here is Tom's example:
Ok.
Now, let's refine some of the bins using the program anvi-refine (this is the second way to invoke the interface; it is mostly used to work on a single bin within a collection).
We might start with your Bin_01:
anvi-refine -c MEGAHIT_co-assembly_2500nt_CONTIGS.db -p SAMPLES-MERGED/PROFILE.db -C TOM_5_BINS -b Bin_01 --server-only -P 8080
And see if some splits should be removed from the bin. If changes are made, simply save the new collection in the interface.
If some of the bins remain with redundancy value >10%, please refine them again, and summarize once again (SUMMARY_BINNING-2 as anvi'o does not want to overwrite the folder SUMMARY_BINNING). The game is to have all bins with redundancy <10%.
OK! Now we have bins with low redundancy values, and some of them look like they represent population genomes!
Cool.
Create a new collection where bins are nicely renamed, and MAGs identified (MAG = metagenome-assembled genome = population genome)
anvi-rename-bins -c MEGAHIT_co-assembly_2500nt_CONTIGS.db -p SAMPLES-MERGED/PROFILE.db --collection-to-read TOM_5_BINS --collection-to-write MAGs --call-MAGs --prefix MEGAHIT --use-highest-completion-score --report-file REPORT
Bins >2 Mbp and those with a completion >70% will be renamed as MAGs (i.e., as population genomes).
Note that these thresholds can be modified through the parameters of the program (see anvi-rename-bins -h)
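The renaming rule described above can be written as a small Python predicate. The bin names and values below are hypothetical, and the sketch only encodes the two thresholds stated in this tutorial (size and completion); the actual program exposes its own options and may apply further criteria, so check its help output.

```python
def is_mag(size_mbp, completion, min_size_mbp=2.0, min_completion=70.0):
    """Return True if a bin qualifies as a MAG under the rule described above."""
    return size_mbp > min_size_mbp or completion > min_completion


# Hypothetical bins: size in Mbp and completion in %.
bins = {
    "Bin_01": {"size_mbp": 3.1, "completion": 92.0},
    "Bin_02": {"size_mbp": 1.2, "completion": 85.0},
    "Bin_03": {"size_mbp": 0.4, "completion": 12.0},
}
mags = sorted(
    name for name, b in bins.items() if is_mag(b["size_mbp"], b["completion"])
)
print(mags)
```

Here the first bin passes on both criteria, the second passes on completion alone, and the third stays a regular bin.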
And summarize the collection:
anvi-summarize -c MEGAHIT_co-assembly_2500nt_CONTIGS.db -p SAMPLES-MERGED/PROFILE.db -C MAGs -o SUMMARY_MAGs
So, how many MAGs did you get???
Now is the time for some genomic curation. This step is boring, but critical: we need to manually curate each one of the MAGs using the anvi-refine command line:
anvi-refine -c MEGAHIT_co-assembly_2500nt_CONTIGS.db -p SAMPLES-MERGED/PROFILE.db -C MAGs -b MEGAHIT_MAG_000001 --server-only -P 8080
and so on for all MAGs. After that, we will create a final collection called MAGs_FINAL:
anvi-rename-bins -c MEGAHIT_co-assembly_2500nt_CONTIGS.db -p SAMPLES-MERGED/PROFILE.db --collection-to-read MAGs --collection-to-write MAGs_FINAL --call-MAGs --prefix MEGAHIT --use-highest-completion-score --report-file REPORT
and summarize the final, curated collection:
anvi-summarize -c MEGAHIT_co-assembly_2500nt_CONTIGS.db -p SAMPLES-MERGED/PROFILE.db -C MAGs_FINAL -o SUMMARY_MAGs_FINAL
We are done with the binning and the curation of this metagenomic co-assembly output!
Tom got seven bacterial population genomes. What did you get?
Here is the end product for Tom:
And this is the perspective of these MAGs in the interface:
We should discuss why contigs from each MAG are not next to each other. This is a key advantage of manual binning, and explains well why automatic binning as it is developed today is not working well (Tom's opinion, at least).
PS: if you want, you can download the collection called collection-TOM_only_MAGs_FINAL.txt from GitHub, import it into the PROFILE.db (the program is called anvi-import-collection), and visualize it in the interface. Alternatively you can use wget https://www.dropbox.com/s/k4b3acezmdlh30j/collection-TOM_only_MAGs_FINAL.txt?dl=0 and mv collection-TOM_only_MAGs_FINAL.txt?dl=0 collection-TOM_only_MAGs_FINAL.txt.
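Before importing a collection, it can help to look at how many splits each bin contains. The Python sketch below assumes the collection is a two-column TAB-delimited text file (split name, then bin name); the split and bin names in the example are invented, and the helper `read_collection` is not part of anvi'o.

```python
import csv
import io
from collections import defaultdict


def read_collection(handle):
    """Group split names by bin from a two-column TAB-delimited text stream."""
    bins = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        split_name, bin_name = row[0], row[1]
        bins[bin_name].append(split_name)
    return dict(bins)


# Hypothetical content; with a real file, pass
# open("collection-TOM_only_MAGs_FINAL.txt", newline="") instead.
example = io.StringIO(
    "contig_0001_split_00001\tMAG_A\n"
    "contig_0001_split_00002\tMAG_A\n"
    "contig_0042_split_00001\tMAG_B\n"
)
collection = read_collection(example)
for bin_name, splits in collection.items():
    print(bin_name, len(splits))
```

This mirrors the definition given at the top of the tutorial: a collection simply links splits to bins.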
Anvi'o produced a FASTA file for each MAG, along with various parameters regarding notably their environmental detection:
Here are values for each MAG that are particularly useful:
- Mean coverage across metagenomes (how abundant a MAG is in a given metagenome)
- Detection across metagenomes (percentage of the nucleotides covered by the recruited reads)
- Table of genes and identified functions (tip: very useful when you have metatranscriptomes!)
- All identified rRNAs (do we have a 16S rRNA gene?)
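The difference between the first two values in the list above is worth a small example. The Python sketch below computes both from a list of per-nucleotide coverage values; the numbers are hypothetical and the function names are chosen for this illustration only.

```python
def mean_coverage(per_nucleotide_coverage):
    """Average number of reads covering each nucleotide."""
    if not per_nucleotide_coverage:
        raise ValueError("coverage list must not be empty")
    return sum(per_nucleotide_coverage) / len(per_nucleotide_coverage)


def detection(per_nucleotide_coverage):
    """Percentage of nucleotides covered by at least one read."""
    if not per_nucleotide_coverage:
        raise ValueError("coverage list must not be empty")
    covered = sum(1 for c in per_nucleotide_coverage if c > 0)
    return 100.0 * covered / len(per_nucleotide_coverage)


# Two hypothetical 10-nucleotide regions with the same mean coverage.
even = [10] * 10          # reads spread over the whole region
spiky = [100] + [0] * 9   # all reads pile up on a single position

print(mean_coverage(even), detection(even))
print(mean_coverage(spiky), detection(spiky))
```

Both regions have the same mean coverage, but only the first one is fully detected. A MAG with a high mean coverage and a low detection in a metagenome is probably recruiting reads on a few conserved regions rather than being truly present.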
Let's create a new directory called Population_genomes and copy the MAGs there:
mkdir Population_genomes
cp SUMMARY_MAGs_FINAL/bin_by_bin/*/*MAG*.fa Population_genomes
cd Population_genomes/
ls
OK.
Let's run CheckM from this directory now, and learn about the taxonomy of each MAG using this advanced tool.
Open a screen for CheckM and allocate resources with salloc, or use sinteractive, and run CheckM on a computing node. You will also need to load the biokit module because CheckM uses HMMER3.
We will leave the checkm tree command running for the night, so detach from the screen after it has started running.
# Activate the CheckM environment
source activate checkm_env
module load biokit
# Find the marker genes and place them in a pre-computed phylogenetic tree
checkm tree ./ -x .fa -t 1 ./TREE
# Parse the marker genes and get the taxonomy
checkm tree_qa TREE -f OUTPUT.txt
Nice. Let's look at the results:
cat OUTPUT.txt
With the type of data extracted from the metagenomes, it could be possible to perform nice "combo analyses". For instance, by combining taxonomy, phylogeny and functions like in here:
We also provide tutorials for these specific topics:
- Working with (meta)transcriptomics data
- Analysing single nucleotide variants
- Describing the anvi'o metagenomic workflow
- Describing the anvi'o pangenomic workflow for comparative genomics
- Editing figures anvi'o produces
And blogs using anvi'o: