Dr. Igor S. Pessi
Arctic Microbial Ecology
Department of Microbiology
[email protected]
After quality trimming, chimera removal, OTU clustering and taxonomic assignment we have transformed the data into a biom file that contains the OTU table and the taxonomic assignment of each OTU.
In this tutorial we will continue our analysis of 16S rRNA gene data using the R package phyloseq. This will be a very basic overview, more information and many good tutorials can be found in their website: https://joey711.github.io/phyloseq/index.html.
First we need to install R. Just download below the appropriate version for your computer and follow the instructions:
- Windows: https://cran.rstudio.com/bin/windows/base/R-3.6.0-win.exe
- Mac OSX 10.6 (Snow Leopard) – 10.8 (Mountain Lion): https://cran.rstudio.com/bin/macosx/R-3.2.1-snowleopard.pkg
- Mac OSX 10.9 (Mavericks) and higher: https://cran.rstudio.com/bin/macosx/R-3.3.3.pkg
- Mac OSX 10.11 (El Capitan) and higher https://cran.rstudio.com/bin/macosx/R-3.6.0.pkg
Once R has finished installing, we can then install RStudio:
- Windows 7+ (64-bit): https://download1.rstudio.org/desktop/windows/RStudio-1.2.1335.exe
- Mac OSX 10.12+ (64-bit): https://download1.rstudio.org/desktop/macos/RStudio-1.2.1335.dmg
Then we need to install the packages we will use in our analyses. Open RStudio and type in the console:
# phyloseq
install.packages("BiocManager")
BiocManager::install("phyloseq")
# ggplot2
install.packages("ggplot2")
# vegan
install.packages("vegan")
DISCLAIMER: The only way to become proficient in R is by typing, making mistakes and trying again. So I strongly advise that in the examples given below you type the commands instead of copying and pasting them. You have been warned!
DISCLAIMER 2: R (as many other programming/scripting languages) is case-sensitive. That means that a and A are treated as different symbols, so pay attention when you’re typing!
First we need to tell R where our working directory is located. This is the folder where you have put the biom and metadata files. In the RStudio menu bar (on the top of the window), click in Session > Set Working Directory > Choose Directory. In the window that popped up, navigate to the folder, select it and click Open.
The next step now is to load the newly-installed packages:
library("phyloseq")
library("ggplot2")
library("vegan")
Now we have phyloseq hopefully working and we can move on reading in the data. Take care that the paths to the files and all filenames are correct. The biom fle can be read in to R with the function import_biom().
mydata <- import_biom("otu_table_added_taxa.biom")
Let's investigate the object mydata. For that, just type:
mydata
To access the OTU and taxonomy tables, we can do:
otu_table(mydata)
tax_table(mydata)
The way our taxonomy table is formatted will cause some downstream errors:
head(tax_table(mydata))
## Taxonomy Table: [6 taxa by 7 taxonomic ranks]:
## Rank1 Rank2 Rank3
## OTU68 "Bacteria(100)" "Proteobacteria(100)" "Alphaproteobacteria(100)"
## OTU9 "Bacteria(100)" "Proteobacteria(100)" "Gammaproteobacteria(100)"
## OTU1 "Bacteria(100)" "Proteobacteria(100)" "Alphaproteobacteria(100)"
## OTU633 "Bacteria(100)" "Acidobacteria(100)" "Solibacteres(100)"
## OTU41 "Bacteria(100)" "Acidobacteria(100)" "Subgroup_2(100)"
To continue we need to remove the confidence values assigned to each taxonomic level (the numbers inside parentheses). To do this, type:
tax_table(mydata) <- gsub("\\([0-9]{1,3}\\)", "", tax_table(mydata))
head(tax_table(mydata))
Also, the naming of the taxonomic ranks is a bit non-informative:
colnames(tax_table(mydata))
## [1] "Rank1" "Rank2" "Rank3" "Rank4" "Rank5" "Rank6" "Rank7"
Let's make them more informative:
colnames(tax_table(mydata)) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
colnames(tax_table(mydata))
Let's now remove the negative control from our dataset:
mydata <- prune_samples(sample_names(mydata)[1:10], mydata)
And for last but not least, let's read in the metadata:
metadata <- read.table("metadata.txt", sep="\t", header=TRUE, row.names=1)
sample_data(mydata) <- sample_data(metadata)
Now that we have a phyloseq object with all the data we need, we can have a look at it and understand how the different parts can be extracted from the object. Just by giving the name of the object you see the different objects wrapped inside it. It also tells you the the number of samples, taxa and metadata variables. Also you can see the functions to call the different objects.
mydata
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 4727 taxa and 10 samples ]
## sample_data() Sample Data: [ 10 samples by 5 sample variables ]
## tax_table() Taxonomy Table: [ 4727 taxa by 7 taxonomic ranks ]
For example, to take a look at the first rows of the taxonomy table:
head(tax_table(mydata))
## Taxonomy Table: [6 taxa by 7 taxonomic ranks]:
## Domain Phylum Class Order Family Genus Species
## OTU68 "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhodospirillales" "Acetobacteraceae" "uncultured" NA
## OTU9 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Xanthomonadales" "Xanthomonadales_Incertae_Sedis" "Acidibacter" NA
## OTU1 "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhizobiales" "Bradyrhizobiaceae" "Bradyrhizobiaceae_unclassified" NA
## OTU633 "Bacteria" "Acidobacteria" "Solibacteres" "Solibacterales" "Solibacteraceae_(Subgroup_3)" "Candidatus_Solibacter" NA
## OTU41 "Bacteria" "Acidobacteria" "Subgroup_2" "Subgroup_2_or" "Subgroup_2_fa" "Subgroup_2_ge" NA
## OTU97 "Bacteria" "Planctomycetes" "Planctomycetacia" "Planctomycetales" "Planctomycetaceae" "Isosphaera" NA
And the metadata:
sample_data(mydata)
## Sample Data: [10 samples by 5 sample variables]:
## soilwat bd loi som ph
## A001-15 21.640993 0.1628771 57.22715 42.772853 4.50
## A002-466 22.916895 0.1898886 97.23637 2.763632 5.32
## A003-103 19.388809 0.1484983 26.82480 73.175197 4.53
## A004-30 20.624524 0.8058712 94.40422 5.595779 5.14
## A005-42 21.253949 0.1280436 37.14488 62.855119 4.50
## A006-55 19.401472 0.2331613 52.19726 47.802742 4.32
## A007-69 7.018481 1.0147377 97.64073 2.359273 4.88
## A008-82 16.309944 0.9649689 91.82380 8.176205 5.19
## A009-93 17.777435 0.2347338 48.76970 51.230303 4.27
## A010-104 124.045846 0.1897786 63.53355 36.466448 5.06
We have already removed some unwanted sequences with mothur and that can also be done with phyloseq. There might be still some Chloroplast sequences in the data, so we should remove them.
# Check if there are any chloroplast sequences
subset_taxa(mydata, Class=="Chloroplast")
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 11 taxa and 10 samples ]
## sample_data() Sample Data: [ 10 samples by 5 sample variables ]
## tax_table() Taxonomy Table: [ 11 taxa by 7 taxonomic ranks ]
# And then remove them
mydata <- subset_taxa(mydata, Class!="Chloroplast")
Not all samples have the same amount of sequences or, in another words, we have different library sizes and that might influence our results. There are several ways to normalise the data and we will use the most simple one, proportions. Calculate the library sizes of your samples with the function sample_sums(), and use barplot() to visualize the differences in library size across samples.
lib_sizes <- sample_sums(mydata)
lib_sizes
barplot(lib_sizes)
Even though the differences are rather small, it is still a good idea to transform the OTU counts into proportions:
mydata_prop <- transform_sample_counts(mydata, function(x) x/sum(x))
lib_sizes_prop <- sample_sums(mydata_prop)
lib_sizes_prop
barplot(lib_sizes_prop)
Let's now do some plotting to visualize the structure of the microbial communities. We will start with a simple barplot of phylum abundances:
mydata_phylum <- tax_glom(mydata_prop, taxrank="Phylum")
plot_bar(mydata_phylum, fill="Phylum")
We can go down in the taxonomic ranks and, for example, plot the order abundances instead. But here, since there are so many orders, let's subset our dataset first to select only the 10 most abundant ones:
mydata_order <- tax_glom(mydata_prop, taxrank="Order")
mydata_order_abund <- prune_taxa(names(sort(taxa_sums(mydata_order),TRUE)[1:10]), mydata_order)
plot_bar(mydata_order_abund, fill="Order")
And if we want even narrower, at the genus level:
mydata_genus <- tax_glom(mydata_prop, taxrank="Genus")
mydata_genus_abund <- prune_taxa(names(sort(taxa_sums(mydata_genus),TRUE)[1:10]), mydata_genus)
plot_bar(mydata_genus_abund, fill="Genus")
Take a look at the graphs. What are the most abundant microbial groups in the samples? Are there any differences in microbial community composition between the samples? Do you think that the differences you are seeing are due to what?
Now we will continue with some α- and β-diversity measures. First let's plot the Shannon and inverse Simpson (and other) indices using the raw count data:
plot_richness(mydata)
Now for some β-diversity with the Bray-Curtis dissimilarity index and visualisation with a PCoA ordination. For this we will use the normalised data. We can also do another PCoA with the raw counts and see if there are any differences in the results. We will use the pH values from the metadata to color the points.
mydata_prop_ord <- ordinate(mydata_prop, method="PCoA", distance="bray")
plot_ordination(mydata_prop, mydata_prop_ord, col="ph", title="Normalised counts")
And again with the non-normalised data:
mydata_ord <- ordinate(mydata, method="PCoA", distance="bray")
plot_ordination(mydata, mydata_ord, col="ph", title="Raw counts")
To finalise, let's do another type of analysis, this time to try to figure out the relationship between the microbial community structures and the environment. For this we will use another R package (vegan), as phyloseq does not support this type of analysis. First we need to export the OTU table from phyloseq to a simpler format that vegan understands:
mydata_OTU <- as.data.frame(t(otu_table(mydata)))
Then we run a PERMANOVA test, which is a type of ANOVA for non-parametric data:
adonis(mydata_OTU ~ ., metadata, permutations = 99, method = "bray")
## Call:
## adonis(formula = mydata_OTU ~ ., data = metadata, permutations = 99, method = "bray")
##
## Permutation: free
## Number of permutations: 99
##
## Terms added sequentially (first to last)
##
## Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
## soilwat 1 0.17335 0.17335 0.82367 0.09990 0.60
## bd 1 0.21033 0.21033 0.99937 0.12121 0.45
## loi 1 0.17506 0.17506 0.83179 0.10089 0.58
## ph 1 0.12414 0.12414 0.58986 0.07154 0.83
## Residuals 5 1.05229 0.21046 0.60645
## Total 9 1.73516 1.00000
Take a look at the results. Is there any significant variable that seems to be influencing the structure of the microbial communties?
Finally let's do another type of ordination called distance-based RDA:
mydata_ord_rda <- dbrda(mydata_OTU ~ ., metadata, distance = "bray")
plot(mydata_ord_rda)
What can you conclude from this ordination? Is there any relationship between the microbial communities and the environment?
This was a very basic introduction to R and how to use it to analyse microbial community data. If you want to learn more, I recommend the material below as good starting points. But remember, you only learn R by using it!