Overview
Taxonomic diversity can be measured using a variety of statistical models. Alpha diversity represents the species diversity (richness, evenness, compositional complexity) within experimental samples and treatment groups.
'Diversity' is based on the amplicon sequence variants (ASV) detected by DADA2 per sample.
Alpha diversity was estimated using Shannon’s index, Simpson's index, Chao1 richness and observed ASVs. Statistical differences between treatment groups for each of the diversity indices was calculated using pairwise Kruskal-Wallis. Species richness is also plotted as rarefaction curves.
Multiple diversity indices were used because each have different strengths and weaknesses and are better contextually - i.e. depending on the structure and nature of your dataset. For example, Shannon’s index is more sensitive to sample size and influenced by rare ASVs. Simpson's index is less sensitive to sample size and or rare ASVs, but is influenced by dominance/abundance of ASVs. The multiple methods are included in this report so the researcher can decide which is most appropriate for their experimental design.
A good overview of most of these methods can be seen in this paper:
https://www.jmb.or.kr/submission/Journal/027/JMB027-12-02_FDOC_2.pdf
The plots and tables in each section (where each section = a variable) are as follows:
Shannon's index calculates richness and diversity using a natural logarithm and accounts for both abundance and evenness of the taxa present. Shannon's index is a universally accepted index for diversity as it accounts for entropy in an ecosystem or in representative samples.
Shannon, C.E. and Weaver, W. (1949). “The mathematical theory of communication”. University of Illonois Press, Champaign, Illonois.
Simpson's index is used to estimate dominance of the species but does not account for species richness.
Simpson, E. H. (1949). Measurement of diversity. nature, 163(4148), 688-688.
Chao1 richness is a non-parametric estimator of diversity, based on abundance.
Chao, A. (1984). Nonparametric estimation of the number of classes in a population. Scandinavian Journal of statistics, 265-270.
Observed ASVs calculates diversity based on the number of distinct Amplicon Sequence Variants. Originally this method was developed for observed OTUs, as seen in the below paper.
DeSantis, T.Z., Hugenholtz, P., Larsen, N., Rojas, M., Brodie, E.L., Keller, K. Huber, T., Davis, D., Hu, P., Andersen, G.L. (2006). “Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB”. Applied and Environmental Microbiology (72): 5069–5072.
Combined Shannon and Observed ASV. An additional plot combing both Shannon’s index and Observed ASV indices has been included, to compare similarities and differences between these results. As each index uses different units, results for both have been normalised between 0 and 1.
Kruskal-Wallis rank sum test is a rank-based nonparametric test that can be used to determine if there are statistically significant differences between two or more groups. This statistical analysis is provided for each plot, to estimate if there is a significant difference (q.value < 0.05) between all groups.
Pairwise Wilcoxon rank sum test (AKA: Mann-Whitney test is the same as the Kruskal-Wallis test, but applied pairwise to each group (technically, The Kruskal-Wallis test is the generalization of the Wilcoxon rank sum test).
3. Preparing your data
Import the samples table
When you ran your sequences through the ampliseq pipeline, you submitted the samples with a metadata file. This file contains information on your samples and variables. We need to import this metadata file to run our analysis on selected variables.
samples_table <- read.table("metadata.tsv", sep = "\t", header = T)
Have a look at your samples table and variables (metadata):
samples_table
Import the ASV abundance table
First import the unfiltered ASV table (ampvis2 does internal filtering).
asvtable <- read.table("./results/abundance_table/unfiltered/feature-table.tsv", check.names = FALSE, sep = "\t", skip = 1, stringsAsFactors = FALSE, comment.char = "", header = T) colnames(asvtable)[1] <- "ASV"
Have a look at the top few rows of the ASV table, to see if it looks right. The sample IDs should be the column names and the ASV IDs in the first column. All the other columns should contain numbers (i.e. the count of the number of times each ASV was found in each sample).
head(asvtable)
Important: always run this next code cell, even if you haven't subset your data If you're working with a subset of your data (see section 3) you'll need to choose a name for your subgroup (e.g. subgroup <- "high_fat")
or if you haven't subset your data (i.e. you're working with all your samples) subgroup <- "all"
. This is for naming output plots and files by subset group, so you're not overwriting them each time you run a subset of your data.
subgroup <- "all"
Convert the imported data to ampvis2 format
The following code cell manipulates the data in a variety of ways (see the in-code #comments for explanations) to prepare the data for conversion to an ampliseq2 database.
Read in the taxonomy data - i.e. the taxonomic assignments for each ASV
mytax <- read.table("./results/taxonomy/taxonomy.tsv", sep = "\t", header = T, quote = "")
Each taxonomic level needs to be in a separate column, and in taxonomy.tsv they are all combined and separated by ';' so we need to split by that delimiter. Use tidyr::separate function for this.
mytaxsplit <- tidyr::separate(data = mytax, col = Taxon, into = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = ";")
Combine the taxonomy and ASV tables
asv_table <- cbind(asvtable, mytaxsplit) # Also remove the Feature.ID and Confidence columns, as they are not needed asv_table <- subset(asv_table, select=-c(Feature.ID, Confidence))
The taxonomy data from ampliseq is prefaced by D_0_
D_1_
etc but ampvis2 expects taxonomy assignments to be k_
p_
, etc (kingdom, phylum, etc). So we need to convert these using gsub()
asv_table$Kingdom <- gsub("D_0", "k", asv_table$Kingdom) asv_table$Phylum <- gsub("D_1", "p", asv_table$Phylum) asv_table$Class <- gsub("D_2", "c", asv_table$Class) asv_table$Order <- gsub("D_3", "o", asv_table$Order) asv_table$Family <- gsub("D_4", "f", asv_table$Family) asv_table$Genus <- gsub("D_5", "g", asv_table$Genus) asv_table$Species <- gsub("D_6", "s", asv_table$Species)
Now combine the samples data with the ASV table using amp_load(). This creates an ampvis2 database that can be used by ampvis2
ampvisdata <- amp_load(otutable = asv_table, metadata = samples_table)
4. Choosing a categorical variable to analyse
In your metadata you'll usually have multiple variables. These need to be analysed individually, by selecting the variable in this section, then running the remaining analysis sections on this chosen variable. You can then re-run the analysis on another variable by returning to this section, changing the variable name, then running again the remaining analysis sections.
NOTE This section is for choosing categorical variables only. See section 8 onward for analysis of continous (i.e. numeric) variables.
You can view your variables as column names in your samples_table:
colnames(samples_table)
Enter the column name of the variable you want to analyse (i.e. change group <- "Myvariable"
in the below cell to your chosen variable's column name). This has to be exactly the same as the column name, including capitalisation, characters such as underscores, etc.
group <- "Time"
Ordering your variable
The plotting done in ampvis2 is done by the ggplot2 package. ggplot factorises variables and automatically orders them on the plot by alphabetical order.
This can cause your groups to be ordered incorrectly on the plot axes (e.g. a time series may not be plotted sequentially).
You can manually set the order of your variable here.
First have a look at how ggplot will order your variable.
levels(factor(ampvisdata$metadata[[group]]))
If these are in the order you want to see them on your plot axes, nothing needs to be done. If they are in the wrong order you need to order them manually by setting the levels.
Choose how you want to order your groups here:
lev <- c("0h", "2h", "4h", "6h", "8h", "16h", "24h", "32h", "36h", "48h", "60h", "72h", "96h", "120h")
To order your variable you need to put all the variable levels into the lev = c(..)
. Make sure each level is in double quotes and separated by a comma.
Then run the following to apply the levels to your data:
ampvisdata$metadata[[group]] <- factor(ampvisdata$metadata[[group]], levels = lev)