Overview

In this section we’re going to:

Install and load the R packages we need to run the analysis
Import our taxonomic abundance table into R
View a summary of our abundance table

We’re going to be running various commands in R. To do this, copy and paste the code into the R script you created, highlight all the code you want to run, then press the run button:

Install required R packages

Copy and paste the following code into the R script you just created, then run the code. This will install all the required packages and dependencies and may take 45 minutes or more to complete. It may prompt you occasionally to update packages - select 'a' for all if/when this occurs.

NOTE: you only need to run this section once if you’re running this analysis on your own laptop/PC, and you don’t need to run it if you’re using an rVDI machine as all the packages are already installed.

#### Metagenomics analysis ####

# When you see '## USER INPUT', this means you have to modify the code for your computer or dataset. All other code can be run as-is (i.e. you don't need to understand the code, just run it)

#### 1. Installing required packages ####

# **NOTE: this section only needs to be run once (or occasionally to update the packages)
# Install devtools
install.packages("devtools", repos = "http://cran.us.r-project.org")
# Install R packages. This only needs to be run once.

# Make a vector of CRAN and Bioconductor packages
bioconductor_packages <- c("VariantAnnotation", "biomaRt", "clusterProfiler", "org.Hs.eg.db")
cran_packages <- c("devtools", "tidyverse", "DT", "gt", "openxlsx", "dplyr", "scales", "ggplot2", "plotly", "tidyr", "ggsci", "viridis", "vcfR", "data.table", "remotes")

# Compares installed packages to above packages and returns a vector of missing packages
new_packages <- bioconductor_packages[!(bioconductor_packages %in% installed.packages()[,"Package"])]
new_cran_packages <- cran_packages[!(cran_packages %in% installed.packages()[,"Package"])]

# Install missing bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(new_packages)

# Install missing cran packages
if (length(new_cran_packages)) install.packages(new_cran_packages, repos = "http://cran.us.r-project.org")

# Update all installed packages to the latest version
update.packages(bioconductor_packages, ask = FALSE)
update.packages(cran_packages, ask = FALSE, repos = "http://cran.us.r-project.org")

# Install ampvis2 (needs to be installed from Github)
remotes::install_github("kasperskytte/ampvis2")

Load required R packages

This section loads the packages you’ve installed in the previous section. Unlike installing packages, this needs to be run every time and should only take a few seconds to run.

#### 2. Loading required packages ####

# This section needs to be run every time
# Load packages
bioconductor_packages <- c("VariantAnnotation", "biomaRt", "clusterProfiler", "org.Hs.eg.db")
cran_packages <- c("devtools", "tidyverse", "DT", "gt", "openxlsx", "dplyr", "scales", "ggplot2", "plotly", "tidyr", "ggsci", "viridis", "vcfR", "data.table", "remotes")
lapply(cran_packages, require, character.only = TRUE)
lapply(bioconductor_packages, require, character.only = TRUE)
library(ampvis2)

Set your working directory

‘Working directory’ is an important concept in R. It defines where R automatically looks for data files and where it outputs results (tables, figures, etc).

To set your working directory, click ‘Session’ → ‘Set working directory’ → ‘Choose working directory’ and then choose the H:/meta_workshop/R_analysis directory.

From Jupyter Notebooks

Table of DADA2 filtration

The following table shows how many reads were kept and removed. Percentage columns are the filtered, merged, and non-chimeric reads compared to the number of original, unfiltered reads. Thus the number of non-chimeric reads and percentage is the number and % of reads remaining after all filtration steps.

First, choose your working directory. This is the directory where your ampliseq results directories are. See the next section, Alpha diversity, for more details on this.

setwd("~/Mahsa_paper_1")

Then, import the DADA2 stat and create the percentage columns:

dada <- read.table("results/abundance_table/unfiltered/dada_stats.tsv", sep = "\t", header = T)
dada$percentage.of.input.passed.filter <- paste0(round(dada$percentage.of.input.passed.filter, 2), "%")
dada$percentage.of.input.denoised <- paste0(round(dada$percentage.of.input.denoised, 2), "%")
dada$percentage.of.input.non.chimeric <- paste0(round(dada$percentage.of.input.non.chimeric, 2), "%")
colnames(dada) <- c("Sample", "Unfiltered_reads", "Filtered", "%", "Denoised", "%", "Non-chimeric", "%")

To generate the table we'll use the DT: datatables package. This creates a table that is searchable, columns can be ordered, table can be exported as a csv or Excel file, etc.

DT::datatable(dada, rownames = F,
              width = "100%",
              extensions = 'Buttons',
              options = list(scrollX = TRUE,
                             dom = 'Bfrtip',
                             columnDefs = list(list(className = 'dt-center', targets="_all")),
                             buttons =
          list('copy', 'print', list(
            extend = 'collection',
            buttons = list(
                list(extend = 'csv', filename = "DADA_filtration"),
                list(extend = 'excel', filename = "DADA_filtration"),
                list(extend = 'pdf', filename = "DADA_filtration")),
            text = 'Download'
          ))
      )
    )

Stacked bar plot of DADA2 filtration

Now we'll plot the above results (filtered vs unfiltered reads) on a stacked barplot.

We'll do this using the ggplot package.

Import the data:

dada_bp <- data.frame(dada$Sample, dada$Unfiltered_reads - dada$`Non-chimeric`, dada$`Non-chimeric`)
colnames(dada_bp) <- c("Sample", "Removed reads", "Remaining reads")
df <- tidyr::pivot_longer(dada_bp, cols =c("Removed reads", "Remaining reads"))
colnames(df) <- c("Sample", "Reads", "Read_count")
options(scipen = 999)

Now plot this:

options(repr.plot.width=12, repr.plot.height=8)
p <- ggplot(df, aes(Sample, Read_count, fill=Reads)) + geom_bar(stat = "identity", position = 'stack')
p <- p + scale_fill_brewer(palette="Dark2") + ylab("Read count") + theme_bw() +
  theme(text = element_text(size = 14), axis.text.x = element_text(angle = 90, size = 10))
p

You can save your plot as a 300dpi (i.e. publication quality) tiff or pdf file. These files can be found in your working directory.

Tip: you can adjust the width and height of the saved images by changing width = and height = in the code below. Pdf files can be opened within Jupyter, so a good way to test a suitable width/height would be to save the image by running the pdf code below with the default 20cm width/height, then open the pdf file by clicking on it in the file browser panel (to the left of this notebook), then change the width/height and repeat this process as needed.

Export as a 300dpi tiff

tiff_exp <- "DADA2_summary.tiff"
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

pdf_exp <- "DADA2_summary.pdf"
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

3. Setting up your R environment