Overview
Table of Contents | ||
---|---|---|
|
From RNA-Seq workshop
2. Install required R packages
Copy and paste the following In this section we’re going to:
Install and load the R packages we need to run the analysis
Import our taxonomic abundance table into R
View a summary of our abundance table
We’re going to be running various commands in R. To do this, copy and paste the code into the R script you just created, highlight all the code you want to run, then press the run button:
...
Install required R packages
Copy and paste the following code (highlight all the code in your into the R script you just created, then press run the run button)code. This will install all the required packages and dependencies and may take 30 45 minutes or more to complete. It may prompt you occasionally to update packages - select 'a' for all if/when this occurs.
...
NOTE: you only need to run this section once on any if you’re running this analysis on your own laptop/PC, and you don’t need to run it if you’re using an rVDI machine as all the packages are already installed.
Code Block |
---|
#### DifferentialMetagenomics expression analysis #### # When you see '## USER INPUT', this means you have to modify the code for your computer or dataset. All other code can be run as-is (i.e. you don't need to understand the code, just run it) #### 21. Installing required packages #### # **NOTE: this section only needs to be run once (or occasionally to update the packages) # Install devtools install.packages("devtools", repos = "http://cran.us.r-project.org") # Install R packages. This only needs to be run once. # Variannt calling Make a vector of CRAN and Bioconductor packages bioconductor_packages <- c("VariantAnnotation", "biomaRt", "clusterProfiler", "org.Hs.eg.db") cran_packages <- c("devtools", "tidyverse", "DT", "gt", "openxlsx", "dplyr", "scales", "ggplot2", "plotly", "tidyr", "ggsci", "viridis", "vcfR", "data.table") # Matagenomics packages # CRAN: library(tidyverse) library(scales) install.packages(, "remotes", verbose = F) remotes::install_github("MadsAlbertsen/ampvis2", quiet = T) library(ampvis2) # Bioconductor: bioconductor_packages <- c("DESeq2", "EnhancedVolcano", "org.Hs.eg.db", "org.Mm.eg.db", "org.Rn.eg.db", "org.EcK12.eg.db", "org.EcSakai.eg.db", "org.Dr.eg.db", "org.Dm.eg.db") cran_packages <- c("ggrepel", "ggplot2", "plyr", "reshape2", "readxl", "FactoMineR", "factoextra", "pheatmap") "vegan") # Compares installed packages to above packages and returns a vector of missing packages new_packages <- bioconductor_packages[!(bioconductor_packages %in% installed.packages()[,"Package"])] new_cran_packages <- cran_packages[!(cran_packages %in% installed.packages()[,"Package"])] # Install missing bioconductor packages if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(new_packages) # Install missing cran packages if (length(new_cran_packages)) install.packages(new_cran_packages, repos = "http://cran.us.r-project.org") # Update all installed packages to the latest version update.packages(bioconductor_packages, ask = FALSE) update.packages(cran_packages, ask = FALSE, repos = "http://cran.us.r-project.org") |
...
# Install ampvis2 (needs to be installed from Github)
remotes::install_github("kasperskytte/ampvis2")
|
Load required R packages
This section loads the packages you’ve installed in the previous section. Unlike installing packages, this needs to be run every time and should only take a few seconds to run.
Code Block |
---|
#### 32. Loading required packages #### # This section needs to be run every time # Load packages bioconductor_packages <- c("DESeq2VariantAnnotation", "biomaRt", "EnhancedVolcanoclusterProfiler", "org.Hs.eg.db") cran_packages <- c("devtools", "org.Mm.eg.dbtidyverse", "org.Rn.eg.dbDT", "org.EcK12.eg.dbgt", "org.EcSakai.eg.dbopenxlsx", "org.Dr.eg.dbdplyr", "org.Dm.eg.db") cran_packages <- c("ggrepelscales", "ggplot2", "plotly", "ggplot2tidyr", "plyrggsci", "reshape2viridis", "readxlvcfR", "FactoMineRdata.table", "factoextraremotes", "pheatmapvegan") lapply(cran_packages, require, character.only = TRUE) lapply(bioconductor_packages, require, character.only = TRUE) |
4. Import your data files into R
In this section, we will import your count table and samples table into R.
You’ll need to change the ‘setwd
' line to your working directory. Click
library(ampvis2)
|
Set your working directory
‘Working directory’ is an important concept in R. It defines where R automatically looks for data files and where it outputs results (tables, figures, etc).
To set your working directory, click ‘Session’ → ‘Set working directory’ → ‘Choose working directory’ and then choose the analysis workshop directory you created previously that contains your R script file and the 'Data’ directory.
Code Block |
---|
#### 4. Import your count data ####
# Make sure you have: a) your count table (salmon.merged.gene_counts.tsv file, if you used Nextflow nfcore/rnaseq to analyse your data). Copy this to a subdirectory called 'data'. b) your metadata file. This should be either an Excel file called 'metadata.xlsx' or a tab-separated text file called 'metadata.txt'. It needs 3 columns called 'sample_name', 'sample_ID' and 'group'. The sample names should be EXACTLY the same as the names in the count table. These names are often uninformative and long, so the 'sample_ID' is the sample labels you want to put on your plots. E.g. if you have a 'high fat' group, you might want to rename the samples HF1, HF2, HF3, etc)
## USER INPUT
# Set working directory.
# Change this to your working directory (In the RStudio menu: Session -> Set working directory -> Choose working directory)
setwd("H:/workshop/RNAseq")
# Import your count data. make sure you've created a 'data' subdirectory and put the count table file there.
metacountdata <- read.table("./data/salmon.merged.gene_counts.tsv", header = TRUE, row.names = 1)
# Import metadata. Again, need a metadata.xlsx file in the data subdirectory.
meta <- read_excel("./data/metadata.xlsx")
# Remove 1st columns of metadata (gene_name)
counts <- metacountdata[ ,2:ncol(metacountdata)]
# Rename sample names to new sample IDs
counts <- counts[as.character(meta$sample_name)]
colnames(counts) <- meta$sample_ID
# Counts need to be rounded to integers
counts <- ceiling(counts) |
From Jupyter Notebooks
Table of DADA2 filtration
The following table shows how many reads were kept and removedH:/meta_workshop/R_analysis directory.
Summary of DADA2 quality filtration
DADA2 is the main tool used by the nfcore/ampliseq workflow to examine taxonomy in amplicon datasets. It initially completes a series of quality filtration steps, including quality filtration, denoising and chimeric read removal.
As an initial task, we’re going to generate a table and figure that shows the percentage of reads were kept and removed in each filtration step. Percentage columns are the filtered, merged, and non-chimeric reads compared to the number of original, unfiltered reads. Thus the number of non-chimeric reads and percentage is the number and % of reads remaining after all filtration steps.
First, choose your working directory. This is the directory where your ampliseq results directories are. See the next section, Alpha diversity, for more details on this.
Code Block |
---|
setwd("~/Mahsa_paper_1") |
Then, import the DADA2 stat and create the percentage columns:
...
Copy, paste and run the following into your RStudio script:
Code Block |
---|
#### 2. Summary of DADA2 quality filtration #### ## Make a summary table # Import the DADA2 filtration summary information dada <- read.table("../illumina/results/abundance_table/unfiltered/dadadada2/dada2_stats.tsv", sep = "\t", header = T) # Calculate the filtration percentages and add as new columns dada$percentage.of.input.passed.filter <- paste0(round(dada$percentage.of.input.passed.filter, ((dada$filtered/dada$DADA2_input)*100,2), "%") dada$percentage.of.input.denoised <- paste0(round(dada$percentage.of.input.denoised, ((dada$denoised/dada$DADA2_input)*100,2), "%") dada$percentage.of.input.non.chimeric <- paste0(round(dada$percentage.of.input.non.chimeric, ((dada$nonchim/dada$DADA2_input)*100,2), "%") # Rearrange the columns and change the column names dada <- dada[c(1,2,3,6,4,7,5,8)] colnames(dada) <- c("Sample", "Unfiltered_reads", "Filtered", "%", "Denoised", "%", "Non-chimeric", "%") |
To generate the table we'll use the DT: datatables package. This creates a table that is searchable, columns can be ordered, table can be exported as a csv or Excel file, etc.
Code Block |
---|
DT::datatable(dada, rownames =# F,Export as a csv width = "100%", extensions = 'Buttons', options = list(scrollX = TRUE, dom = 'Bfrtip', columnDefs = list(list(className = 'dt-center', targets="_all")), buttons = list('copy', 'print', list( extend = 'collection', buttons = list( list(extend = 'csv', filename = "DADA_filtration"), list(extend = 'excel', filename = "DADA_filtration"), list(extend = 'pdf', filename = "DADA_filtration")), text = 'Download' )) ) ) file write.csv(dada, "DADA2_filtration.csv") |
The final line in the above code box will write out the results as a table called DADA2_filtration.csv
. It will write this out in your working directory (H:/meta_workshop/R_analysis). You can go there in Windows File Explorer and open the table in Excel.
Stacked bar plot of DADA2 filtration
Now we'll plot the above DADA2 filtration results (filtered vs unfiltered reads) on as a stacked barplot.
We'll do this using the ggplot package.
Import the dataRun the following in your R script:
Code Block |
---|
## Plot the summary results # Pull out the required columns dada_bp <- data.frame(dada$Sample, dada$Unfiltered_reads - dada$`Non-chimeric`, dada$`Non-chimeric`) colnames(dada_bp) <- c("Sample", "Removed reads", "Remaining reads") df <- tidyr::pivot_longer(dada_bp, cols =c("Removed reads", "Remaining reads")) colnames(df) <- c("Sample", "Reads", "Read_count") options(scipen = 999) |
Now plot this:
Code Block |
---|
options(repr.plot.width=12, repr.plot.height=8) # Plot the DADA2 filtration results p <- ggplot(df, aes(Sample, Read_count, fill=Reads)) + geom_bar(stat = "identity", position = 'stack') p <- p + scale_fill_brewer(palette="Dark2") + ylab("Read count") + theme_bw() + theme(text = element_text(size = 14), axis.text.x = element_text(angle = 90, size = 10)) p |
The above will display the plot in RStudio.
You can also save your plot as a 300dpi (i.e. publication quality) tiff or pdf file. These files can be found in your working directory.
Tip: you can adjust the width and height of the saved images by changing width =
and height =
in the code below. Pdf files can be opened within Jupyter, so a good way to test a suitable width/height would be to save the image by running the pdf code below with the default 20cm width/height, then open the pdf file by clicking on it in the file browser panel (to the left of this notebook), then change the width/height and repeat this process as needed.
Export as a 300dpi tiff
...
Code Block |
---|
# Export as a 300dpi tiff tiff_exp <- "DADA2_summary.tiff" ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") |
...
# Export as a pdf |
...
pdf_exp <- "DADA2_summary.pdf" ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") |
Go to the next section: Alpha diversity