Table of Contents

Aim:

Identify statistically significant (FDR < 0.05) differentially expressed genes. Visualise results with PCA plots, heatmaps and volcano plots.

...

To access this count table:

Go to the /sandpit/demo/run3_full_pipeline/ folder, which contains the results from running the nfcore/rnaseq pipeline. The output folders from task 3 look like this:

...

Now let’s find the full path to the ‘salmon.merged.gene_counts.tsv’ file:

Windows:
- see: https://www.lifewire.com/change-directories-in-command-prompt-5185508
Mac:
- cd /folder/that/contains/feature_counts/
- pwd
RStudio:
- Open RStudio, go to the top bar and click on “Session” → “Set Working Directory” → “Choose Directory”
- The path to the directory will be printed in the RStudio console. Copy it and paste it into line 1 of the script ‘RNAseq_DESeq2_analysis.R’ (see below)
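If you would rather locate the file from within R, the search can be sketched as below. This is a minimal illustration using base R only. The helper name `find_count_table()` and the placeholder folder in the usage comment are hypothetical and are not part of the original script.

```r
# Hypothetical helper: search a results folder (and all of its subfolders)
# for the merged Salmon gene count table and return its full path(s).
find_count_table <- function(results_dir,
                             file_name = "salmon.merged.gene_counts.tsv") {
  # List every file below results_dir, then keep exact file name matches
  hits <- list.files(results_dir, recursive = TRUE, full.names = TRUE)
  hits <- hits[basename(hits) == file_name]
  normalizePath(hits)
}

# Example usage (replace the placeholder with your own results folder):
# find_count_table("/path/to/your/results")
```

An empty result means no file with that name exists below the folder you supplied.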

...

1. Preparing your data. Two data files are needed: a samples table and your count table.
2. Install required R packages (this only needs to be run once). After installation we only need to load the packages. NOTE: If using an rVDI virtual machine, the R packages are already installed.
3. Load required R packages. Unlike installing the packages, this needs to be done every time you run the analysis.
4. Import your data files (count table and samples table) into R
5. Checking for outliers and batch effects
5a. PCA plot
5b. Pairwise samples heatmap
6. Identify differentially expressed (DE) genes using DESeq2
6a. Annotating your DE genes
6b. Volcano plot
6c. DE genes heatmap

1. Preparing your data

You will need 2 data files to complete this analysis: your count table (see above) and a samples table.

...

A copy of the full script is at /demo/run3_full_pipeline/

2. Install required R packages

Copy and paste the following code into the R script you just created, then run the code (highlight all the code in your R script, then press the Run button). This will install all the required packages and dependencies and may take 30 minutes or more to complete. It may occasionally prompt you to update packages; select 'a' for all.

...

Code Block

#### 3. Loading required packages ####

# This section needs to be run every time
# Load packages
bioconductor_packages <- c("DESeq2", "EnhancedVolcano", "org.Hs.eg.db", "org.Mm.eg.db", "org.Rn.eg.db", "org.EcK12.eg.db", "org.EcSakai.eg.db", "org.Dr.eg.db", "org.Dm.eg.db")
cran_packages <- c("ggrepel", "ggplot2", "plyr", "reshape2", "readxl", "FactoMineR", "factoextra", "pheatmap")
lapply(cran_packages, require, character.only = TRUE)
lapply(bioconductor_packages, require, character.only = TRUE)
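Because `require()` only returns FALSE (with a warning) when a package is missing, a failed load is easy to overlook in the output above. A minimal sketch of an explicit check is shown below. The helper name `missing_packages()` is hypothetical, and the package list is a shortened copy of the one used in this section.

```r
# Hypothetical helper: return the names of packages that are NOT installed.
# requireNamespace() returns TRUE/FALSE without attaching the package.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("DESeq2", "EnhancedVolcano", "ggplot2", "FactoMineR",
              "factoextra", "pheatmap", "readxl")
not_installed <- missing_packages(required)

if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

If anything is reported as not installed, re-run the installation step in section 2 before continuing.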

4. Import your data files into R

In this section we will import your count table and samples table into R.

...

Code Block

#### 4. Import your count data ####

# Make sure you have:
# a) your count table (the salmon.merged.gene_counts.tsv file, if you used Nextflow nfcore/rnaseq to analyse your data). Copy this to a subdirectory called 'data'.
# b) your metadata file. This should be either an Excel file called 'metadata.xlsx' or a tab-separated text file called 'metadata.txt'. It needs 3 columns called 'sample_name', 'sample_ID' and 'group'. The sample names should be EXACTLY the same as the names in the count table. These names are often long and uninformative, so 'sample_ID' holds the sample labels you want to put on your plots (e.g. if you have a 'high fat' group, you might want to rename the samples HF1, HF2, HF3, etc.).

## USER INPUT
# Set working directory. 
# Change this to your working directory (In the RStudio menu: Session -> Set Working Directory -> Choose Directory)
setwd("C:/Users/whatmorp/OneDrive - Queensland University of Technology/Desktop/Projects/RNA-Seq downstream analysis")

# Import your count data. make sure you've created a 'data' subdirectory and put the count table file there.
metacountdata <- read.table("./data/salmon.merged.gene_counts.tsv", header = TRUE, row.names = 1)

# Import metadata. Again, the metadata.xlsx file needs to be in the 'data' subdirectory.
meta <- read_excel("./data/metadata.xlsx")

# Remove the 1st column of the count table (gene_name), leaving only the sample count columns
counts <- metacountdata[ ,2:ncol(metacountdata)]

# Rename sample names to new sample IDs
counts <- counts[as.character(meta$sample_name)]
colnames(counts) <- meta$sample_ID

# DESeq2 requires integer counts, so round the (fractional) Salmon estimated counts up to the nearest integer
counts <- ceiling(counts)
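Most import problems come from a mismatch between the metadata table and the count table. The check below is a minimal sketch of how this could be tested straight after importing `metacountdata` and `meta`, before the samples are renamed. The helper name `check_metadata()` and the toy tables are hypothetical. Note that `read.table()` can alter column names that are not valid R names (for example by prefixing an X to names that start with a digit), which is one common cause of a mismatch.

```r
# Hypothetical helper: stop with an informative message if the metadata
# table does not match the count table.
check_metadata <- function(meta, counts) {
  required_cols <- c("sample_name", "sample_ID", "group")
  missing_cols <- setdiff(required_cols, colnames(meta))
  if (length(missing_cols) > 0) {
    stop("Metadata is missing column(s): ",
         paste(missing_cols, collapse = ", "))
  }
  # Every sample name in the metadata must be a column of the count table
  unmatched <- setdiff(as.character(meta$sample_name), colnames(counts))
  if (length(unmatched) > 0) {
    stop("Sample name(s) not found in count table: ",
         paste(unmatched, collapse = ", "))
  }
  # Sample IDs become column names, so they must be unique
  if (anyDuplicated(meta$sample_ID) > 0) {
    stop("Values in 'sample_ID' must be unique.")
  }
  invisible(TRUE)
}

# Toy example (made-up data) that passes all checks
toy_counts <- data.frame(gene_name = c("g1", "g2"),
                         S1_long = c(10.2, 0),
                         S2_long = c(3.7, 8.1),
                         row.names = c("id1", "id2"))
toy_meta <- data.frame(sample_name = c("S1_long", "S2_long"),
                       sample_ID = c("HF1", "HF2"),
                       group = c("high_fat", "high_fat"))
check_metadata(toy_meta, toy_counts)
```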

5. Checking for outliers and batch effects

In this section we will create PCA plots and heatmaps to examine the relationships between samples. Outlier samples and batch effects can heavily bias your results and should be addressed (e.g. removal of outlier samples from the dataset) before any differential expression analysis is completed.

...

Code Block

#### 5. Outliers and batch effects ####

# This section normalises and transforms the count data so that it can be plotted on a PCA plot and a heatmap

## USER INPUT
# Choose the groups you want to plot in a PCA/Heatmap. You can select any 2 or more of the groups (or all of the groups) you have in the 'group' column of your metadata table
# To see what groups are present, run the following:
unique(meta$group)
# Now add which groups you want to plot (i.e. replace the groupnames below, and add more, separated by a comma and in "quotes", as needed). NOTE: R is case-sensitive, so these group names must be named EXACTLY the same as in the metadata table.
plotgroups <- c("Differentiated_cells", "Basal_cells")

# Pull out only the counts from the above groups
groupcounts <- counts[meta$group %in% plotgroups]

# Normalise counts by library size, using DESeq2's estimateSizeFactors() function. Note that DESeq2 does this internally during DEG calling. The normalisation below is done separately for PCA and density plotting.
# Set up the initial DESeq2 experimental parameters.
condition <- factor(1:length(groupcounts))
# Set up the column data. A data frame of sample IDs and conditions
coldata <- data.frame(row.names=colnames(groupcounts), condition)
# Set up the DESeq2 data set structure
f <- DESeqDataSetFromMatrix(countData = groupcounts, colData = coldata, design= ~ condition)
# Estimate the size factors. See DESeq2 manual for details
f <- estimateSizeFactors(f)
# Size factors can be viewed by: sizeFactors(f)

# Divide each column (sample) by the corresponding size factor. Multiplying by diag(1 / size factors) rescales each column.
subcount_norm <- as.matrix(groupcounts) %*% diag(1 / sizeFactors(f))
# Re-add column names
colnames(subcount_norm) <- colnames(groupcounts)

## Remove low coverage transcripts (mean count < 10) ##

# Find the mean of each row (and output as a data frame object)
means <- as.data.frame(rowMeans(subcount_norm))
# Then join the means data with the counts
means <- cbind(means, subcount_norm)
# Then subset out only genes with mean > 10
data <- subset(means, means[ , 1] > 10)
# Remove the means column
data <- data[,-1]

# Transform data
data_log <- vst(round(as.matrix(data)))
# Transformation can create some infinite values. Can't generate PCA data on these. Can see how many by: sum(is.infinite(data_log))
# To remove infinite rows, use 'is.finite' or '!is.infinite'
data_log <- data_log[is.finite(rowSums(data_log)),]
colnames(data_log) <- colnames(groupcounts)

### Set up the PCA plot base data ###

# We're using the FactoMineR package to generate PCA plots (http://factominer.free.fr/index.html)

# Need to transpose the data first
data_log_t <- t(data_log)
# Add the group data
data_log_t_vars <- data.frame(meta$group[meta$group %in% plotgroups], data_log_t)
# Generate the PCA data using FactoMineR package
res.pca <- PCA(data_log_t_vars, quali.sup = 1, graph=FALSE)

## Set up the dendrogram/heatmaps base data ##

# Calculate the distance matrix:
distance_matrix <- as.matrix(dist(t(data_log)))
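To make the library-size normalisation step more concrete, here is a minimal base R sketch of the median-of-ratios idea that DESeq2's `estimateSizeFactors()` uses by default. It is an illustration on made-up counts, not a replacement for the DESeq2 function, and the helper name `size_factors()` is hypothetical. The key point is that normalised counts are the raw counts divided by each sample's size factor.

```r
# Median-of-ratios size factors for a genes x samples count matrix.
size_factors <- function(count_matrix) {
  log_counts <- log(count_matrix)
  # Log geometric mean of each gene across samples. Genes with a zero count
  # in any sample give -Inf and are excluded from the calculation.
  log_geo_means <- rowMeans(log_counts)
  keep <- is.finite(log_geo_means)
  # Size factor = median ratio of a sample's counts to the geometric means
  apply(log_counts[keep, , drop = FALSE], 2,
        function(sample_log) exp(median(sample_log - log_geo_means[keep])))
}

# Toy data: sample B was sequenced exactly twice as deeply as sample A
toy <- cbind(A = c(10, 20, 30, 40), B = c(20, 40, 60, 80))
rownames(toy) <- paste0("gene", 1:4)

sf <- size_factors(toy)
sf

# Divide each column (sample) by its size factor
toy_norm <- sweep(toy, 2, sf, "/")
toy_norm
```

After normalisation the two toy samples have the same values, because the only difference between them was sequencing depth.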

5a. PCA plot

Now you can run the following code in your R script to generate the PCA plot.

...

Code Block

#### 5a. PCA plot ####

# Generate the PCA plot. Groups are shaded with ellipses at the 95% confidence level. NOTE: at least 4 replicates need to be in a group for an ellipse to be drawn.
# NOTE: change the group point colours by changing 'palette = ' below. Use the RColorBrewer palette names (https://r-graph-gallery.com/38-rcolorbrewers-palettes.html). For example, if you are plotting 3 groups and choose palette = "Set1", this will use the first 3 colours from the Set1 colour palette.
p <- fviz_pca_ind(res.pca,
                  geom.ind = "point", # show points only (but not "text")
                  col.ind = meta$group[meta$group %in% plotgroups], # color by groups
                  pointsize = 5, title = "", legend.title = "Treatment groups", palette = "Dark2",
                  addEllipses = TRUE, ellipse.type = "t", ellipse.level = 0.95) + theme(legend.text = element_text(size = 12), legend.title = element_text(size = 14), axis.title=element_text(size=16), axis.text=element_text(size=14))

p

# Output as publication quality (300dpi) tiff and pdf.
# This will name your output files with the treatment groups you selected.

# Create a 'figures' subdirectory where all figures will be output
dir.create("figures", showWarnings = FALSE)
# Create a (300dpi) tiff
ggsave(file = paste0("./figures/PCA_", paste(plotgroups, collapse = "_Vs_"), ".tiff"), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p)
# Create a pdf
ggsave(file = paste0("./figures/PCA_", paste(plotgroups, collapse = "_Vs_"), ".pdf"), device = "pdf", width = 10, height = 8, plot = p)
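When interpreting a PCA plot it helps to know how much of the total variance each axis represents. The sketch below shows the calculation with base R's `prcomp()` on simulated data. This is a swapped-in technique for illustration only: the workflow above uses FactoMineR's `PCA()`, and the toy matrix here is made up. `scale. = TRUE` is used because `PCA()` scales variables to unit variance by default.

```r
set.seed(1)
# Toy transformed expression matrix: 50 genes x 6 samples (simulated data)
toy_expr <- matrix(rnorm(300), nrow = 50,
                   dimnames = list(paste0("gene", 1:50), paste0("S", 1:6)))
# Shift the last three samples for the first 10 genes to create two groups
toy_expr[1:10, 4:6] <- toy_expr[1:10, 4:6] + 5

# Samples must be in rows, so transpose first (as in the workflow above)
pca <- prcomp(t(toy_expr), center = TRUE, scale. = TRUE)

# Percentage of variance explained by each principal component
var_explained <- 100 * pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 1)

# Sample coordinates on the first two components
pca$x[, 1:2]
```

If the first component explains most of the variance and separates your treatment groups, the treatment is the dominant source of variation in the data.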

5b. Pairwise samples heatmap

Now generate the heatmap and dendrogram.

...

Code Block

#### 5b. Samples heatmap and dendrogram ####

# This section plots a heatmap and dendrogram of pairwise relationships between samples. In this way you can see if samples cluster by treatment group.

# See here: https://davetang.org/muse/2018/05/15/making-a-heatmap-in-r-with-the-pheatmap-package/

# Define annotation column
annot_columns <- data.frame(meta$group[meta$group %in% plotgroups])
# Make the row names the sample IDs
row.names(annot_columns) <- meta$sample_ID[meta$group %in% plotgroups]
colnames(annot_columns) <- "Treatment groups"
# Need to factorise it
annot_columns[[1]] <- factor(annot_columns[[1]])

# Generate dendrogram and heatmap
pheatmap(distance_matrix, color=colorRampPalette(c("white", "#9999FF", "#990000"))(50), cluster_rows = TRUE, show_rownames = TRUE, treeheight_row = 0, treeheight_col = 70, fontsize_col = 12, annotation_names_col = F, annotation_col = annot_columns, filename = paste0("./figures/Pairwise_sample_heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff"))

# Notes about heatmap colours.
# You can change the colours used in the heatmap itself by changing the colour names (color=colorRampPalette....)
# If you want to change the annotation colours, see here: https://zhiganglu.com/post/pheatmap_change_annotation_colors/

6. Identify differentially expressed (DE) genes using DESeq2

Now we come to the main analysis section of this workflow, where we will identify differentially expressed genes. This will generate two main values per gene: the log2 fold change, which shows the change in expression level between two treatment groups for a specific gene, and the adjusted p value, which is the p value corrected for multiple testing (to control the false discovery rate) and is used to decide which genes are significantly differentially expressed.

...

Code Block

#### 6. Differential expression analysis ####

# In this section we use the DESeq2 package to identify differentially expressed genes.

## USER INPUT
# Choose the treatment groups you want to compare. 
# To see what groups are present, run the following:
unique(meta$group)
# Enter which groups you want to compare (two groups only). BASELINE OR CONTROL GROUP SHOULD BE LISTED FIRST.
degroups <- c("Basal_cells", "Differentiated_cells")

# From the count table, pull out only the counts from the above groups
expdata <- as.matrix(counts[,meta$group %in% degroups])

# Set up the experimental condition
# 'factor' sets up the reference level, i.e. which is the baseline group (otherwise the default baseline level is in alphabetic order)
condition <- factor(meta$group[meta$group %in% degroups], levels = degroups)
# Type 'condition' in the console to see if the levels are set correctly

# Set up column data (treatment groups and sample ID)
coldata <- data.frame(row.names=colnames(expdata), condition)

# Create the DESeq2 dataset (dds)
dds <- DESeq2::DESeqDataSetFromMatrix(countData=expdata, colData=coldata, design=~condition)
dds$condition <- factor(dds$condition, levels = degroups)

# Run DESeq2 to identify differentially expressed genes
deseq <- DESeq(dds)

# Extract a results table from the DESeq analysis
res <- results(deseq)

# Reorder results by adjusted p values, so that the most significantly DE genes are at the top
res <- res[order(res$padj), ]

# You can do a summary of the results to see how many significantly (alpha=0.05, adjust to 0.01 if needed) upregulated and downregulated DE genes were found
summary(res, alpha=0.05)

# Convert from DESeq object to a data frame.
res <- data.frame(res)

# Look at the top 6 DE genes
head(res)
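The two result columns used most often, `log2FoldChange` and `padj`, can be illustrated with base R on made-up numbers. DESeq2 adjusts p values with the Benjamini-Hochberg method by default, which is what `p.adjust(method = "BH")` computes. The fold change shown here is a plain ratio of means, whereas DESeq2 estimates it from a statistical model, so treat this only as a sketch of what the values mean.

```r
# Made-up raw p values for five genes
pvalues <- c(geneA = 0.001, geneB = 0.01, geneC = 0.02,
             geneD = 0.03, geneE = 0.5)

# Benjamini-Hochberg adjustment (controls the false discovery rate)
padj <- p.adjust(pvalues, method = "BH")
padj

# Genes that pass an FDR cutoff of 0.05
sig_genes <- names(padj)[padj < 0.05]
sig_genes

# log2 fold change from made-up mean normalised counts
mean_control <- 50
mean_treated <- 200
lfc <- log2(mean_treated / mean_control)
lfc  # 2, i.e. a four-fold increase relative to the control group
```

A log2 fold change of 1 is a doubling, -1 is a halving, and 0 means no change.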

6a. Annotating your DE genes

The table of DE genes will have gene identifiers according to the reference genome that was used to generate the count table. This is usually an Entrez gene ID if the NCBI genome was used, or an Ensembl gene ID if the Ensembl genome was used.

...

Code Block

#### 6a. Annotating your DE genes ####

# Annotation packages for human (org.Hs.eg.db), mouse (org.Mm.eg.db), rat (org.Rn.eg.db), E. coli strain K12 (org.EcK12.eg.db), E. coli strain Sakai (org.EcSakai.eg.db), zebrafish (org.Dr.eg.db) and Drosophila (org.Dm.eg.db) were installed with the required R packages. If your species is not in the above list, contact the eResearch team.

## USER INPUT
# Input your species genome below (select from the above list)
my_genome <- org.Mm.eg.db

# Pull out just the Entrez/Ensembl gene IDs
gene_ids <- row.names(res)

# You can see the list of annotations you can apply to your data by:
keytypes(my_genome)

# Annotate gene symbol and description to your DE gene IDs (you can add other keytypes from the above 'keytypes(my_genome)' list, if you choose)
cols <- c("SYMBOL", "GENENAME")

## USER INPUT
# Provide the gene ID type for your DE data
# If you have Ensembl IDs, enter "ENSEMBL", if you have Entrez IDs, enter "ENTREZID", if you have gene symbols, enter "SYMBOL"
idtype <- "ENSEMBL"

# Map the Entrez/Ensembl gene IDs to gene symbol and description
# NOTE: the gene ID type is taken from 'idtype' above, so if your gene IDs are Entrez IDs, set idtype <- "ENTREZID" there
map <- AnnotationDbi::select(my_genome, keys=gene_ids, columns=cols, keytype=idtype)

# Combine the annotation data with the DE data
# The mapping between gene IDs and annotations isn't always 1:1, so we need to merge rather than cbind
annot <- merge(x = res, y = map, by.x = 0, by.y = idtype, all = F)
# There isn't always a 1:1 mapping between gene identifiers, so we also need to remove duplicates
annot_nodups <- annot[!duplicated(annot$Row.names), ]
# Reorder by adjusted p
annot_nodups <- annot_nodups[order(annot_nodups$padj), ]

# Add normalised counts to the output table. This is so you can later plot expression trends for individual genes in R, Excel, etc.
# Need to normalise the counts first, by dividing by the size factors calculated by DESeq2 (in the 'deseq' object)
expdata_norm <- as.matrix(expdata) %*% diag(1 / deseq$sizeFactor)
colnames(expdata_norm) <- colnames(expdata)
annot_counts <- merge(x = annot_nodups, y = expdata_norm, by.x = "Row.names", by.y = 0, all = TRUE)

# Export the full annotated dataset as a csv (for journal submission)
# Creates a 'Tables' subdirectory where all tables will be output
dir.create("Tables", showWarnings = FALSE)
write.csv(annot_counts, file=paste0("./Tables/all_genes_", paste(degroups, collapse = "_Vs_"), ".csv"), row.names = FALSE)

# Pull out just significant genes (change from 0.05 to 0.01 if needed)
DE_genes <- subset(annot_counts, padj < 0.05, select=colnames(annot_counts))
# NOTE: add lfc cutoffs if needed, e.g. keep only genes with log2FoldChange > 1 or log2FoldChange < -1
# DE_genes <- subset(DE_genes, log2FoldChange > 1 | log2FoldChange < -1, select=colnames(DE_genes))

# Export the sig DE genes as a table (to be used in downstream analysis)
write.csv(DE_genes, file=paste0("./Tables/DE_genes_", paste(degroups, collapse = "_Vs_"), ".csv"), row.names = FALSE)
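The merge, de-duplication and filtering steps above can be followed on a small made-up example using base R only. The gene IDs, symbols and values below are hypothetical. One gene ID maps to two symbols, to show why duplicates appear after the merge.

```r
# Toy DE results, with gene IDs as row names (made-up values)
res_toy <- data.frame(log2FoldChange = c(2.5, -1.8, 0.3),
                      padj = c(0.001, 0.02, 0.6),
                      row.names = c("ENSG01", "ENSG02", "ENSG03"))

# Toy annotation map: ENSG01 has two symbols, ENSG03 has none
map_toy <- data.frame(ENSEMBL = c("ENSG01", "ENSG01", "ENSG02", "ENSG03"),
                      SYMBOL = c("GeneA", "GeneA_alt", "GeneB", NA))

# by.x = 0 merges on the row names of res_toy; they end up in 'Row.names'
annot_toy <- merge(x = res_toy, y = map_toy, by.x = 0, by.y = "ENSEMBL")
nrow(annot_toy)  # 4 rows: ENSG01 appears twice

# Keep only the first row for each gene ID
annot_toy <- annot_toy[!duplicated(annot_toy$Row.names), ]
nrow(annot_toy)  # 3 rows: one per gene

# Significant genes with an additional log2 fold change cutoff
sig_toy <- subset(annot_toy, padj < 0.05 & abs(log2FoldChange) > 1)
sig_toy
```

Using `abs(log2FoldChange) > 1` is a compact way of writing the two-sided cutoff (greater than 1 or less than -1).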

6b. Volcano plot

Now that we have our list of DE genes, we can visualise them with a volcano plot and heat map.

...

Code Block

#### 6b. Volcano plot ####

p <- EnhancedVolcano(annot_nodups, lab = annot_nodups$SYMBOL, selectLab = annot_nodups$SYMBOL[1:20], drawConnectors = TRUE, title = NULL, subtitle = NULL, x = 'log2FoldChange', y = 'pvalue')

p

# NOTE: the above plot shows labels for the top 20 DE genes (i.e. those with the lowest adjusted p values).

# Output as publication quality (300dpi) tiff and pdf.
# Create a (300dpi) tiff
ggsave(file = paste0("./figures/volcano_", paste(degroups, collapse = "_Vs_"), ".tiff"), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p)
# Create a pdf
ggsave(file = paste0("./figures/volcano_", paste(degroups, collapse = "_Vs_"), ".pdf"), device = "pdf", width = 10, height = 8, plot = p)
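A volcano plot places each gene at x = log2 fold change and y = -log10(p value), so genes that are both strongly changed and highly significant sit in the top left and top right corners. The sketch below computes those coordinates and a simple up/down label on made-up values. The helper name `classify_volcano()` and the cutoffs of 1 and 0.05 are assumptions chosen for illustration, not settings taken from the EnhancedVolcano call above.

```r
# Hypothetical helper: label each gene as Up, Down or Not significant.
classify_volcano <- function(log2fc, pvalue, lfc_cutoff = 1, p_cutoff = 0.05) {
  status <- rep("Not significant", length(log2fc))
  # Genes with missing values are left as "Not significant"
  sig <- !is.na(pvalue) & !is.na(log2fc) & pvalue < p_cutoff
  status[sig & log2fc > lfc_cutoff] <- "Up"
  status[sig & log2fc < -lfc_cutoff] <- "Down"
  status
}

# Toy results table (made-up values)
toy <- data.frame(log2FoldChange = c(2.3, -1.7, 0.4, 3.1, NA),
                  pvalue = c(1e-8, 1e-4, 1e-6, 0.2, NA))

# y-axis value of the volcano plot
toy$neg_log10_p <- -log10(toy$pvalue)
toy$status <- classify_volcano(toy$log2FoldChange, toy$pvalue)
toy

# Number of genes in each category
table(toy$status)
```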

6c. DE genes heatmap

This section plots a heatmap and dendrogram of DE gene expression per sample. Expression counts are scaled and centered so that groupwise relationships can be examined.
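Scaling and centering by gene means converting each row to z-scores: the row mean is subtracted and the result is divided by the row standard deviation. The sketch below shows this with base R on made-up normalised counts. It is an illustration of the idea rather than the code used in this section. pheatmap can apply the same transformation itself through its `scale = "row"` argument.

```r
# Toy normalised counts: one highly expressed and one lowly expressed gene
toy_counts <- rbind(geneA = c(100, 120, 400, 420),
                    geneB = c(5, 6, 1, 2))
colnames(toy_counts) <- c("ctrl1", "ctrl2", "trt1", "trt2")

# scale() works on columns, so transpose, scale, then transpose back
toy_scaled <- t(scale(t(toy_counts)))
round(toy_scaled, 2)

# Each gene now has mean 0 and standard deviation 1 across samples
rowMeans(toy_scaled)
apply(toy_scaled, 1, sd)
```

Without scaling, the highly expressed gene would dominate the colour range of the heatmap. After scaling, both genes show their up and down pattern across the groups equally clearly.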

...

Version	Old Version 25	New Version 26
Changes made by	Paul Whatmore (Deactivated)	Paul Whatmore (Deactivated)
Saved on	Sept 20, 2023	Sept 20, 2023
