Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Table of Contents

Aim:

Identify statistically significant (FDR < 0.05) differentially expressed genes. Visualise results with PCA plots, heatmaps and volcano plots.

Requirements

Run your samples (FASTQ) using the nextflow nf-core/RNA-seq pipeline using ‘star_salmon’ (Task 3 session) or an alternative pipeline that generates feature counts.

Installing R and Rstudio

You will need to have both R and Studio installed in your laptop/desktop for today’s exercises.

Go to the following page https://posit.co/download/rstudio-desktop/ and follow instructions provided to install first R and then Rstudio.

Install R:

https://cran.r-project.org/bin/windows/base/

install Rstudio:

https://posit.co/download/rstudio-desktop/

RNAseq gene expression feature counts

...

Code Block
├── results
│   ├── fastqc
│   ├── multiqc
│   ├── pipeline_info
│   ├── star_salmon
│   └── trimgalore

The gene expression feature counts can be found in the /results/star_salmon/ folder. Let’s access the folder (i.e., cd /run3/results/star_salmon). A list of files and folders in the star_salmon folder will look like this:

...

The analysis scripts in this guide are written in R script. We will be using RStudio, a front-end gui for R, to run the analysis scripts.

You have three main options for running this analysis in RStudio:

  1. Use QUTs rVDI virtual desktop machines

  2. Install R and RStudio on your own PC

  3. Use the provided PCs in the QUT computer labs

Option1: Use QUTs rVDI virtual desktop machines

This is the preferred method, as R and RStudio are already installed, as are all the required R packages needed for analysis. Installing all of these can take over 30 minutes on your own PC, so using an rVDI machine saves time.

rVDI provides a virtual Windows desktop that can be run in your web browser.

To access and run an rVDI virtual desktop:

Go to https://rvdi.qut.edu.au/

Click on ‘VMware Horizon HTML Access

Log on with your QUT username and password

*NOTE: you need to be connected to the QUT network first, either being on campus or connecting remotely via VPN.

Option2: Install R and RStudio on your own PC

Go to the following page https://posit.co/download/rstudio-desktop/ and follow instructions provided to install first R and then Rstudio.

Download and install R, following the default prompts:

https://cran.r-project.org/bin/windows/base/

Download and install RStudio, following the default prompts:

https://posit.co/download/rstudio-desktop/

Option3: Use the provided PCs in the QUT computer labs

The PCs in the computer labs already have R and RStudio installed. If using this option, you will need to install the required R packages (unlike rVDI). The code for installing these packages is in the analysis section below.

Download your count table

The basis for the differential expression analysis is a count table of sequence reads mapped to defined gene regions per sample. There are a variety of methods to generate this count table, but for this exercise we will be using the output from the Nextflow nfcore/rnaseq analysis you completed in the previous workshop sessions.

To access this count table:

Go to the W:\training\rnaseq\runs\run3_RNAseq\results folder that contains the results from running the nfcore/rnaseq pipeline. The output folders from task 3 look like this:

Code Block
├── results
│   ├── fastqc
│   ├── multiqc
│   ├── pipeline_info
│   ├── star_salmon
│   └── trimgalore

The count table can be found in the star_salmon folder. A list of files and folders in the star_salmon folder will look like this:

Code Block
.
├── bigwig
├── CD49fmNGFRm_rep1
├── CD49fmNGFRm_rep1.markdup.sorted.bam
├── CD49fpNGFRpCD49fmNGFRm_rep1.markdup.sorted.bam.bai
├── CD49fpNGFRpCD49fmNGFRm_rep2
├── CD49fpNGFRpCD49fmNGFRm_rep2.markdup.sorted.bam
├── CD49fpNGFRpCD49fmNGFRm_rep2.markdup.sorted.bam.bai
├── CD49fpNGFRpCD49fmNGFRm_rep3
├── CD49fpNGFRpCD49fmNGFRm_rep3.markdup.sorted.bam
├── CD49fpNGFRpCD49fmNGFRm_rep3.markdup.sorted.bam.bai
├── deseq2_qc
├── dupradar
├── featurecounts
├── log
├── MTEC_CD49fpNGFRp_rep1
├── MTECCD49fpNGFRp_rep1.markdup.sorted.bam
├── MTECCD49fpNGFRp_rep1.markdup.sorted.bam.bai
├── MTECCD49fpNGFRp_rep2
├── MTECCD49fpNGFRp_rep2.markdup.sorted.bam
├── MTECCD49fpNGFRp_rep2.markdup.sorted.bam.bai
├── MTECCD49fpNGFRp_rep3
├── MTECCD49fpNGFRp_rep3.markdup.sorted.bam
├── MTECCD49fpNGFRp_rep3.markdup.sorted.bam.bai
├── picard_metricsdeseq2_qc
├── dupradar
├── qualimapfeaturecounts
├── rseqclog
├── salmon.merged.gene_counts_length_scaled.rdsMTEC_rep1
├── salmonMTEC_rep1.merged.gene_counts_length_scaled.tsvmarkdup.sorted.bam
├── salmon.merged.gene_counts.rdsMTEC_rep1.markdup.sorted.bam.bai
├── salmon.merged.gene_counts_scaled.rdsMTEC_rep2
├── salmonMTEC_rep2.merged.gene_counts_scaled.tsvmarkdup.sorted.bam
├── salmon.merged.gene_counts.tsv  <---- We will use this feature counts file for DESeq2 expression analysis. 
MTEC_rep2.markdup.sorted.bam.bai
├── MTEC_rep3
├── MTEC_rep3.markdup.sorted.bam
├── MTEC_rep3.markdup.sorted.bam.bai
├── picard_metrics
├── qualimap
├── rseqc
├── salmon.merged.gene_counts_length_tpmscaled.tsvrds
├── salmon.merged.transcriptgene_counts_length_scaled.rdstsv
├── salmon.merged.transcriptgene_counts.tsvrds
├── salmon.merged.transcriptgene_counts_tpmscaled.tsvrds
├── salmon_tx2gene.merged.gene_counts_scaled.tsv
├── samtools_stats
└── stringtie

...

salmon.merged.gene_counts.tsv

...

Code Block
head salmon.merged.gene_counts.tsv

the above command will print:

Code Block
gene_id gene_name       CD49fmNGFRm_rep1        CD49fmNGFRm_rep2        CD49fmNGFRm_rep3        CD49fpNGFRp_rep1        CD49fpNGFRp_rep2        CD49fpNGFRp_rep3     MTEC_rep1       MTEC_rep2       MTEC_rep3
ENSMUSG00000000001      Gnai3   2460    2395    2749    2686    3972    4419    7095    4484    6414
ENSMUSG00000000003      Pbsn    0       0       0       0       0       0       0       0       0
ENSMUSG00000000028      Cdc45   43      57      55      79      87.999  89      1241    830     1041.999
ENSMUSG00000000031      H19     2       0       1       17.082  24      16.077  200     139     145.604
ENSMUSG00000000037      Scml2   8       8       16      23      29.001  29      69      57      67
ENSMUSG00000000049      Apoh    1       0       2       1       2       0       0       0       2
ENSMUSG00000000056      Narf    522     496     539     368     457     538     1939    1483    1734
ENSMUSG00000000058      Cav2    1352.999        1349    1371.999        2684.001        4370    4386    6018.999        3429    5501
ENSMUSG00000000078      Klf6    4411    3492    4500    3221    3989    4637    3812    2741    3558

Now let’s copy the ‘salmon.merged.gene_counts.tsv’ file to your laptop/desktop using the file finder.

  • Windows - in the file finder type: \\hpc-fs\work\

  • Mac - open ‘Finder’ → press ‘command + k’ → type: ‘smb://hpc-fs/work/

Now let’s find the full path to the ‘salmon.merged.gene_counts.tsv’ file:

  • Windows:

  • Mac:

    • cd /folder/that/contains/feature_counts/

    • pwd

  • Rstudio:

    • Open Rstudio, go to the top bar a click on “Session” → “Select working directory: → “Choose directory

    • The path to the directory will be printed in Rstudio console, copy and paste in line 1 of the script ‘RNAseq_DESeq2_analysis.R’ (see below)

Differential Expression Analysis using DESeq2

We will now perform the following tasks using Rstudio:

  • install R package manager

  • install R packages (only needed once) - After installation we only need to load the packages

  • set working directory (directory where the gene counts file is located)

  • run DESeq2 analysis, export results and generate plots

Open R Studio and if not prompted automatically, create a R script file as follows:

Open Rstudio → click on ‘File’ in the top bar → select “New File” → select “R script

Install R package manager

We will use Bioconductor to install the “BiocManager”. Copy and paste the following code into the R studio script. Select all three lines in the R script and click on “Run”. Note: version 3.17 requires a R version 4.3 (which.

Code Block
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(version = "3.17")

Install / Load packages:

Code Block
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")

#load packages
library("DESeq2")
library(ggplot2)
library("ggrepel")
library("EnhancedVolcano")

#install.packages("dendsort")
library(dendsort)
library("reshape2")
library("plyr")
library(Cairo)
library(pheatmap)
library("dendextend")
require(reshape2)

Prepare INPUT files

  • INPUT 1: Create a folder called ‘data’ and copy the gene counts file to a folder. For example, the gene counts table looks like this:

Code Block
gene_id gene_name       CD49fmNGFRm_rep1        CD49fmNGFRm_rep2        CD49fmNGFRm_rep3        CD49fpNGFRp_rep1        CD49fpNGFRp_rep2        CD49fpNGFRp_rep3     MTEC_rep1       MTEC_rep2       MTEC_rep3
ENSMUSG00000000001      Gnai3   2460    2395    2749    2686    3972    4419    7095    4484    6414
ENSMUSG00000000003      Pbsn    0       0       0       0       0       0       0       0       0
ENSMUSG00000000028      Cdc45   43      57      55      79      87.999  89      1241    830     1041.999
ENSMUSG00000000031      H19     2       0       1       17.082  24      16.077  200     139     145.604
ENSMUSG00000000037      Scml2   8       8       16      23      29.001  29      69      57      67
ENSMUSG00000000049      Apoh    1       0       2       1       2       0       0       0       2
ENSMUSG00000000056      Narf    522     496     539     368     457     538     1939    1483    1734
ENSMUSG00000000058      Cav2    1352.999        1349    1371.999        2684.001        4370    4386    6018.999        3429    5501
ENSMUSG00000000078      Klf6    4411    3492    4500    3221    3989    4637    3812    2741    3558
ENSMUSG00000000085      Scmh1   1561.002        1549.002        1694.001        2276.001        2663    3290.999        2437.001        1759.001    2325.002
ENSMUSG00000000088      Cox5a   1136    1425    1691    947     1410    1613.998        3173.999        1873.992        2865.998
ENSMUSG00000000093      Tbx2    74      40      66      354     380     460     444     295     436
ENSMUSG00000000094      Tbx4    1       0       0       0       0       0       1       1       1
ENSMUSG00000000103      Zfy2    0       0       0       0       0       0       0       0       0
ENSMUSG00000000120      Ngfr    379     313     363     6577    7083    8670    4050    3336    3534
ENSMUSG00000000125      Wnt3    11      19      21      1       0       0       3       0       3
ENSMUSG00000000126      Wnt9a   69      104     89      405     480     554     1663    1012    1730
ENSMUSG00000000127      Fer     696     776     898     558     777     958     911     700     892
ENSMUSG00000000131      Xpo6    2206    2172    2408    2402    2866    3327.002        3919    3204    3763
  • INPUT 2: Sample metadata:

Code Block
sample_name	sample_ID	group
CD49fmNGFRm_rep1	DC1	Differentiated_cells
CD49fmNGFRm_rep2	DC2	Differentiated_cells
CD49fmNGFRm_rep3	DC3	Differentiated_cells
CD49fpNGFRp_rep1	BC1	Basal_cells
CD49fpNGFRp_rep2	BC2	Basal_cells
CD49fpNGFRp_rep3	BC3	Basal_cells
MTEC_rep1	mTEC1	Murine_tracheal_epithelial_cell
MTEC_rep2	mTEC2	Murine_tracheal_epithelial_cell
MTEC_rep3	mTEC3	Murine_tracheal_epithelial_cell

R script - run the script using Rstudio

Code Block
##STEP1: import gene counts
countData <- read.table('salmon.merge.gene_counts.tsv', header = TRUE, sep = "\t")
head(countData)

##STEP2: import metaData
metaData <- read.csv('metadata.tsv*', header = TRUE, sep = ",")
metaData

##STEP3: Construct DESEQDataSet Object
dds <- DESeqDataSetFromMatrix(countData=countData, 
                              colData=metaData, 
                              design=~celltype, tidy = TRUE)

#define the reference group set as the first mentioned in levels
dds$celltype <- factor(dds$celltype, levels = c("CON","TB"))

##STEP4: Run DESEQ function
dds <- DESeq(dds)


##STPE5: Take a look at the results table
res <- results(dds)
head(results(dds, tidy=TRUE)) #let's look at the results table


##STEP6: Summary of differential gene expression
summary(res) #summary of results


##STEP7: Sort summary list by p-value
res <- res[order(res$padj),]
head(res)
head(res, 9)

##STEP8: save DEGs to file
write.table(res, file = "NoHeadInjury_CON__vs__HeadInjury_TB__DESeq2_toptags_DEGs.txt")


##STEP9: save normalised counts to file
foo <- counts(dds, normalized = TRUE)
write.csv(foo, file="NoHeadInjury_CON__vs__HeadInjury_TB__DESeq2_toptags_DEGs__Normalised_Counts.csv")


##STEP10: PCA
#version 1
pdf(file="Figure_NoHeadInjury_CON__vs__HeadInjury_TB__PCA.pdf")
vsdata <- vst(dds, blind=TRUE, nsub = 100)
plotPCA(vsdata, intgroup="dex") 
dev.off()

#version 2:
pdf(file="Figure_NoHeadInjury_CON__vs__HeadInjury_TB_label.pdf")
p <- plotPCA(vsdata, intgroup="dex") 
p + geom_text_repel(aes(label=substr(name, start = 1, stop = 5)),size = 2, min.segment.length = 0, seed = 42, box.padding = 0.5, max.overlaps = 20)
dev.off()

## jpeq
library(Cairo)

Cairo(file="Figure_NoHeadInjury_CON__vs__HeadInjury_TB__PCA_label.jpeg",
      type="png",
      units="px", 
      width=2500, 
      height=2000, 
      pointsize=12, 
      dpi=300)
p <- plotPCA(vsdata, intgroup="dex") 
p + geom_text_repel(aes(label=substr(name, start = 1, stop = 5)),size = 2, min.segment.length = 0, seed = 42, box.padding = 0.5, max.overlaps = 20)
dev.off()


#STEP11: VOLCANO PLOTS
#basic volcano plot 
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'pvalue')

#save enhanced volcano plot to file
pdf(file="Figure_NoHeadInjury_CON__vs__HeadInjury_TB__enhanced_volcano_plot_2.0FC_p1e-2.pdf")
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'pvalue',
                title = 'NoHeadInjury_CON__vs__HeadInjury_TB',
                pCutoff = 1e-2,
                FCcutoff = 1.5,
                pointSize = 1.0,
                labSize = 3.5)
dev.off()

pdf(file="Figure_NoHeadInjury_CON__vs__HeadInjury_TB__enhanced_volcano_plot_2.0FC_p10e-3.pdf")
EnhancedVolcano(res,
                      lab = rownames(res),
                      x = "log2FoldChange",
                      y = "pvalue",
                      pCutoff = 10e-3,
                      FCcutoff = 2,
                      ylim = c(0, -log10(10e-12)),
                      pointSize = c(ifelse(res$log2FoldChange>2, 2, 2)),
                      labSize = 3.0,
                      #shape = c(6, 6, 19, 16),
                      title = "DESeq2: NoHeadInjury_CON vs. HeadInjury_TB",
                      subtitle = "Differential expression",
                      caption = bquote(~Log[2]~ "fold change cutoff, 2; p-value cutoff, 10e-3"),
                      #legendPosition = "right",
                      legendLabSize = 8,
                      col = c("grey30", "forestgreen", "royalblue", "red2"),
                      colAlpha = 0.9,
                      drawConnectors = TRUE,
                      hline = c(10e-3),
                      widthConnectors = 0.5)
dev.off()


#save as high resolution jpeg
library(Cairo)

## Initialize the device for saving first plot.
## Plot will have length=500 pixel, width=500 pixel

Cairo(file="Figure_NoHeadInjury_CON__vs__HeadInjury_TB__enhanced_volcano_plot_2.0FC_p1e-3.jpeg",
      type="png",
      units="px", 
      width=2800, 
      height=2500, 
      pointsize=12, 
      dpi=300)
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'pvalue',
                title = 'NoHeadInjury_CON__vs__HeadInjury_TB',
                pCutoff = 1e-3,
                FCcutoff = 2.0,
                pointSize = 1.0,
                labSize = 3.0)
dev.off()


#option2:
Cairo(file="Figure_NoHeadInjury_CON__vs__HeadInjury_TB__enhanced_volcano_plot_2.0FC_p10e-3_v2.jpeg",
      type="png",
      units="px", 
      width=2500, 
      height=2200, 
      pointsize=14, 
      dpi=300)
EnhancedVolcano(res,
                lab = rownames(res),
                x = "log2FoldChange",
                y = "pvalue",
                pCutoff = 10e-3,
                FCcutoff = 2,
                ylim = c(0, -log10(10e-12)),
                pointSize = c(ifelse(res$log2FoldChange>2, 2, 2)),
                labSize = 3.5,
                #shape = c(6, 6, 19, 16),
                title = "DESeq2: NoHeadInjury_CON vs. HeadInjury_TB",
                subtitle = "Differential expression",
                caption = bquote(~Log[2]~ "fold change cutoff, 2; p-value cutoff, 10e-3"),
                #legendPosition = "right",
                legendLabSize = 8,
                col = c("grey30", "forestgreen", "royalblue", "red2"),
                colAlpha = 0.9,
                drawConnectors = TRUE,
                hline = c(10e-3),
                widthConnectors = 0.5)
dev.off()




#save as high res tiff

tiff(file="Figure_NoHeadInjury_CON__vs__HeadInjury_TB__enhanced_volcano_plot_2.0FC_p1e-5.tiff",
     width=8, height=8, units="in", res=150)
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'pvalue',
                title = 'NoHeadInjury_CON__vs__HeadInjury_TB',
                pCutoff = 1e-5,
                FCcutoff = 2.0,
                pointSize = 1.5,
                labSize = 3.5)

dev.off()

A copy of the script is at /demo/run3_full_pipeline/

RNAseq_DESeq2.R script:

Code Block

# Differential expression analysis

# When you see '## USER INPUT', this means you have to modify the code for your computer or dataset. All other code can be run as-is (i.e. you don't need to understand the code, just run it)

#### Installing required packages ####

# **NOTE: this section only needs to be run once (or occasionally to update the packages)
# Install devtools
install.packages("devtools", repos = "http://cran.us.r-project.org")
# Install R packages. This only needs to be run once.
bioconductor_packages <- c("DESeq2", "EnhancedVolcano")
cran_packages <- c("ggrepel", "ggplot2", "plyr", "reshape2", "readxl", "FactoMineR", "factoextra")
# Compares installed packages to above packages and returns a vector of missing packages
new_packages <- bioconductor_packages[!(bioconductor_packages %in% installed.packages()[,"Package"])]
new_cran_packages <- cran_packages[!(cran_packages %in% installed.packages()[,"Package"])]
# Install missing bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(new_packages)
# Install missing cran packages
if (length(new_cran_packages)) install.packages(new_cran_packages, repos = "http://cran.us.r-project.org")
# Update all installed packages are the latest version
update.packages(bioconductor_packages, ask = FALSE)
update.packages(cran_packages, ask = FALSE, repos = "http://cran.us.r-project.org")


#### Loading required packages ####

# This section needs to be run very time
# Load packages
bioconductor_packages <- c("DESeq2", "EnhancedVolcano")
cran_packages <- c("ggrepel", "ggplot2", "plyr", "reshape2", "readxl", "FactoMineR", "factoextra")
lapply(cran_packages, require, character.only = TRUE)
lapply(bioconductor_packages, require, character.only = TRUE)


#### Import your count data ####

# Make sure you have: a) your count table (salmon.merged.gene_counts.tsv file, if you used Nextflow nfcore/rnaseq to analyse your data). Copy this to a subdirectory called 'data'. b) your metadata file. This should be either an Excel file called 'metadata.xlsx' or a tab-separated text file called 'metadata.txt'. It needs 3 columns called 'sample_name', 'sample_ID' and 'group'. The sample names should be EXACTLY the same as the names in the count table. These names are often uninformative and long, so the 'sample_ID' is the sample labels you want to put on your plots. E.g. if you have a 'high fat' group, you might want to rename the samples HF1, HF2, HF3, etc)

## USER INPUT
# Set working directory. 
# Change this to your working directory (In the RStudio menu: Session -> Set working directory -> Choose working directory)
setwd("C:/Users/whatmorp/OneDrive - Queensland University of Technology/Desktop/Projects/RNA-Seq downstream analysis")

# Import your count data. make sure you've created a 'data' subdirectory and put the count table file there.
metacountdata <- read.table("./data/salmon.merged.gene_counts.tsv", header = TRUE, row.names = 1)

# Import metadata (option 1: text file). Again, need a metadata.txt file in the data subdirectory.
meta <- read.table("./data/metadata.txt", header = T)
# Import metadata (option 1: Excel file))
#meta <- read_excel("./data/metadata.xlsx")

# Remove 1st columns of metadata (gene_name)
counts <- metacountdata[ ,2:ncol(metacountdata)]

# Rename sample names to new sample IDs
counts <- counts[as.character(meta$sample_name)]
colnames(counts) <- meta$sample_ID

# Counts need to be rounded to integers
counts <- ceiling(counts)


#### Outliers and batch effects ####

# This section normalises and transforms the count data so that it can be plotted on a PCA plot and a heatmap

# Normalise counts by library size ##
# Using DeSeq2's estimateSizeFactors(). Note that DeSeq2 does this internally during DEG calling. The normalisation below is done separately for PCA and density plotting.

## USER INPUT
# Choose the groups you want to plot in a PCA/Heatmap. You can select any 2 or more of the groups (or all of the groups) you have in your 'groups' column of your metadata table
# To see what groups are present, run the following:
unique(meta$group)
# Now add which groups you want to plot (i.e. replace the groupnames below, and add more, separated by a comma and in "quotes", as needed). NOTE: R is case-sensitive, so these group names must be named EXACTLY the same as in the metadata table.
plotgroups <- c("Differentiated_cells", "Basal_cells")

# Pull out only the counts from the above groups
groupcounts <- counts[meta$group %in% plotgroups]

# Set up the initial DeSeq2 experimental parameters. This is the length of the no' of samples being analysed. In full DeSeq2 analysis this is to separate groups as factors, but here a simple count is fine.
condition <- factor(1:length(groupcounts))
# Set up the column data. A data frame of sample ID's and conditions
coldata <- data.frame(row.names=colnames(groupcounts), condition)
# Set up the DeSeq2 data set structure
f <- DESeqDataSetFromMatrix(countData = groupcounts, colData = coldata, design= ~ condition)
# Estimate the size factors. See DeSeq2 manual for details
f <- estimateSizeFactors(f)
# Size factors can be viewed by: sizeFactors(f)

# Multiply each row (sample) by the corresponding size factor
subcount_norm <- as.matrix(groupcounts) %*% diag(sizeFactors(f))

# Re-add column names
colnames(subcount_norm) <- colnames(groupcounts)

## Remove low coverage transcripts (mean count < 10) ##

# Find the mean of each row (and output as a data frame object)
means <- as.data.frame(rowMeans(subcount_norm))

# Then join the means data with the counts
means <- cbind(means, subcount_norm)

# Then subset out only genes with mean > 10
data <- subset(means, means[ , 1] > 10)

# Remove the means column
data <- data[,-1]

# Transform data
data_log <- vst(round(as.matrix(data)))

# Transformation can create some infinite values. Can't generate PCA data on these. Can see how many by: sum(sapply(data_log, is.infinite))
# To remove infinite rows, use 'is.finte' or '!is.infinite'
data_log <- data_log[is.finite(rowSums(data_log)),]
colnames(data_log) <- colnames(groupcounts)

### Set up the PCA plot base data ###

# We're using the FactoMineR package to generate PCA plots (http://factominer.free.fr/index.html)

# Need to transpose the data first
data_log_t <- t(data_log)

# Add the group data
data_log_t_vars <- data.frame(meta$group[meta$group %in% plotgroups], data_log_t)

# Generate the PCA data using FactoMineR package
res.pca <- PCA(data_log_t_vars, quali.sup = 1, graph=FALSE)

## Set up the dendogram/heatmaps base data ##

# Calculate the distance matrix:
distance_matrix <- as.matrix(dist(t(data_log)))


#### PCA plot ####

# Generate the PCA plot. Groups are shaded with elipses at 95% confidence level. NOTE: at least 4 replicates need to be in a group for an elipses to be drawn.
p <- fviz_pca_ind(res.pca,
                  geom.ind = "point", # show points only (but not "text")
                  col.ind = meta$group[meta$group %in% plotgroups], # color by groups
                  pointsize = 5, title = "",
                  addEllipses = TRUE, ellipse.type = "t", ellipse.level = 0.95) + theme(legend.text = element_text(size = 12), legend.title = element_text(size = 14), axis.title=element_text(size=16), axis.text=element_text(size=14))

p

## USER INPUT. Change these to 
pdf_filename <- "pca.pdf"
tiff_filename <- "pca.tiff"

# Output as publication quality (300dpi) tiff and pdf (Note: this will create a 'figures subdirectory where all figures will be output)
dir.create("figures", showWarnings = FALSE)
ggsave(file = paste0("./figures/", tiff_filename), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p)
ggsave(file = paste0("./figures/", pdf_filename), device = "pdf", width = 10, height = 8, plot = p)


#### Heatmap and dendrogram ####











...

  <---- We will use this feature counts file for DESeq2 expression analysis. 
├── salmon.merged.gene_tpm.tsv
├── salmon.merged.transcript_counts.rds
├── salmon.merged.transcript_counts.tsv
├── salmon.merged.transcript_tpm.tsv
├── salmon_tx2gene.tsv
├── samtools_stats
└── stringtie

The expression count file that we are interested is salmon.merged.gene_counts.tsv

Code Block
head salmon.merged.gene_counts.tsv

Differential Expression Analysis using DESeq2

We will now perform the following tasks using Rstudio:

  1. Preparing your data. Two data files needed for this analysis: a samples table and your count table

  2. Install required R packages (only need to run once) - after installation we only need to load the packages. NOTE: If using an rVDI virtual machine, the R packages are already installed

  3. Load required R packages. Unlike installing the packages, this needs to be done every time you run the analysis

  4. Import your data files (count table and samples table)into R

  5. Checking for outliers and batch effects

    1. PCA plot

    2. Pairwise samples heatmap

  6. Identify differentially expressed (DE) genes using DESeq2

    1. Annotating your DE genes

    2. Volcano plot

    3. DE genes heatmap

1. Preparing your data

You will need 2 data files to complete this analysis: your count table (see above) and a samples table.

a. First create a new folder in H:\workshop\RNAseq . Call it something suitable, such as ‘DE_analysis_workshop’

b. Create a sub folder here called ‘data’. This is where your two data files will be stored

c. Copy the count table (the ‘salmon.merged.gene_counts.tsv' file) to the 'data’ folder you just created.

The count table looks like this:

Code Block
gene_id	gene_name	SRR20622172	SRR20622173	SRR20622174	SRR20622175	SRR20622176	SRR20622177	SRR20622178	SRR20622179	SRR20622180
ENSMUSG00000000001	Gnai3	7086	4470	2457.002	2389	6398	2744	2681	3961	4399
ENSMUSG00000000003	Pbsn	0	0	0	0	0	0	0	0	0
ENSMUSG00000000028	Cdc45	1232.999	827	42	57	1036	55	78	88	89
ENSMUSG00000000031	H19	200	139	2	0	143.622	1	17.082	24	16.077
ENSMUSG00000000037	Scml2	70	57.001	8	8	66.999	16	23	27.999	29
ENSMUSG00000000049	Apoh	0	0	1	0	2	2	1	3	0
ENSMUSG00000000056	Narf	1933	1480	519	497	1730	539	365	458	536
ENSMUSG00000000058	Cav2	6008	3417	1347.001	1344	5482	1367	2669.001	4358	4365.832
ENSMUSG00000000078	Klf6	3809	2732	4413.001	3483.978	3559	4491	3209	3980	4626

d. In the same W:\training\rnaseq\runs\run3_RNAseq\results\star_salmon directory there will be a file called metadata.xlsx . Copy this file to your ‘data’ folder as well. This file will normally need to be manually created by you to match your sample IDs and treatment groups, but we created this file already for you to use. This samples table needs 3 columns called ‘sample_name’, containing the sample names seen in the count table (column names), ‘sample_ID’, which is the (less messy) names you want to call the samples in this analysis workflow, and ‘group’, which contains the treatment groups each sample belongs to. The contents of this file look like this:

Code Block
sample_name	sample_ID	group
SRR20622174	DC1 Differentiated_cells
SRR20622175	DC2	Differentiated_cells
SRR20622177	DC3	Differentiated_cells
SRR20622178	BC1	Basal_cells
SRR20622179	BC2	Basal_cells
SRR20622180	BC3	Basal_cells
SRR20622172	mTEC1	Murine_tracheal_epithelial_cell
SRR20622173	mTEC2	Murine_tracheal_epithelial_cell
SRR20622176	mTEC3	Murine_tracheal_epithelial_cell

e. Open RStudio and create a new R script (‘File’ → “New File” → “R script”). Now hit ‘File’ → ‘Save’ and save the script in the analysis workshop folder you created in step a. (NOT IN THE ‘data’ FOLDER). Give the script file a name (e.g. DESEq2.R).

The following analysis contains R code (in the grey text boxes) that you can copy and paste, then run, into the R script you just created.

2. Install required R packages

Copy and paste the following code into the R script you just created, then run the code (highlight all the code in your R script, then press the run button). This will install all the required packages and dependencies and may take 30 minutes or more to complete. It may prompt you occasionally to update packages - select 'a' for all if/when this occurs.

...

NOTE: you only need to run this section once on any laptop/PC, and you don’t need to run it if you’re using an rVDI machine.

Code Block
#### Differential expression analysis ####

# When you see '## USER INPUT', this means you have to modify the code for your computer or dataset. All other code can be run as-is (i.e. you don't need to understand the code, just run it)

#### 2. Installing required packages ####

# **NOTE: this section only needs to be run once (or occasionally to update the packages)
# Install devtools
install.packages("devtools", repos = "http://cran.us.r-project.org")
# Install R packages. This only needs to be run once.
bioconductor_packages <- c("DESeq2", "EnhancedVolcano", "org.Hs.eg.db", "org.Mm.eg.db", "org.Rn.eg.db", "org.EcK12.eg.db", "org.EcSakai.eg.db", "org.Dr.eg.db", "org.Dm.eg.db")
cran_packages <- c("ggrepel", "ggplot2", "plyr", "reshape2", "readxl", "FactoMineR", "factoextra", "pheatmap")
# Compares installed packages to above packages and returns a vector of missing packages
new_packages <- bioconductor_packages[!(bioconductor_packages %in% installed.packages()[,"Package"])]
new_cran_packages <- cran_packages[!(cran_packages %in% installed.packages()[,"Package"])]
# Install missing bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(new_packages)
# Install missing cran packages
if (length(new_cran_packages)) install.packages(new_cran_packages, repos = "http://cran.us.r-project.org")
# Update all installed packages to the latest version
update.packages(bioconductor_packages, ask = FALSE)
update.packages(cran_packages, ask = FALSE, repos = "http://cran.us.r-project.org")

3. Load required R packages

This section loads the packages you’ve installed in the previous section. Unlike installing packages, this needs to be run every time and should only take a few seconds to run.

Code Block
#### 3. Loading required packages ####

# This section needs to be run every time
# Load packages
bioconductor_packages <- c("DESeq2", "EnhancedVolcano", "org.Hs.eg.db", "org.Mm.eg.db", "org.Rn.eg.db", "org.EcK12.eg.db", "org.EcSakai.eg.db", "org.Dr.eg.db", "org.Dm.eg.db")
cran_packages <- c("ggrepel", "ggplot2", "plyr", "reshape2", "readxl", "FactoMineR", "factoextra", "pheatmap")
lapply(cran_packages, require, character.only = TRUE)
lapply(bioconductor_packages, require, character.only = TRUE)

4. Import your data files into R

In this section we will import your count table and samples table into R.

You’ll need to change the ‘setwd' line to your working directory. Click ‘Session’ → ‘Set working directory’ → ‘Choose working directory’ and then choose the analysis workshop directory you created previously, that contains your R script file and the 'Data’ directory.

Code Block
#### 4. Import your count data ####

# Make sure you have: a) your count table (salmon.merged.gene_counts.tsv file, if you used Nextflow nfcore/rnaseq to analyse your data). Copy this to a subdirectory called 'data'. b) your metadata file. This should be either an Excel file called 'metadata.xlsx' or a tab-separated text file called 'metadata.txt'. It needs 3 columns called 'sample_name', 'sample_ID' and 'group'. The sample names should be EXACTLY the same as the names in the count table. These names are often uninformative and long, so the 'sample_ID' is the sample labels you want to put on your plots. E.g. if you have a 'high fat' group, you might want to rename the samples HF1, HF2, HF3, etc)

## USER INPUT
# Set working directory. 
# Change this to your working directory (In the RStudio menu: Session -> Set working directory -> Choose working directory)
setwd("C:/Users/whatmorp/OneDrive - Queensland University of Technology/Desktop/Projects/RNA-Seq downstream analysis")

# Import your count data. make sure you've created a 'data' subdirectory and put the count table file there.
metacountdata <- read.table("./data/salmon.merged.gene_counts.tsv", header = TRUE, row.names = 1)

# Import metadata. Again, need a metadata.xlsx file in the data subdirectory.
meta <- read_excel("./data/metadata.xlsx")

# Remove 1st columns of metadata (gene_name)
counts <- metacountdata[ ,2:ncol(metacountdata)]

# Rename sample names to new sample IDs
counts <- counts[as.character(meta$sample_name)]
colnames(counts) <- meta$sample_ID

# Counts need to be rounded to integers
counts <- ceiling(counts)

5. Checking for outliers and batch effects

In this section we will create PCA plots and heatmaps to examine the relationships between samples. Outlier samples and batch effects can heavily bias your results and should be addressed (e.g. removal of outlier samples from the dataset) before any differential expression analysis is completed.

First we need to prepare the data for plotting. Copy, paste and run the following code in your R script. you will need to input which treatment groups you wish to plot (plotgroups <-) from the set of available treatment groups (which you can find out with unique(meta$group).

Code Block
#### 5. Outliers and batch effects ####

# This section normalises and transforms the count data so that it can be plotted on a PCA plot and a heatmap

## USER INPUT
# Choose the groups you want to plot in a PCA/Heatmap. You can select any 2 or more of the groups (or all of the groups) you have in your 'groups' column of your metadata table
# To see what groups are present, run the following:
unique(meta$group)
# Now add which groups you want to plot (i.e. replace the groupnames below, and add more, separated by a comma and in "quotes", as needed). NOTE: R is case-sensitive, so these group names must be named EXACTLY the same as in the metadata table.
plotgroups <- c("Differentiated_cells", "Basal_cells")

# Pull out only the counts from the above groups
groupcounts <- counts[meta$group %in% plotgroups]

# Normalise counts by library size, using DeSeq2's estimateSizeFactors() function. Note that DeSeq2 does this internally during DEG calling. The normalisation below is done separately for PCA and density plotting.
# Set up the initial DeSeq2 experimental parameters.
condition <- factor(1:length(groupcounts))
# Set up the column data. A data frame of sample ID's and conditions
coldata <- data.frame(row.names=colnames(groupcounts), condition)
# Set up the DeSeq2 data set structure
f <- DESeqDataSetFromMatrix(countData = groupcounts, colData = coldata, design= ~ condition)
# Estimate the size factors. See DeSeq2 manual for details
f <- estimateSizeFactors(f)
# Size factors can be viewed by: sizeFactors(f)

# Multiply each row (sample) by the corresponding size factor
subcount_norm <- as.matrix(groupcounts) %*% diag(sizeFactors(f))
# Re-add column names
colnames(subcount_norm) <- colnames(groupcounts)

## Remove low coverage transcripts (mean count < 10) ##

# Find the mean of each row (and output as a data frame object)
means <- as.data.frame(rowMeans(subcount_norm))
# Then join the means data with the counts
means <- cbind(means, subcount_norm)
# Then subset out only genes with mean > 10
data <- subset(means, means[ , 1] > 10)
# Remove the means column
data <- data[,-1]

# Transform data
data_log <- vst(round(as.matrix(data)))
# Transformation can create some infinite values. Can't generate PCA data on these. Can see how many by: sum(sapply(data_log, is.infinite))
# To remove infinite rows, use 'is.finte' or '!is.infinite'
data_log <- data_log[is.finite(rowSums(data_log)),]
colnames(data_log) <- colnames(groupcounts)

### Set up the PCA plot base data ###

# We're using the FactoMineR package to generate PCA plots (http://factominer.free.fr/index.html)

# Need to transpose the data first
data_log_t <- t(data_log)
# Add the group data
data_log_t_vars <- data.frame(meta$group[meta$group %in% plotgroups], data_log_t)
# Generate the PCA data using FactoMineR package
res.pca <- PCA(data_log_t_vars, quali.sup = 1, graph=FALSE)

## Set up the dendogram/heatmaps base data ##

# Calculate the distance matrix:
distance_matrix <- as.matrix(dist(t(data_log)))

5a. PCA plot

Now you can run the following code in your R script to generate the PCA plot.

We will be using the factoextra package to generate both the PCA data and the PCA plot.

Code Block
#### 5a. PCA plot ####

# Generate the PCA plot. Groups are shaded with ellipses at 95% confidence level. NOTE: at least 4 replicates need to be in a group for an ellipses to be drawn.
# NOTE: change the group point colours by changing 'palette = ' below. Use the 'RColourBrewer' colour names (https://r-graph-gallery.com/38-rcolorbrewers-palettes.html). For example, if you are plotting 3 groups and choose palette = "Set1", this will use the first 3 colours from the Set1 colour palette.
p <- fviz_pca_ind(res.pca,
                  geom.ind = "point", # show points only (but not "text")
    

...

              col.ind = meta$group[meta$group %in% plotgroups], # color by groups
         

...

 

...

 

...

 

...

  

...

 

...

 

...

 

...

 

...

pointsize 

...

= 

...

5, 

...

title 

...

= 

...

"", legend.title = "Treatment groups", palette = 

...

"

...

Dark2",

...


   

...

 

...

 

...

 

...

 

...

 

...

 

...

   

...

 

...

 

...

 

...

 

...

 

...

 

...

addEllipses 

...

= 

...

TRUE, ellipse.type = "t", ellipse.level = 0.95) + theme(legend.text = element_text(size = 12), legend.title = element_text(size = 14), axis.title=element_text(size=16), axis.text=element_text(size=14))

p

# Output as publication quality (300dpi) tiff and pdf.
# This will name your output files with the treatment groups you selected.

# Create a 'figures' subdirectory where all figures will be output
dir.create("figures", showWarnings = FALSE)
# Create a (300dpi) tiff
ggsave(file = paste0("./figures/PCA_", paste(plotgroups, collapse = "_Vs_"), ".tiff"), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p)
# Create a pdf
ggsave(file = paste0("./figures/PCA_", paste(plotgroups, collapse = "_Vs_"), ".pdf"), device = "pdf", width = 10, height = 8, plot = p)

5b. Pairwise samples heatmap

Now generate the heatmap and dendrogram.

We will be using the pretty heatmap package to accomplish this.

Code Block
#### 5b. Samples heatmap and dendrogram ####

# This section plots a heatmap and dendrogram of pairwise relationships between samples. In this way you can see if samples cluster by treatment group.

# See here: https://davetang.org/muse/2018/05/15/making-a-heatmap-in-r-with-the-pheatmap-package/

# Define annotation column
annot_columns <- data.frame(meta$group[meta$group %in% plotgroups])
# Make the row names the sample IDs
row.names(annot_columns) <- meta$sample_ID[meta$group %in% plotgroups]
colnames(annot_columns) <- "Treatment groups"
# Need to factorise it
annot_columns[[1]] <- factor(annot_columns[[1]])

# Generate dendrogram and heatmap
pheatmap(distance_matrix, color=colorRampPalette(c("white", "#9999FF", "#990000"))(50), cluster_rows = TRUE, show_rownames = TRUE, treeheight_row = 0, treeheight_col = 70, fontsize_col = 12, annotation_names_col = F, annotation_col = annot_columns, filename = paste0("./figures/Pairwise_sample_heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff"))

# Notes about heatmap colours.
# You can change the colours used in the heatmap itself by changing the colour names (color=colorRampPalette....)
# If you want to change the annotation colours, see here: https://zhiganglu.com/post/pheatmap_change_annotation_colors/

6. Identify differentially expressed (DE) genes using DESeq2

Now we come to the main analysis section of this workflow, where we will identify differentially expressed genes. This will generate two important datapoints per gene, the log fold change, which shows the change in expression levels between two treatment groups for a specific gene, and the adjusted p value, which shows which genes are significantly differentially expressed (adjusted to remove false positives).

The R package we use to find DE genes is DESeq2, “Differential gene expression analysis based on the negative binomial distribution”.

DESeq2 estimates variance-mean dependence in count data from high-throughput sequencing assays and tests for differential expression based on a model using the negative binomial distribution.

Copy and run the below code into your R script.

You’ll need to choose two treatment groups you want to compare (degroups <-), from the list of treatment groups in your samples table.

Code Block
#### 6. Differential expression analysis ####

# In this section we use the Deseq2 package to identify differentially expressed genes.

## USER INPUT
# Choose the treatment groups you want to compare. 
# To see what groups are present, run the following:
unique(meta$group)
# Enter which groups you want to compare (two groups only). BASELINE OR CONTROL GROUP SHOULD BE LISTED FIRST.
degroups <- c("Basal_cells", "Differentiated_cells")

# From the count table, pull out only the counts from the above groups
expdata <- as.matrix(counts[,meta$group %in% degroups])

# Set up the experimental condition
# 'factor' sets up the reference level, i.e. which is the baseline group (otherwise the default baseline level is in alphabetic order)
condition <- factor(meta$group[meta$group %in% degroups], levels = degroups)
# Type 'condition' in the console to see is the levels are set correctly

# Set up column data (treatment groups and sample ID)
coldata <- data.frame(row.names=colnames(expdata), condition)

# Create the DESeq2 dataset (dds)
dds <- DESeq2::DESeqDataSetFromMatrix(countData=expdata, colData=coldata, design=~condition)
dds$condition <- factor(dds$condition, levels = degroups)

# Run DESeq2 to identify differentially expressed genes
deseq <- DESeq(dds)

# Extract a results table from the DESeq analysis
res <- results(deseq)

# Reorder results by adjusted p vales, so that the most signififcantly DE genes are at the top
res <- res[order(res$padj), ]

# You can do a summary of the results to see how many significantly (alpha=0.05, adjust to 0.01 if needed) upregulated and downregulated DE genes were found
summary(res, alpha=0.05)

# Convert from DESeq object to a data frame.
res <- data.frame(res)

# Look at the top 6 DE genes
head(res)

6a. Annotating your DE genes

The table of DE genes will have gene names according to the reference genome that was used to generate the count table. This is usually an Entrez gene ID if the NCBI genome was used, or an Ensemble gene ID if the Ensemble genome was used.

These Entrez/Ensembl gene IDs are a string of numbers that aren’t particularly informative. They can be individually looked to see what genes they represent, or we can annotate them all at the same time in R to match the Entrez/Ensembl gene IDs to their more commonly known gene symbol and description.

We'll be using bioconductor genome wide annotation packages to provide the annotation data. https://bioconductor.org/packages/3.17/data/annotation/

Annotation packages for human (org.Hs.eg.db), mouse (org.Mm.eg.db), rat (org.Rn.eg.db), E. coli strain K12 (org.EcK12.eg.db), E. coli strain Sakai (org.EcSakai.eg.db), zebrafish (org.Dr.eg.db) and Drosophila (org.Dm.eg.db) were installed with the required R packages. If your species is not in the above list, contact the eResearch team.

This workshop uses mouse data, so we will be using the org.Mm.eg.db annotation package here.

Code Block
#### 6a. Annotating your DE genes ####

# Annotation packages for human (org.Hs.eg.db), mouse (org.Mm.eg.db), rat (org.Rn.eg.db), E. coli strain K12 (org.EcK12.eg.db), E. coli strain Sakai (org.EcSakai.eg.db), zebrafish (org.Dr.eg.db) and Drosophila (org.Dm.eg.db) were installed with the required R packages. If your species is not in the above list, contact the eResearch team.

## USER INPUT
# Input your species genome below (select from the above list)
my_genome <- org.Mm.eg.db

# Pull out just the Entrez/Ensembl gene IDs
gene_ids <- row.names(res)

# You can see the list of annotations you can apply to your data by:
keytypes(my_genome)

# Annotate gene symbol and description to your DE gene IDs (you can add other keytypes from the above 'keytypes(my_genome)' list, if you choose)
cols <- c("SYMBOL", "GENENAME")

## USER INPUT
# Provide the gene ID type for your DE data
# If you have Ensemble IDs, enter "ENSEMBL", if you have Entrez IDs, enter "ENTREZID", if you have gene symbols, enter "SYMBOL"
idtype <- "ENSEMBL"

# Map the Entrez/Ensembl gene IDs to gene symbol and description
map <- AnnotationDbi::select(my_genome, keys=gene_ids, columns=cols, keytype=idtype)

# Combine the annotation data with the DE data
# Since we're matching Ensembl -> Entrez there aren't a 1:1 mapping, so need to merge rather than cbind
annot <- merge(x = res, y = map, by.x = 0, by.y = idtype, all = F)
# There isn't always a 1:1 mapping between gene identifiers, so we also need to remove duplicates
annot_nodups <- annot[!duplicated(annot$Row.names), ]
# Reorder by adjusted p
annot_nodups <- annot_nodups[order(annot_nodups$padj), ]

# Add normalised counts to the output table. This is so you can later plot expression trends for individual genes in R, Excel, etc.
# Need to normalise the counts first, using the size factors calculated by DESeq2 (in the 'deseq' object)
expdata_norm <- as.matrix(expdata) %*% diag(deseq$sizeFactor)
colnames(expdata_norm) <- colnames(expdata)
annot_counts <- merge(x = annot_nodups, y = expdata_norm, by.x = "Row.names", by.y = 0, all = TRUE)

# Export the full annotated dataset as a csv (for journal submission)
# Creates a 'Tables' subdirectory where all tables will be output
dir.create("Tables", showWarnings = FALSE)
write.csv(annot_counts, file=paste0("./Tables/all_genes_", paste(degroups, collapse = "_Vs_"), ".csv"), row.names = FALSE)

# Pull out just significant genes (change from 0.05 to 0.01 if needed)
DE_genes <- subset(annot_counts, padj < 0.05, select=colnames(annot_counts))
# NOTE: add lfc cutoffs if needed. E.g., log2FoldChange > 1 and < log2FoldChange -1 cutoff
# DE_genes <- subset(DE_genes, log2FoldChange > 1 | log2FoldChange < -1, select=colnames(DE_genes))

# Export the sig DE genes as a table (to be used in downstream analysis)
write.csv(DE_genes, file=paste0("./Tables/DE_genes_", paste(degroups, collapse = "_Vs_"), ".csv"), row.names = FALSE)

6b. Volcano plot

Now that we have our list of DE genes, we can visualise them with a volcano plot and heat map.

First, a volcano plot.

We’re using the EnhancedVolcano package ‘Publication-ready volcano plots with enhanced colouring and labeling’

The below plot only labels the top 20 DE genes, using mostly plot defaults. You can modify this code to label whatever genes you like, change plot colours, change axis labels, change log fold change cutoffs, etc, by following the EnhancedVolcano guide here:https://bioconductor.org/packages/release/bioc/html/EnhancedVolcano.html

Code Block
#### 6b. Volcano plot ####

p <- EnhancedVolcano(annot_nodups, lab = annot_nodups$SYMBOL, selectLab = annot_nodups$SYMBOL[1:20], drawConnectors = TRUE, title = NULL, subtitle = NULL, x = 'log2FoldChange', y = 'pvalue')

p

# NOTE: the above plot shows labels for the top significantly DE (i.e. by lowest adjusted p value) genes.

# Output as publication quality (300dpi) tiff and pdf.
# Create a (300dpi) tiff
ggsave(file = paste0("./figures/volcano_", paste(degroups, collapse = "_Vs_"), ".tiff"), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p)
# Create a pdf
ggsave(file = paste0("./figures/volcano_", paste(degroups, collapse = "_Vs_"), ".pdf"), device = "pdf", width = 10, height = 8, plot = p)

6c. DE genes heatmap

This section plots a heatmap and dendrogram of DE gene expression per sample. Expression counts are scaled and centered so that groupwise relationships can be examined.

Code Block
#### 6c. DE genes heatmaps and dendrograms ####

# Make the row names gene symbols.
DE_genes <- na.omit(DE_genes)
row.names(DE_genes) <- make.unique(DE_genes$SYMBOL)

# sort by p-value
DE_genes <- DE_genes[order(DE_genes$padj), ]

# Pull out normalised counts only
siggc <- DE_genes[colnames(DE_genes) %in% colnames(expdata)]

# Scale and center each row. This is important to visualise relative differences between groups and not have row-wise colouration dominated by high or low gene expression.
xts <- scale(t(siggc))
xtst <- t(xts)

# Define annotation column
annot_columns <- data.frame(meta$group[meta$group %in% degroups])
# Make the row names the sample IDs
row.names(annot_columns) <- meta$sample_ID[meta$group %in% degroups]
colnames(annot_columns) <- "Treatment groups"
# Need to factorise it
annot_columns[[1]] <- factor(annot_columns[[1]])

# Generate dendrogram and heatmap for ALL DE genes
pheatmap(xtst, color=colorRampPalette(c("#D55E00", "white", "#0072B2"))(100), annotation_col=annot_columns, annotation_names_col = F, fontsize_col = 12, fontsize_row = 7, labels_row = row.names(siggc), show_rownames = F, filename = paste0("./figures/All_DEG_Heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff"))

# Generate dendrogram and heatmap for top 20 DE genes
pheatmap(xtst[1:20,], color=colorRampPalette(c("#D55E00", "white", "#0072B2"))(100), annotation_col=annot_columns, annotation_names_col = F, fontsize_col = 14, fontsize_row = 12, labels_row = row.names(siggc), show_rownames = T, filename = paste0("./figures/Top_DEG_Heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff"))
# NOTE: you can plot more than 20 top genes by adjusting 'xtst[1:20,]'. If you wanted to plot the top 50 genes you'd change this to 'xtst[1:50,]'