Task 4 : Differential expression analysis using Rstudio

Aim:

Identify statistically significant (FDR < 0.05) differentially expressed genes.

Requirements

Run your samples (FASTQ) using the nextflow nf-core/RNA-seq pipeline using ‘star_salmon’ (Task 3 session) or an alternative pipeline that generates feature counts.

Installing R and Rstudio

You will need to have both R and Studio installed in your laptop/desktop for today’s exercises.

Go to the following page https://posit.co/download/rstudio-desktop/ and follow instructions provided to install first R and then Rstudio.

Install R:

https://cran.r-project.org/bin/windows/base/

install Rstudio:

https://posit.co/download/rstudio-desktop/

RNAseq gene expression feature counts

Go to the /sandpit/demo/run3_full_pipeline/ folder that contains the results for running the complete RNAseq pipeline. The output folders from task 3 look like this:

├── results
│   ├── fastqc
│   ├── multiqc
│   ├── pipeline_info
│   ├── star_salmon
│   └── trimgalore

The gene expression feature counts can be found in the /results/star_salmon/ folder. Let’s access the folder (i.e., cd /run3/results/star_salmon). A list of files and folders in the star_salmon folder will look like this:

.
├── bigwig
├── CD49fmNGFRm_rep1
├── CD49fmNGFRm_rep1.markdup.sorted.bam
├── CD49fmNGFRm_rep1.markdup.sorted.bam.bai
├── CD49fmNGFRm_rep2
├── CD49fmNGFRm_rep2.markdup.sorted.bam
├── CD49fmNGFRm_rep2.markdup.sorted.bam.bai
├── CD49fmNGFRm_rep3
├── CD49fmNGFRm_rep3.markdup.sorted.bam
├── CD49fmNGFRm_rep3.markdup.sorted.bam.bai
├── CD49fpNGFRp_rep1
├── CD49fpNGFRp_rep1.markdup.sorted.bam
├── CD49fpNGFRp_rep1.markdup.sorted.bam.bai
├── CD49fpNGFRp_rep2
├── CD49fpNGFRp_rep2.markdup.sorted.bam
├── CD49fpNGFRp_rep2.markdup.sorted.bam.bai
├── CD49fpNGFRp_rep3
├── CD49fpNGFRp_rep3.markdup.sorted.bam
├── CD49fpNGFRp_rep3.markdup.sorted.bam.bai
├── deseq2_qc
├── dupradar
├── featurecounts
├── log
├── MTEC_rep1
├── MTEC_rep1.markdup.sorted.bam
├── MTEC_rep1.markdup.sorted.bam.bai
├── MTEC_rep2
├── MTEC_rep2.markdup.sorted.bam
├── MTEC_rep2.markdup.sorted.bam.bai
├── MTEC_rep3
├── MTEC_rep3.markdup.sorted.bam
├── MTEC_rep3.markdup.sorted.bam.bai
├── picard_metrics
├── qualimap
├── rseqc
├── salmon.merged.gene_counts_length_scaled.rds
├── salmon.merged.gene_counts_length_scaled.tsv
├── salmon.merged.gene_counts.rds
├── salmon.merged.gene_counts_scaled.rds
├── salmon.merged.gene_counts_scaled.tsv
├── salmon.merged.gene_counts.tsv  <---- We will use this feature counts file for DESeq2 expression analysis. 
├── salmon.merged.gene_tpm.tsv
├── salmon.merged.transcript_counts.rds
├── salmon.merged.transcript_counts.tsv
├── salmon.merged.transcript_tpm.tsv
├── salmon_tx2gene.tsv
├── samtools_stats
└── stringtie

The expression count file that we are interested is salmon.merged.gene_counts.tsv Let's see the content of the file by printing the top lines using the following command:

head salmon.merged.gene_counts.tsv

the above command will print:

gene_id gene_name       CD49fmNGFRm_rep1        CD49fmNGFRm_rep2        CD49fmNGFRm_rep3        CD49fpNGFRp_rep1        CD49fpNGFRp_rep2        CD49fpNGFRp_rep3     MTEC_rep1       MTEC_rep2       MTEC_rep3
ENSMUSG00000000001      Gnai3   2460    2395    2749    2686    3972    4419    7095    4484    6414
ENSMUSG00000000003      Pbsn    0       0       0       0       0       0       0       0       0
ENSMUSG00000000028      Cdc45   43      57      55      79      87.999  89      1241    830     1041.999
ENSMUSG00000000031      H19     2       0       1       17.082  24      16.077  200     139     145.604
ENSMUSG00000000037      Scml2   8       8       16      23      29.001  29      69      57      67
ENSMUSG00000000049      Apoh    1       0       2       1       2       0       0       0       2
ENSMUSG00000000056      Narf    522     496     539     368     457     538     1939    1483    1734
ENSMUSG00000000058      Cav2    1352.999        1349    1371.999        2684.001        4370    4386    6018.999        3429    5501
ENSMUSG00000000078      Klf6    4411    3492    4500    3221    3989    4637    3812    2741    3558

Now let’s copy the ‘salmon.merged.gene_counts.tsv’ file to your laptop/desktop using the file finder.

Windows - in the file finder type: \\hpc-fs\work\
Mac - open ‘Finder’ → press ‘command + k’ → type: ‘smb://hpc-fs/work/’

Now let’s find the full path to the ‘salmon.merged.gene_counts.tsv’ file:

Windows:
- see: https://www.lifewire.com/change-directories-in-command-prompt-5185508
Mac:
- cd /folder/that/contains/feature_counts/
- pwd
Rstudio:
- Open Rstudio, go to the top bar a click on “Session” → “Select working directory: → “Choose directory”
- The path to the directory will be printed in Rstudio console, copy and paste in line 1 of the script ‘RNAseq_DESeq2_analysis.R’ (see below)

Differential Expression Analysis using DESeq2

We will now perform the following tasks using Rstudio:

install R package manager
install R packages (only needed once) - After installation we only need to load the packages
set working directory (directory where the gene counts file is located)
run DESeq2 analysis, export results and generate plots

Open R Studio and if not prompted automatically, create a R script file as follows:

Open Rstudio → click on ‘File’ in the top bar → select “New File” → select “R script”

Install R package manager

We will use Bioconductor to install the “BiocManager”. Copy and paste the following code into the R studio script. Select all three lines in the R script and click on “Run”. Note: version 3.17 requires a R version 4.3 (which.

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(version = "3.17")

Install / Load packages:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")

#load packages
library("DESeq2")
library(ggplot2)
library("ggrepel")
library("EnhancedVolcano")

#install.packages("dendsort")
library(dendsort)
library("reshape2")
library("plyr")
library(Cairo)
library(pheatmap)
library("dendextend")
require(reshape2)

Prepare INPUT files

INPUT 1: Create a folder called ‘data’ and copy the gene counts file to a folder. For example, the gene counts table looks like this:

gene_id gene_name       CD49fmNGFRm_rep1        CD49fmNGFRm_rep2        CD49fmNGFRm_rep3        CD49fpNGFRp_rep1        CD49fpNGFRp_rep2        CD49fpNGFRp_rep3     MTEC_rep1       MTEC_rep2       MTEC_rep3
ENSMUSG00000000001      Gnai3   2460    2395    2749    2686    3972    4419    7095    4484    6414
ENSMUSG00000000003      Pbsn    0       0       0       0       0       0       0       0       0
ENSMUSG00000000028      Cdc45   43      57      55      79      87.999  89      1241    830     1041.999
ENSMUSG00000000031      H19     2       0       1       17.082  24      16.077  200     139     145.604
ENSMUSG00000000037      Scml2   8       8       16      23      29.001  29      69      57      67
ENSMUSG00000000049      Apoh    1       0       2       1       2       0       0       0       2
ENSMUSG00000000056      Narf    522     496     539     368     457     538     1939    1483    1734
ENSMUSG00000000058      Cav2    1352.999        1349    1371.999        2684.001        4370    4386    6018.999        3429    5501
ENSMUSG00000000078      Klf6    4411    3492    4500    3221    3989    4637    3812    2741    3558
ENSMUSG00000000085      Scmh1   1561.002        1549.002        1694.001        2276.001        2663    3290.999        2437.001        1759.001    2325.002
ENSMUSG00000000088      Cox5a   1136    1425    1691    947     1410    1613.998        3173.999        1873.992        2865.998
ENSMUSG00000000093      Tbx2    74      40      66      354     380     460     444     295     436
ENSMUSG00000000094      Tbx4    1       0       0       0       0       0       1       1       1
ENSMUSG00000000103      Zfy2    0       0       0       0       0       0       0       0       0
ENSMUSG00000000120      Ngfr    379     313     363     6577    7083    8670    4050    3336    3534
ENSMUSG00000000125      Wnt3    11      19      21      1       0       0       3       0       3
ENSMUSG00000000126      Wnt9a   69      104     89      405     480     554     1663    1012    1730
ENSMUSG00000000127      Fer     696     776     898     558     777     958     911     700     892
ENSMUSG00000000131      Xpo6    2206    2172    2408    2402    2866    3327.002        3919    3204    3763

INPUT 2: Sample metadata:

sample_name	sample_ID	group
CD49fmNGFRm_rep1	DC1	Differentiated_cells
CD49fmNGFRm_rep2	DC2	Differentiated_cells
CD49fmNGFRm_rep3	DC3	Differentiated_cells
CD49fpNGFRp_rep1	BC1	Basal_cells
CD49fpNGFRp_rep2	BC2	Basal_cells
CD49fpNGFRp_rep3	BC3	Basal_cells
MTEC_rep1	mTEC1	Murine_tracheal_epithelial_cell
MTEC_rep2	mTEC2	Murine_tracheal_epithelial_cell
MTEC_rep3	mTEC3	Murine_tracheal_epithelial_cell

R script - run the script using Rstudio

A copy of the script is at /demo/run3_full_pipeline/

RNAseq_DESeq2.R script:

# Differential expression analysis

# When you see '## USER INPUT', this means you have to modify the code for your computer or dataset. All other code can be run as-is (i.e. you don't need to understand the code, just run it)

#### Installing required packages ####

# **NOTE: this section only needs to be run once (or occasionally to update the packages)
# Install devtools
install.packages("devtools", repos = "http://cran.us.r-project.org")
# Install R packages. This only needs to be run once.
bioconductor_packages <- c("DESeq2", "EnhancedVolcano")
cran_packages <- c("ggrepel", "ggplot2", "plyr", "reshape2", "readxl", "FactoMineR", "factoextra")
# Compares installed packages to above packages and returns a vector of missing packages
new_packages <- bioconductor_packages[!(bioconductor_packages %in% installed.packages()[,"Package"])]
new_cran_packages <- cran_packages[!(cran_packages %in% installed.packages()[,"Package"])]
# Install missing bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(new_packages)
# Install missing cran packages
if (length(new_cran_packages)) install.packages(new_cran_packages, repos = "http://cran.us.r-project.org")
# Update all installed packages are the latest version
update.packages(bioconductor_packages, ask = FALSE)
update.packages(cran_packages, ask = FALSE, repos = "http://cran.us.r-project.org")


#### Loading required packages ####

# This section needs to be run very time
# Load packages
bioconductor_packages <- c("DESeq2", "EnhancedVolcano")
cran_packages <- c("ggrepel", "ggplot2", "plyr", "reshape2", "readxl", "FactoMineR", "factoextra")
lapply(cran_packages, require, character.only = TRUE)
lapply(bioconductor_packages, require, character.only = TRUE)


#### Import your count data ####

# Make sure you have: a) your count table (salmon.merged.gene_counts.tsv file, if you used Nextflow nfcore/rnaseq to analyse your data). Copy this to a subdirectory called 'data'. b) your metadata file. This should be either an Excel file called 'metadata.xlsx' or a tab-separated text file called 'metadata.txt'. It needs 3 columns called 'sample_name', 'sample_ID' and 'group'. The sample names should be EXACTLY the same as the names in the count table. These names are often uninformative and long, so the 'sample_ID' is the sample labels you want to put on your plots. E.g. if you have a 'high fat' group, you might want to rename the samples HF1, HF2, HF3, etc)

## USER INPUT
# Set working directory. 
# Change this to your working directory (In the RStudio menu: Session -> Set working directory -> Choose working directory)
setwd("C:/Users/whatmorp/OneDrive - Queensland University of Technology/Desktop/Projects/RNA-Seq downstream analysis")

# Import your count data. make sure you've created a 'data' subdirectory and put the count table file there.
metacountdata <- read.table("./data/salmon.merged.gene_counts.tsv", header = TRUE, row.names = 1)

# Import metadata (option 1: text file). Again, need a metadata.txt file in the data subdirectory.
meta <- read.table("./data/metadata.txt", header = T)
# Import metadata (option 1: Excel file))
#meta <- read_excel("./data/metadata.xlsx")

# Remove 1st columns of metadata (gene_name)
counts <- metacountdata[ ,2:ncol(metacountdata)]

# Rename sample names to new sample IDs
counts <- counts[as.character(meta$sample_name)]
colnames(counts) <- meta$sample_ID

# Counts need to be rounded to integers
counts <- ceiling(counts)


#### Outliers and batch effects ####

# This section normalises and transforms the count data so that it can be plotted on a PCA plot and a heatmap

# Normalise counts by library size ##
# Using DeSeq2's estimateSizeFactors(). Note that DeSeq2 does this internally during DEG calling. The normalisation below is done separately for PCA and density plotting.

## USER INPUT
# Choose the groups you want to plot in a PCA/Heatmap. You can select any 2 or more of the groups (or all of the groups) you have in your 'groups' column of your metadata table
# To see what groups are present, run the following:
unique(meta$group)
# Now add which groups you want to plot (i.e. replace the groupnames below, and add more, separated by a comma and in "quotes", as needed). NOTE: R is case-sensitive, so these group names must be named EXACTLY the same as in the metadata table.
plotgroups <- c("Differentiated_cells", "Basal_cells")

# Pull out only the counts from the above groups
groupcounts <- counts[meta$group %in% plotgroups]

# Set up the initial DeSeq2 experimental parameters. This is the length of the no' of samples being analysed. In full DeSeq2 analysis this is to separate groups as factors, but here a simple count is fine.
condition <- factor(1:length(groupcounts))
# Set up the column data. A data frame of sample ID's and conditions
coldata <- data.frame(row.names=colnames(groupcounts), condition)
# Set up the DeSeq2 data set structure
f <- DESeqDataSetFromMatrix(countData = groupcounts, colData = coldata, design= ~ condition)
# Estimate the size factors. See DeSeq2 manual for details
f <- estimateSizeFactors(f)
# Size factors can be viewed by: sizeFactors(f)

# Multiply each row (sample) by the corresponding size factor
subcount_norm <- as.matrix(groupcounts) %*% diag(sizeFactors(f))

# Re-add column names
colnames(subcount_norm) <- colnames(groupcounts)

## Remove low coverage transcripts (mean count < 10) ##

# Find the mean of each row (and output as a data frame object)
means <- as.data.frame(rowMeans(subcount_norm))

# Then join the means data with the counts
means <- cbind(means, subcount_norm)

# Then subset out only genes with mean > 10
data <- subset(means, means[ , 1] > 10)

# Remove the means column
data <- data[,-1]

# Transform data
data_log <- vst(round(as.matrix(data)))

# Transformation can create some infinite values. Can't generate PCA data on these. Can see how many by: sum(sapply(data_log, is.infinite))
# To remove infinite rows, use 'is.finte' or '!is.infinite'
data_log <- data_log[is.finite(rowSums(data_log)),]
colnames(data_log) <- colnames(groupcounts)

### Set up the PCA plot base data ###

# We're using the FactoMineR package to generate PCA plots (http://factominer.free.fr/index.html)

# Need to transpose the data first
data_log_t <- t(data_log)

# Add the group data
data_log_t_vars <- data.frame(meta$group[meta$group %in% plotgroups], data_log_t)

# Generate the PCA data using FactoMineR package
res.pca <- PCA(data_log_t_vars, quali.sup = 1, graph=FALSE)

## Set up the dendogram/heatmaps base data ##

# Calculate the distance matrix:
distance_matrix <- as.matrix(dist(t(data_log)))


#### PCA plot ####

# Generate the PCA plot. Groups are shaded with elipses at 95% confidence level. NOTE: at least 4 replicates need to be in a group for an elipses to be drawn.
p <- fviz_pca_ind(res.pca,
                  geom.ind = "point", # show points only (but not "text")
                  col.ind = meta$group[meta$group %in% plotgroups], # color by groups
                  pointsize = 5, title = "",
                  addEllipses = TRUE, ellipse.type = "t", ellipse.level = 0.95) + theme(legend.text = element_text(size = 12), legend.title = element_text(size = 14), axis.title=element_text(size=16), axis.text=element_text(size=14))

p

## USER INPUT. Change these to 
pdf_filename <- "pca.pdf"
tiff_filename <- "pca.tiff"

# Output as publication quality (300dpi) tiff and pdf (Note: this will create a 'figures subdirectory where all figures will be output)
dir.create("figures", showWarnings = FALSE)
ggsave(file = paste0("./figures/", tiff_filename), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p)
ggsave(file = paste0("./figures/", pdf_filename), device = "pdf", width = 10, height = 8, plot = p)


#### Heatmap and dendrogram ####