...
This is the preferred method, as R and RStudio are already installed, as are all the R packages required for the analysis. Installing all of these can take over 30 minutes on your own PC, so using an rVDI machine saves time.
...
To access and run an rVDI virtual desktop:
Go to https://rvdi.qut.edu.au/
Click on ‘VMware Horizon HTML Access’
Log on with your QUT username and password
*NOTE: you need to be connected to the QUT network first, either by being on campus or by connecting remotely via VPN.
Option 2: Install R and RStudio on your own PC
...
The basis for the differential expression analysis is a count table of sequence reads mapped to defined gene regions per sample. There are a variety of methods to generate this count table, but for this exercise we will be using the output from the Nextflow nfcore/rnaseq analysis you completed in the previous workshop sessions.
To access this count table:
Go to the W:\training\rnaseq\runs\run3_RNAseq\results folder, which contains the results from running the nfcore/rnaseq pipeline. The output folders from task 3 look like this:
...
The count table can be found in the /results/star_salmon/ folder. Let’s access the folder (i.e., cd /run3/results/star_salmon). A list of files and folders in the star_salmon folder will look like this:
...
The expression count file that we are interested in is salmon.merged.gene_counts.tsv
Let's see the content of the file by printing the top lines using the following command (in PuTTY):
Code Block |
---|
head salmon.merged.gene_counts.tsv |
The above command will print:
Code Block |
---|
gene_id gene_name CD49fmNGFRm_rep1 CD49fmNGFRm_rep2 CD49fmNGFRm_rep3 CD49fpNGFRp_rep1 CD49fpNGFRp_rep2 CD49fpNGFRp_rep3 MTEC_rep1 MTEC_rep2 MTEC_rep3
ENSMUSG00000000001 Gnai3 2460 2395 2749 2686 3972 4419 7095 4484 6414
ENSMUSG00000000003 Pbsn 0 0 0 0 0 0 0 0 0
ENSMUSG00000000028 Cdc45 43 57 55 79 87.999 89 1241 830 1041.999
ENSMUSG00000000031 H19 2 0 1 17.082 24 16.077 200 139 145.604
ENSMUSG00000000037 Scml2 8 8 16 23 29.001 29 69 57 67
ENSMUSG00000000049 Apoh 1 0 2 1 2 0 0 0 2
ENSMUSG00000000056 Narf 522 496 539 368 457 538 1939 1483 1734
ENSMUSG00000000058 Cav2 1352.999 1349 1371.999 2684.001 4370 4386 6018.999 3429 5501
ENSMUSG00000000078 Klf6 4411 3492 4500 3221 3989 4637 3812 2741 3558 |
Now let’s copy the ‘salmon.merged.gene_counts.tsv’ file to your laptop/desktop using the file finder.
Now let’s find the full path to the ‘salmon.merged.gene_counts.tsv’ file:
Differential Expression Analysis using DESeq2
We will now perform the following tasks using RStudio:
1. Preparing your data. Two data files are needed for this analysis: a samples table and your count table
2. Install required R packages (only need to run once). After installation we only need to load the packages. NOTE: If using an rVDI virtual machine, the R packages are already installed
3. Load required R packages. Unlike installing the packages, this needs to be done every time you run the analysis
4. Import your data files (count table and samples table) into R
5. Checking for outliers and batch effects
    a. PCA plot
    b. Pairwise samples heatmap
6. Identify differentially expressed (DE) genes using DESeq2
    a. Annotating your DE genes
    b. Volcano plot
    c. DE genes heatmap
1. Preparing your data
You will need 2 data files to complete this analysis: your count table (see above) and a samples table.
a. First create a new folder in H:\workshop\RNAseq. Call it something suitable, such as ‘DE_analysis_workshop’
b. Create a subfolder here called ‘data’. This is where your two data files will be stored
c. Copy the count table (the ‘salmon.merged.gene_counts.tsv’ file) to the ‘data’ folder you just created.
The count table looks like this:
Code Block |
---|
gene_id gene_name SRR20622172 SRR20622173 SRR20622174 SRR20622175 SRR20622176 SRR20622177 SRR20622178 SRR20622179 SRR20622180
ENSMUSG00000000001 Gnai3 7086 4470 2457.002 2389 6398 2744 2681 3961 4399
ENSMUSG00000000003 Pbsn 0 0 0 0 0 0 0 0 0
ENSMUSG00000000028 Cdc45 1232.999 827 42 57 1036 55 78 88 89
ENSMUSG00000000031 H19 200 139 2 0 143.622 1 17.082 24 16.077
ENSMUSG00000000037 Scml2 70 57.001 8 8 66.999 16 23 27.999 29
ENSMUSG00000000049 Apoh 0 0 1 0 2 2 1 3 0
ENSMUSG00000000056 Narf 1933 1480 519 497 1730 539 365 458 536
ENSMUSG00000000058 Cav2 6008 3417 1347.001 1344 5482 1367 2669.001 4358 4365.832
ENSMUSG00000000078 Klf6 3809 2732 4413.001 3483.978 3559 4491 3209 3980 4626 |
d. In the same W:\training\rnaseq\runs\run3_RNAseq\results\star_salmon directory there will be a file called metadata.xlsx. Copy this file to your ‘data’ folder as well. You will normally need to create this file manually to match your sample IDs and treatment groups, but we have already created it for you to use. This samples table needs 3 columns: ‘sample_name’, containing the sample names seen in the count table (its column names); ‘sample_ID’, the (less messy) names you want to call the samples in this analysis workflow; and ‘group’, the treatment group each sample belongs to. The contents of this file look like this:
Code Block |
---|
sample_name sample_ID group
SRR20622174 DC1 Differentiated_cells
SRR20622175 DC2 Differentiated_cells
SRR20622177 DC3 Differentiated_cells
SRR20622178 BC1 Basal_cells
SRR20622179 BC2 Basal_cells
SRR20622180 BC3 Basal_cells
SRR20622172 mTEC1 Murine_tracheal_epithelial_cell
SRR20622173 mTEC2 Murine_tracheal_epithelial_cell
SRR20622176 mTEC3 Murine_tracheal_epithelial_cell |
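The requirements on the samples table (three named columns, and sample names that match the count table exactly) can be checked before running the analysis. The following is a minimal sketch in base R, assuming small made-up stand-ins (`count_columns`, `meta_example`) rather than the real workshop files; these object names are hypothetical and not part of the workshop script.

```r
# Toy stand-ins for the real objects (hypothetical values, for illustration only)
count_columns <- c("SRR20622172", "SRR20622173", "SRR20622174")
meta_example <- data.frame(
  sample_name = c("SRR20622174", "SRR20622172", "SRR20622173"),
  sample_ID   = c("DC1", "mTEC1", "mTEC2"),
  group       = c("Differentiated_cells",
                  "Murine_tracheal_epithelial_cell",
                  "Murine_tracheal_epithelial_cell"),
  stringsAsFactors = FALSE
)

# The three column names the samples table must contain
required_cols <- c("sample_name", "sample_ID", "group")
missing_cols <- setdiff(required_cols, colnames(meta_example))

# Sample names in the metadata that are absent from the count table
unmatched <- setdiff(meta_example$sample_name, count_columns)

# sample_ID values should be unique because they become column names
duplicated_ids <- meta_example$sample_ID[duplicated(meta_example$sample_ID)]

metadata_ok <- length(missing_cols) == 0 &&
  length(unmatched) == 0 &&
  length(duplicated_ids) == 0
print(metadata_ok)
```

If `metadata_ok` is FALSE, printing `missing_cols`, `unmatched` and `duplicated_ids` shows which requirement failed.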
...
e. Open RStudio and create a new R script (‘File’ → ‘New File’ → ‘R Script’). Now select ‘File’ → ‘Save’ and save the script in the analysis workshop folder you created in step a (NOT in the ‘data’ folder). Give the script file a name (e.g. DESeq2.R).
The following analysis contains R code (in the grey text boxes) that you can copy and paste into the R script you just created, then run. A copy of the full script is at /demo/run3_full_pipeline/
2. Install required R packages
Copy and paste the following code into the R script you just created, then run the code (highlight all the code in your R script, then press the run button). This will install all the required packages and dependencies and may take 30 minutes or more to complete. It may prompt you occasionally to update packages - select 'a' for all if/when this occurs.
...
NOTE: you only need to run this section once on any laptop/PC, and you don’t need to run it if you’re using an rVDI machine.
Code Block |
---|
#### Differential expression analysis ####
# When you see '## USER INPUT', this means you have to modify the code for your computer or dataset. All other code can be run as-is (i.e. you don't need to understand the code, just run it)
#### 2. Installing required packages ####
# **NOTE: this section only needs to be run once (or occasionally to update the packages)
# Install devtools
install.packages("devtools", repos = "http://cran.us.r-project.org")
# Install R packages. This only needs to be run once.
bioconductor_packages <- c("DESeq2", "EnhancedVolcano", "org.Hs.eg.db", "org.Mm.eg.db", "org.Rn.eg.db", "org.EcK12.eg.db", "org.EcSakai.eg.db", "org.Dr.eg.db", "org.Dm.eg.db")
cran_packages <- c("ggrepel", "ggplot2", "plyr", "reshape2", "readxl", "FactoMineR", "factoextra", "pheatmap")
# Compare installed packages to the above packages and return a vector of missing packages
new_packages <- bioconductor_packages[!(bioconductor_packages %in% installed.packages()[,"Package"])]
new_cran_packages <- cran_packages[!(cran_packages %in% installed.packages()[,"Package"])]
# Install missing bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
if (length(new_packages)) BiocManager::install(new_packages)
# Install missing cran packages
if (length(new_cran_packages)) install.packages(new_cran_packages, repos = "http://cran.us.r-project.org")
# Update all installed packages to the latest version
# (the package names must be passed as 'oldPkgs'; the first argument of update.packages() is a library path)
update.packages(oldPkgs = bioconductor_packages, ask = FALSE, repos = BiocManager::repositories())
update.packages(oldPkgs = cran_packages, ask = FALSE, repos = "http://cran.us.r-project.org") |
...
Code Block |
---|
#### 4. Import your count data ####
# Make sure you have:
# a) your count table (salmon.merged.gene_counts.tsv file, if you used Nextflow nfcore/rnaseq to analyse your data). Copy this to a subdirectory called 'data'.
# b) your metadata file. This should be either an Excel file called 'metadata.xlsx' or a tab-separated text file called 'metadata.txt'. It needs 3 columns called 'sample_name', 'sample_ID' and 'group'. The sample names should be EXACTLY the same as the names in the count table. These names are often uninformative and long, so the 'sample_ID' is the sample labels you want to put on your plots. E.g. if you have a 'high fat' group, you might want to rename the samples HF1, HF2, HF3, etc.
## USER INPUT
# Set working directory.
# Change this to your working directory (In the RStudio menu: Session -> Set working directory -> Choose working directory)
setwd("C:/Users/whatmorp/OneDrive - Queensland University of Technology/Desktop/Projects/RNA-Seq downstream analysis")
# Import your count data. Make sure you've created a 'data' subdirectory and put the count table file there.
metacountdata <- read.table("./data/salmon.merged.gene_counts.tsv", header = TRUE, row.names = 1)
# Import metadata. Again, you need a metadata.xlsx file in the data subdirectory.
meta <- read_excel("./data/metadata.xlsx")
# Remove the 1st column of the count table (gene_name)
counts <- metacountdata[ ,2:ncol(metacountdata)]
# Reorder the count columns to match the metadata, then rename sample names to new sample IDs
counts <- counts[as.character(meta$sample_name)]
colnames(counts) <- meta$sample_ID
# Counts need to be integers. ceiling() rounds each value up to the next whole number
counts <- ceiling(counts) |
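To see what the column selection, renaming and rounding steps in the import code do, the same operations can be tried on a tiny made-up table. This is only an illustrative sketch: `toy_counts`, `toy_meta` and the sample names `SRR1` to `SRR3` are hypothetical and do not come from the workshop data.

```r
# Hypothetical miniature count table: gene_name plus three sample columns
toy_counts <- data.frame(
  gene_name = c("Gnai3", "Cdc45"),
  SRR1 = c(7086, 1232.999),
  SRR2 = c(2457.002, 42),
  SRR3 = c(2389, 57),
  row.names = c("ENSMUSG00000000001", "ENSMUSG00000000028")
)
# Hypothetical metadata listing the samples in a different order
toy_meta <- data.frame(
  sample_name = c("SRR2", "SRR3", "SRR1"),
  sample_ID   = c("DC1", "DC2", "mTEC1"),
  stringsAsFactors = FALSE
)

# Drop the gene_name column, keeping only the numeric sample columns
toy <- toy_counts[, 2:ncol(toy_counts)]
# Reorder the columns to follow the metadata row order, then relabel them
toy <- toy[as.character(toy_meta$sample_name)]
colnames(toy) <- toy_meta$sample_ID
# Round fractional estimates up to whole numbers
toy <- ceiling(toy)
print(toy)
```

Because the columns are selected by name, the order of rows in the metadata decides the order of columns in the final table, which is why the sample names must match exactly.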
...
First we need to prepare the data for plotting. Copy, paste and run the following code in your R script. You will need to input which treatment groups you wish to plot (plotgroups <-) from the set of available treatment groups (which you can find with unique(meta$group)).
Code Block |
---|
#### 5. Outliers and batch effects ####
# This section normalises and transforms the count data so that it can be plotted on a PCA plot and a heatmap
## USER INPUT
# Choose the groups you want to plot in a PCA/Heatmap. You can select any 2 or more of the groups (or all of the groups) you have in your 'groups' column of your metadata table
# To see what groups are present, run the following:
unique(meta$group)
# Now add which groups you want to plot (i.e. replace the groupnames below, and add more, separated by a comma and in "quotes", as needed). NOTE: R is case-sensitive, so these group names must be named EXACTLY the same as in the metadata table.
plotgroups <- c("Differentiated_cells", "Basal_cells")
# Pull out only the counts from the above groups
groupcounts <- counts[meta$group %in% plotgroups]
# Normalise counts by library size, using DESeq2's estimateSizeFactors() function. Note that DESeq2 does this internally during DEG calling. The normalisation below is done separately for PCA and density plotting.
# Set up the initial DESeq2 experimental parameters.
condition <- factor(1:length(groupcounts))
# Set up the column data. A data frame of sample IDs and conditions
coldata <- data.frame(row.names=colnames(groupcounts), condition)
# Set up the DESeq2 data set structure
f <- DESeqDataSetFromMatrix(countData = groupcounts, colData = coldata, design= ~ condition)
# Estimate the size factors. See DESeq2 manual for details
f <- estimateSizeFactors(f)
# Size factors can be viewed by: sizeFactors(f)
# Divide each column (sample) by the corresponding size factor (normalised count = raw count / size factor)
subcount_norm <- as.matrix(groupcounts) %*% diag(1 / sizeFactors(f))
# Re-add column names
colnames(subcount_norm) <- colnames(groupcounts)
## Remove low coverage transcripts (mean count < 10) ##
# Find the mean of each row (and output as a data frame object)
means <- as.data.frame(rowMeans(subcount_norm))
# Then join the means data with the counts
means <- cbind(means, subcount_norm)
# Then subset out only genes with mean > 10
data <- subset(means, means[ , 1] > 10)
# Remove the means column
data <- data[,-1]
# Transform data
data_log <- vst(round(as.matrix(data)))
# Transformation can create some infinite values. Can't generate PCA data on these. Can see how many by: sum(sapply(data_log, is.infinite))
# To remove infinite rows, use 'is.finite' or '!is.infinite'
data_log <- data_log[is.finite(rowSums(data_log)),]
colnames(data_log) <- colnames(groupcounts)
### Set up the PCA plot base data ###
# We're using the FactoMineR package to generate PCA plots (http://factominer.free.fr/index.html)
# Need to transpose the data first
data_log_t <- t(data_log)
# Add the group data
data_log_t_vars <- data.frame(meta$group[meta$group %in% plotgroups], data_log_t)
# Generate the PCA data using FactoMineR package
res.pca <- PCA(data_log_t_vars, quali.sup = 1, graph=FALSE)
## Set up the dendrogram/heatmaps base data ##
# Calculate the distance matrix:
distance_matrix <- as.matrix(dist(t(data_log)))
|
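Library-size normalisation can be illustrated without DESeq2. The sketch below, in base R, assumes the median-of-ratios approach that DESeq2 documents for its default size factor estimation; it uses a made-up matrix `toy` and is meant to show the idea, not to reproduce DESeq2's implementation exactly.

```r
# Hypothetical counts: 4 genes x 3 samples; S2 was sequenced twice as deeply as S1 and S3
toy <- matrix(c(10, 20, 10,
                100, 200, 100,
                50, 100, 50,
                8, 16, 8),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("gene", 1:4), c("S1", "S2", "S3")))

# Log geometric mean of each gene across samples
log_geo_means <- rowMeans(log(toy))

# Size factor per sample: median ratio of its counts to the gene-wise geometric means
size_factors <- apply(toy, 2, function(sample_counts) {
  keep <- is.finite(log_geo_means) & sample_counts > 0
  exp(median(log(sample_counts[keep]) - log_geo_means[keep]))
})

# Normalised counts: divide each column (sample) by its size factor
toy_norm <- sweep(toy, 2, size_factors, "/")
print(round(size_factors, 3))
print(toy_norm)
```

After normalisation the three columns are identical, because the only difference between the toy samples was sequencing depth.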
5a. PCA plot
Now you can run the following code in your R script to generate the PCA plot.
The PCA data were generated in the previous step with the FactoMineR package. We will now use the factoextra package to generate the PCA plot.
Code Block |
---|
#### 5a. PCA plot ####
# Generate the PCA plot. Groups are shaded with ellipses at 95% confidence level. NOTE: at least 4 replicates need to be in a group for an ellipse to be drawn.
# NOTE: change the group point colours by changing 'palette = ' below. Use the 'RColorBrewer' colour names (https://r-graph-gallery.com/38-rcolorbrewers-palettes.html). For example, if you are plotting 3 groups and choose palette = "Set1", this will use the first 3 colours from the Set1 colour palette.
p <- fviz_pca_ind(res.pca,
geom.ind = "point", # show points only (but not "text")
col.ind = meta$group[meta$group %in% plotgroups], # color by groups
pointsize = 5, title = "", legend.title = "Treatment groups", palette = "Dark2",
addEllipses = TRUE, ellipse.type = "t", ellipse.level = 0.95) + theme(legend.text = element_text(size = 12), legend.title = element_text(size = 14), axis.title=element_text(size=16), axis.text=element_text(size=14))
p
# Output as publication quality (300dpi) tiff and pdf.
# This will name your output files with the treatment groups you selected.
# Create a 'figures' subdirectory where all figures will be output
dir.create("figures", showWarnings = FALSE)
# Create a (300dpi) tiff
ggsave(file = paste0("./figures/PCA_", paste(plotgroups, collapse = "_Vs_"), ".tiff"), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p)
# Create a pdf
ggsave(file = paste0("./figures/PCA_", paste(plotgroups, collapse = "_Vs_"), ".pdf"), device = "pdf", width = 10, height = 8, plot = p) |
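For readers who want to see what a PCA plot is built from, here is a minimal sketch using base R's `prcomp()` instead of the FactoMineR and factoextra functions used in the workshop code. The data are randomly generated and hypothetical; the point is only to show how the percentage of variance per component and the sample coordinates are obtained.

```r
set.seed(1)
# Hypothetical transformed expression values: 6 samples (rows) x 50 genes (columns)
toy <- matrix(rnorm(6 * 50), nrow = 6,
              dimnames = list(paste0("sample", 1:6), paste0("gene", 1:50)))
# Shift three samples in half of the genes so that there are two groups
toy[4:6, 1:25] <- toy[4:6, 1:25] + 3

# PCA on centred, unit-variance genes
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Percentage of total variance explained by each principal component
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
print(round(pct_var, 1))

# Sample coordinates on the first two components (what a PCA plot displays)
print(round(pca$x[, 1:2], 2))
```

Samples that sit close together in these coordinates have similar overall expression profiles, which is what you look for when checking for outliers and batch effects.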
...
We will be using the pheatmap (‘pretty heatmap’) package to accomplish this.
Code Block |
---|
#### 5b. Samples heatmap and dendrogram ####
# This section plots a heatmap and dendrogram of pairwise relationships between samples. In this way you can see if samples cluster by treatment group.
# See here: https://davetang.org/muse/2018/05/15/making-a-heatmap-in-r-with-the-pheatmap-package/
# Define annotation column
annot_columns <- data.frame(meta$group[meta$group %in% plotgroups])
# Make the row names the sample IDs
row.names(annot_columns) <- meta$sample_ID[meta$group %in% plotgroups]
colnames(annot_columns) <- "Treatment groups"
# Need to factorise it
annot_columns[[1]] <- factor(annot_columns[[1]])
# Generate dendrogram and heatmap
pheatmap(distance_matrix, color=colorRampPalette(c("white", "#9999FF", "#990000"))(50), cluster_rows = TRUE, show_rownames = TRUE, treeheight_row = 0, treeheight_col = 70, fontsize_col = 12, annotation_names_col = F, annotation_col = annot_columns, filename = paste0("./figures/Pairwise_sample_heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff"))
# Notes about heatmap colours.
# You can change the colours used in the heatmap itself by changing the colour names (color=colorRampPalette....)
# If you want to change the annotation colours, see here: https://zhiganglu.com/post/pheatmap_change_annotation_colors/ |
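The pairwise heatmap is drawn from a sample-to-sample distance matrix. The base R sketch below shows how such a matrix is calculated and how samples group by similarity, using a small made-up matrix `toy`; the values and sample labels are hypothetical.

```r
# Hypothetical transformed expression values: 3 genes (rows) x 4 samples (columns)
toy <- matrix(c(1, 1.2, 5, 5.1,
                2, 2.1, 8, 7.9,
                3, 2.9, 1, 1.2),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("gene", 1:3), c("A1", "A2", "B1", "B2")))

# dist() compares rows, so transpose to get sample-to-sample (Euclidean) distances
sample_dist <- as.matrix(dist(t(toy)))
print(round(sample_dist, 2))

# Hierarchical clustering of the samples, cut into two clusters
clusters <- cutree(hclust(dist(t(toy))), k = 2)
print(clusters)
```

Small distances (A1 to A2, B1 to B2) correspond to the pale cells of the heatmap, and the clustering corresponds to the dendrogram drawn along its edge.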
...
Now we come to the main analysis section of this workflow, where we will identify differentially expressed genes. This will generate two main values per gene: the log2 fold change, which shows the change in expression level between two treatment groups for that gene, and the adjusted p-value, which indicates whether the gene is significantly differentially expressed (adjusted for multiple testing, to limit the number of false positives).
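The two values can be illustrated with a few made-up numbers in base R. This is a sketch only: the counts and p-values are hypothetical, and the Benjamini-Hochberg method is used here because it is the default adjustment method in DESeq2's `results()` function.

```r
# Hypothetical mean normalised counts for one gene in two groups
mean_treated <- 400
mean_control <- 100
# log2 fold change: 2 means a four-fold increase, -2 a four-fold decrease
log2_fold_change <- log2(mean_treated / mean_control)
print(log2_fold_change)

# Hypothetical raw p-values for five genes
p_values <- c(geneA = 0.001, geneB = 0.01, geneC = 0.04, geneD = 0.05, geneE = 0.5)
# Benjamini-Hochberg adjustment for multiple testing
p_adjusted <- p.adjust(p_values, method = "BH")
print(round(p_adjusted, 4))

# Genes passing a 0.05 adjusted p-value cutoff
significant <- names(p_adjusted)[p_adjusted < 0.05]
print(significant)
```

Note that geneC has a raw p-value below 0.05 but is no longer significant after adjustment, which is the reason the adjusted value is the one used for calling DE genes.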
...
The table of DE genes will have gene names according to the reference genome that was used to generate the count table. This is usually an Entrez gene ID if the NCBI genome was used, or an Ensembl gene ID if the Ensembl genome was used.
These Entrez/Ensembl gene IDs are identifier strings that aren’t particularly informative. They can be looked up individually to see what genes they represent, or we can annotate them all at the same time in R to match the Entrez/Ensembl gene IDs to their more commonly known gene symbol and description, using the code in this section.
We'll be using Bioconductor genome-wide annotation packages to provide the annotation data. https://bioconductor.org/packages/3.17/data/annotation/
...
This workshop uses mouse data, so we will be using the org.Mm.eg.db annotation package here.
Code Block |
---|
#### 6a. Annotating your DE genes ####
# Annotation packages for human (org.Hs.eg.db), mouse (org.Mm.eg.db), rat (org.Rn.eg.db), E. coli strain K12 (org.EcK12.eg.db), E. coli strain Sakai (org.EcSakai.eg.db), zebrafish (org.Dr.eg.db) and Drosophila (org.Dm.eg.db) were installed with the required R packages. If your species is not in the above list, contact the eResearch team.
## USER INPUT
# Input your species genome below (select from the above list)
my_genome <- org.Mm.eg.db
# Pull out just the Entrez/Ensembl gene IDs
gene_ids <- row.names(res)
# You can see the list of annotations you can apply to your data by: keytypes(my_genome)
# Annotate gene symbol and description to your DE gene IDs (you can add other keytypes from the above 'keytypes(my_genome)' list, if you choose)
cols <- c("SYMBOL", "GENENAME")
## USER INPUT
# Provide the gene ID type for your DE data
# If you have Ensembl IDs, enter "ENSEMBL", if you have Entrez IDs, enter "ENTREZID", if you have gene symbols, enter "SYMBOL"
idtype <- "ENSEMBL"
# Map the Entrez/Ensembl gene IDs to gene symbol and description
map <- AnnotationDbi::select(my_genome, keys=gene_ids, columns=cols, keytype=idtype)
# Combine the annotation data with the DE data
# Since we're matching Ensembl -> Entrez there isn't a 1:1 mapping, so we need to merge rather than cbind
annot <- merge(x = res, y = map, by.x = 0, by.y = idtype, all = FALSE)
# There isn't always a 1:1 mapping between gene identifiers, so we also need to remove duplicates
annot_nodups <- annot[!duplicated(annot$Row.names), ]
# Reorder by adjusted p
annot_nodups <- annot_nodups[order(annot_nodups$padj), ]
# Add normalised counts to the output table. This is so you can later plot expression trends for individual genes in R, Excel, etc.
# Need to normalise the counts first, by dividing by the size factors calculated by DESeq2 (in the 'deseq' object)
expdata_norm <- as.matrix(expdata) %*% diag(1 / deseq$sizeFactor)
colnames(expdata_norm) <- colnames(expdata)
annot_counts <- merge(x = annot_nodups, y = expdata_norm, by.x = "Row.names", by.y = 0, all = TRUE)
# Export the full annotated dataset as a csv (for journal submission)
# Creates a 'Tables' subdirectory where all tables will be output
dir.create("Tables", showWarnings = FALSE)
write.csv(annot_counts, file=paste0("./Tables/all_genes_", paste(degroups, collapse = "_Vs_"), ".csv"), row.names = FALSE)
# Pull out just significant genes (change from 0.05 to 0.01 if needed)
DE_genes <- subset(annot_counts, padj < 0.05, select=colnames(annot_counts))
# NOTE: add lfc cutoffs if needed. E.g., keep only genes with log2FoldChange > 1 or log2FoldChange < -1
# DE_genes <- subset(DE_genes, log2FoldChange > 1 | log2FoldChange < -1, select=colnames(DE_genes))
# Export the sig DE genes as a table (to be used in downstream analysis)
write.csv(DE_genes, file=paste0("./Tables/DE_genes_", paste(degroups, collapse = "_Vs_"), ".csv"), row.names = FALSE) |
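The merge-then-deduplicate step in the annotation code can be hard to picture, so here is a small base R sketch of the same pattern. The data frames `toy_res` and `toy_map`, the shortened gene IDs and the gene symbols are all hypothetical.

```r
# Hypothetical DE results, with gene IDs as row names
toy_res <- data.frame(
  log2FoldChange = c(2.1, -1.5, 0.3),
  padj = c(0.001, 0.03, 0.8),
  row.names = c("ENSMUSG01", "ENSMUSG02", "ENSMUSG03")
)
# Hypothetical annotation map: ENSMUSG02 maps to two symbols, ENSMUSG03 to none
toy_map <- data.frame(
  ENSEMBL = c("ENSMUSG01", "ENSMUSG02", "ENSMUSG02"),
  SYMBOL = c("GeneA", "GeneB", "GeneB2"),
  stringsAsFactors = FALSE
)

# by.x = 0 merges on the row names of x; they appear in a 'Row.names' column
toy_annot <- merge(x = toy_res, y = toy_map, by.x = 0, by.y = "ENSEMBL", all = FALSE)
# Keep only the first annotation row for each gene ID
toy_nodups <- toy_annot[!duplicated(toy_annot$Row.names), ]
# Order by adjusted p-value
toy_nodups <- toy_nodups[order(toy_nodups$padj), ]
print(toy_nodups)
```

With `all = FALSE`, a gene ID that has no entry in the annotation map (ENSMUSG03 here) is dropped from the merged table, so the annotated table can contain fewer genes than the DE results.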
...
This section plots a heatmap and dendrogram of DE gene expression per sample. Expression counts are scaled and centered so that groupwise relationships can be examined.
Code Block |
---|
#### 6c. DE genes heatmaps and dendrograms ####
# Make the row names gene symbols.
DE_genes <- na.omit(DE_genes)
row.names(DE_genes) <- make.unique(DE_genes$SYMBOL)
# Sort by adjusted p-value
DE_genes <- DE_genes[order(DE_genes$padj), ]
# Pull out normalised counts only
siggc <- DE_genes[colnames(DE_genes) %in% colnames(expdata)]
# Scale and center each row. This is important to visualise relative differences between groups and not have row-wise colouration dominated by high or low gene expression.
xts <- scale(t(siggc))
xtst <- t(xts)
# Define annotation column
annot_columns <- data.frame(meta$group[meta$group %in% degroups])
# Make the row names the sample IDs
row.names(annot_columns) <- meta$sample_ID[meta$group %in% degroups]
colnames(annot_columns) <- "Treatment groups"
# Need to factorise it
annot_columns[[1]] <- factor(annot_columns[[1]])
# Generate dendrogram and heatmap for ALL DE genes (output files are named after the DE comparison groups)
pheatmap(xtst, color=colorRampPalette(c("#D55E00", "white", "#0072B2"))(100), annotation_col=annot_columns, annotation_names_col = F, fontsize_col = 12, fontsize_row = 7, labels_row = row.names(siggc), show_rownames = F, filename = paste0("./figures/All_DEG_Heatmap_", paste(degroups, collapse = "_Vs_"), ".tiff"))
# Generate dendrogram and heatmap for top 20 DE genes
pheatmap(xtst[1:20,], color=colorRampPalette(c("#D55E00", "white", "#0072B2"))(100), annotation_col=annot_columns, annotation_names_col = F, fontsize_col = 14, fontsize_row = 12, labels_row = row.names(siggc)[1:20], show_rownames = T, filename = paste0("./figures/Top_DEG_Heatmap_", paste(degroups, collapse = "_Vs_"), ".tiff"))
# NOTE: you can plot more than 20 top genes by adjusting 'xtst[1:20,]' and 'row.names(siggc)[1:20]'. If you wanted to plot the top 50 genes you'd change these to 'xtst[1:50,]' and 'row.names(siggc)[1:50]' |
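The row scaling and centering used for the DE genes heatmap can be checked on a tiny example. The base R sketch below uses a made-up matrix `toy` with one low-count and one high-count gene; the values are hypothetical.

```r
# Hypothetical normalised counts: 2 genes (rows) x 4 samples (columns)
toy <- matrix(c(10, 12, 30, 32,
                1000, 1100, 400, 500),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("low_gene", "high_gene"), c("A1", "A2", "B1", "B2")))

# scale() works on columns, so transpose, scale, and transpose back
toy_scaled <- t(scale(t(toy)))
print(round(toy_scaled, 2))

# After scaling, every gene has mean 0 and standard deviation 1
print(round(rowMeans(toy_scaled), 10))
print(apply(toy_scaled, 1, sd))
```

Both genes now sit on the same scale, so the heatmap colours reflect each gene's relative change between groups rather than its absolute expression level.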