Table of Contents |
---|
Aim:
Identify statistically significant (FDR < 0.05) differentially expressed genes. Visualise results with PCA plots, heatmaps and volcano plots.
Requirements
Run your samples (FASTQ) using the nextflow nf-core/RNA-seq pipeline using ‘star_salmon’ (Task 3 session) or an alternative pipeline that generates feature counts.
Installing R and Rstudio
...
Table of Contents |
---|
Aim:
Identify statistically significant (FDR < 0.05) differentially expressed genes. Visualise results with PCA plots, heatmaps and volcano plots.
Requirements
Run your samples (FASTQ) using the nextflow nf-core/RNA-seq pipeline using ‘star_salmon’ (Task 3 session) or an alternative pipeline that generates feature counts.
Installing R and Rstudio
The analysis scripts in this guide are written in R script. We will be using RStudio, a front-end gui for R, to run the analysis scripts.
...
To access this count table:
Go to the W:\training\rnaseq\runs\run3_RNAseq\results folder that contains the results from running the nfcore/rnaseq pipeline. The output folders from task 3 look like this:
...
The count table can be found in the /results/star_salmon / folder. Let’s access the folder (i.e., cd /run3/results/star_salmon). A list of files and folders in the star_salmon folder will look like this:
...
a. First create a new folder in H:\workshop\RNAseq . Call it something suitable, such as ‘DE_analysis_workshop’
...
c. Copy the count table (the ‘salmon.merged.gene_counts.tsv
' file which you downloaded from the HPC) to the ) to the 'data’ folder you just created.
The count table looks like this:
Code Block |
---|
gene_id gene_name CD49fmNGFRm_rep1 CD49fmNGFRm_rep2 CD49fmNGFRm_rep3 CD49fpNGFRp_rep1 CD49fpNGFRp_rep2 CD49fpNGFRp_rep3 MTEC_rep1 MTEC_rep2 MTEC_rep3 ENSMUSG00000000001 Gnai3 2460 2395 2749 2686 3972 4419 7095 4484 6414 ENSMUSG00000000003 Pbsn 0 0 0 0 0 0 0 0 0 ENSMUSG00000000028 Cdc45 43 57 55 79 87.999 89 1241 830 1041.999 ENSMUSG00000000031 H19 2 0 1 17.082 24 16.077 200 139 145.604 ENSMUSG00000000037 Scml2 8 8 16 23 29.001 29 69 57 67 ENSMUSG00000000049 Apoh 1 0 2 1 2 0 0 0 2 |
...
SRR20622172 SRR20622173 SRR20622174 SRR20622175 SRR20622176 SRR20622177 SRR20622178 SRR20622179 SRR20622180
ENSMUSG00000000001 Gnai3 7086 4470 2457.002 2389 6398 2744 2681 3961 4399
ENSMUSG00000000003 Pbsn 0 0 0 0 0 0 0 0 0
ENSMUSG00000000028 Cdc45 1232.999 827 42 57 1036 55 78 88 89
ENSMUSG00000000031 H19 200 139 2 0 143.622 1 17.082 24 16.077
ENSMUSG00000000037 Scml2 70 57.001 8 8 66.999 16 23 27.999 29
ENSMUSG00000000049 Apoh 0 0 1 0 2 2 1 3 0
ENSMUSG00000000056 Narf 1933 1480 519 497 1730 539 365 458 536
ENSMUSG00000000058 Cav2 6008 3417 1347.001 1344 5482 1367 2669.001 4358 4365.832
ENSMUSG00000000078 Klf6 3809 2732 4413.001 3483.978 3559 4491 3209 3980 4626 |
d. In the same W:\training\rnaseq\runs\run3_RNAseq\results\star_salmon directory there will be a file called metadata.xlsx . Copy this file to your ‘data’ folder as well. This file will normally need to be manually created by you to match your sample IDs and treatment groups, but we created this file already for you to use. This samples table needs 3 columns called ‘sample_name’, containing the sample names seen in the count table (column names), ‘sample_ID’, which is the (less messy) names you want to call the samples in this analysis workflow, and ‘group’, which contains the treatment groups each sample belongs to. Sample metadataThe contents of this file look like this:
Code Block |
---|
sample_name sample_ID group CD49fmNGFRm_rep1SRR20622174 DC1 Differentiated_cells CD49fmNGFRm_rep2SRR20622175 DC2 Differentiated_cells CD49fmNGFRm_rep3SRR20622177 DC3 Differentiated_cells CD49fpNGFRp_rep1SRR20622178 BC1 Basal_cells CD49fpNGFRp_rep2SRR20622179 BC2 Basal_cells CD49fpNGFRp_rep3SRR20622180 BC3 Basal_cells MTEC_rep1SRR20622172 mTEC1 Murine_tracheal_epithelial_cell MTEC_rep2SRR20622173 mTEC2 Murine_tracheal_epithelial_cell MTEC_rep3SRR20622176 mTEC3 Murine_tracheal_epithelial_cell |
...
e. Open RStudio and create a new R script (‘File’ → “New File” → “R script”). Now hit ‘File’ → ‘Save’ and save the script in the analysis workshop folder you created in step a. (NOT IN THE ‘data’ FOLDER). Give the script file a name (e.g. DESEq2.R).
The following analysis contains R code (in the grey text boxes) that you can copy and paste, then run, into the R script you just created.
A copy of the full script is at /demo/run3_full_pipeline/
2. Install required R packages
...
We will be using the pretty heatmap package to accomplish this.
Code Block |
---|
#### 4b5b. Samples heatmap and dendrogram #### # This section plots a heatmap and dendrogram of pairwise relationships between samples. In this way you can see if samples cluster by treatment group. # See here: https://davetang.org/muse/2018/05/15/making-a-heatmap-in-r-with-the-pheatmap-package/ # Define annotation column annot_columns <- data.frame(meta$group[meta$group %in% plotgroups]) # Make the row names the sample IDs row.names(annot_columns) <- meta$sample_ID[meta$group %in% plotgroups] colnames(annot_columns) <- "Treatment groups" # Need to factorise it annot_columns[[1]] <- factor(annot_columns[[1]]) # Generate dendrogram and heatmap pheatmap(distance_matrix, color=colorRampPalette(c("white", "#9999FF", "#990000"))(50), cluster_rows = TRUE, show_rownames = TRUE, treeheight_row = 0, treeheight_col = 70, fontsize_col = 12, annotation_names_col = F, annotation_col = annot_columns, filename = paste0("./figures/Pairwise_sample_heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff")) # Notes about heatmap colours. # You can change the colours used in the heatmap itself by changing the colour names (color=colorRampPalette....) # If you want to change the annotation colours, see here: https://zhiganglu.com/post/pheatmap_change_annotation_colors/ |
...
This section plots a heatmap and dendrogram of DE gene expression per sample. Expression counts are scaled and centered so that groupwise relationships can be examined.
Code Block |
---|
#### 5c6c. DE genes heatmaps and dendrograms #### # Make the row names gene symbols. DE_genes <- na.omit(DE_genes) row.names(DE_genes) <- make.unique(DE_genes$SYMBOL) # sort by p-value DE_genes <- DE_genes[order(DE_genes$padj), ] # Pull out normalised counts only siggc <- DE_genes[colnames(DE_genes) %in% colnames(expdata)] # Scale and center each row. This is important to visualise relative differences between groups and not have row-wise colouration dominated by high or low gene expression. xts <- scale(t(siggc)) xtst <- t(xts) # Define annotation column annot_columns <- data.frame(meta$group[meta$group %in% degroups]) # Make the row names the sample IDs row.names(annot_columns) <- meta$sample_ID[meta$group %in% degroups] colnames(annot_columns) <- "Treatment groups" # Need to factorise it annot_columns[[1]] <- factor(annot_columns[[1]]) # Generate dendrogram and heatmap for ALL DE genes pheatmap(xtst, color=colorRampPalette(c("#D55E00", "white", "#0072B2"))(100), annotation_col=annot_columns, annotation_names_col = F, fontsize_col = 12, fontsize_row = 7, labels_row = row.names(siggc), show_rownames = F, filename = paste0("./figures/All_DEG_Heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff")) # Generate dendrogram and heatmap for top 20 DE genes pheatmap(xtst[1:20,], color=colorRampPalette(c("#D55E00", "white", "#0072B2"))(100), annotation_col=annot_columns, annotation_names_col = F, fontsize_col = 14, fontsize_row = 12, labels_row = row.names(siggc), show_rownames = T, filename = paste0("./figures/Top_DEG_Heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff")) # NOTE: you can plot more than 20 top genes by adjusting 'xtst[1:20,]'. If you wanted to plot the top 50 genes you'd change this to 'xtst[1:50,]' |