Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Table of Contents

Aim:

Identify statistically significant (FDR < 0.05) differentially expressed genes. Visualise results with PCA plots, heatmaps and volcano plots.

...

Go to https://rvdi.qut.edu.au/

Click on ‘VMware VMware Horizon HTML Access’Access

Log on with your QUT username and password

...

To access this count table:

Go to the/sandpit/demo/run3_full_pipeline/ W:\training\rnaseq\runs\run3_RNAseq\results folder that contains the results from running the nfcore/rnaseq pipeline. The output folders from task 3 look like this:

...

The count table can be found in the /results/star_salmon / folder. Let’s access the folder (i.e., cd /run3/results/star_salmon). A list of files and folders in the starA list of files and folders in the star_salmon folder will look like this:

...

The expression count file that we are interested is salmon.merged.gene_counts.tsv Let's see the content of the file by printing the top lines using the following command (in PuTTy):

Code Block
head salmon.merged.gene_counts.tsv

Differential Expression Analysis using DESeq2

the above command will print:

Code Block
gene_id gene_name       CD49fmNGFRm_rep1        CD49fmNGFRm_rep2        CD49fmNGFRm_rep3        CD49fpNGFRp_rep1        CD49fpNGFRp_rep2        CD49fpNGFRp_rep3     MTEC_rep1       MTEC_rep2       MTEC_rep3
ENSMUSG00000000001      Gnai3   2460    2395    2749    2686    3972    4419    7095    4484    6414
ENSMUSG00000000003      Pbsn    0       0       0       0       0       0       0       0       0
ENSMUSG00000000028      Cdc45   43      57      55      79      87.999  89      1241    830     1041.999
ENSMUSG00000000031      H19     2       0       1       17.082  24      16.077  200     139     145.604
ENSMUSG00000000037      Scml2   8       8       16      23      29.001  29      69      57      67
ENSMUSG00000000049      Apoh    1       0       2       1       2       0       0       0       2
ENSMUSG00000000056      Narf    522     496     539     368     457     538     1939    1483    1734
ENSMUSG00000000058      Cav2    1352.999        1349    1371.999        2684.001        4370    4386    6018.999        3429    5501
ENSMUSG00000000078      Klf6    4411    3492    4500    3221    3989    4637    3812    2741    3558

Now let’s copy the ‘salmon.merged.gene_counts.tsv’ file to your laptop/desktop using the file finder.

  • Windows - in the file finder type: \\hpc-fs\work\

  • Mac - open ‘Finder’ → press ‘command + k’ → type: ‘smb://hpc-fs/work/

Now let’s find the full path to the ‘salmon.merged.gene_counts.tsv’ file:

  • Windows:

  • Mac:

    • cd /folder/that/contains/feature_counts/

    • pwd

  • Rstudio:

    • Open Rstudio, go to the top bar a click on “Session” → “Select working directory: → “Choose directory

    • The path to the directory will be printed in Rstudio console, copy and paste in line 1 of the script ‘RNAseq_DESeq2_analysis.R’ (see below)

Differential Expression Analysis using DESeq2

We will now perform the following tasks using Rstudio:

  1. Preparing your data. Two data files needed for this analysis: a samples table and your count table

  2. Install required R packages (only need to run once) - after installation we only need to load the packages. NOTE: If using an rVDI virtual machine, the R packages are already installed

  3. Load required R packages. Unlike installing the packages, this needs to be done every time you run the analysis

  4. Import your data files (count table and samples table)into R

  5. Checking for outliers and batch effects

    1. PCA plot

    2. Pairwise samples heatmap

  6. Identify differentially expressed (DE) genes using DESeq2

    1. Annotating your DE genes

    2. Volcano plot

    3. DE genes heatmap

1. Preparing your data

You will need 2 data files to complete this analysis: your count table (see above) and a samples table.

a. First create a new folder, on your desktop, Documents, etc. Call it something informative, such as ‘DE analysis workshop’

b. Create a sub folder here called ‘data’. This is where your two data files will be stored

c. Copy the count table (the ‘salmon.merged.gene_counts.tsv' file which you downloaded from the HPC) to the 'data’ folder you just created

The count table looks like this:

Code Block
gene_id gene_name       CD49fmNGFRm_rep1        CD49fmNGFRm_rep2        CD49fmNGFRm_rep3        CD49fpNGFRp_rep1        CD49fpNGFRp_rep2        CD49fpNGFRp_rep3     MTEC_rep1       MTEC_rep2       MTEC_rep3
ENSMUSG00000000001      Gnai3   2460    2395    2749    2686    3972    4419    7095    4484    6414
ENSMUSG00000000003      Pbsn    0       0       0       0       0       0       0       0       0
ENSMUSG00000000028      Cdc45   43      57      55      79      87.999  89      1241    830     1041.999
ENSMUSG00000000031      H19     2       0       1       17.082  24      16.077  200     139     145.604
ENSMUSG00000000037      Scml2   8       8       16      23      29.001  29      69      57      67
ENSMUSG00000000049      Apoh    1       0       2       1       2       0       0       0       2

d. In Excel, create a samples table. Open a blank Excel document and copy and paste the sample metadata belowWe will now perform the following tasks using Rstudio:

  1. Preparing your data. Two data files needed for this analysis: a samples table and your count table

  2. Install required R packages (only need to run once) - after installation we only need to load the packages. NOTE: If using an rVDI virtual machine, the R packages are already installed

  3. Load required R packages. Unlike installing the packages, this needs to be done every time you run the analysis

  4. Import your data files (count table and samples table)into R

  5. Checking for outliers and batch effects

    1. PCA plot

    2. Pairwise samples heatmap

  6. Identify differentially expressed (DE) genes using DESeq2

    1. Annotating your DE genes

    2. Volcano plot

    3. DE genes heatmap

1. Preparing your data

You will need 2 data files to complete this analysis: your count table (see above) and a samples table.

a. First create a new folder in H:\workshop\RNAseq . Call it something suitable, such as ‘DE_analysis_workshop’

b. Create a sub folder here called ‘data’. This is where your two data files will be stored

c. Copy the count table (the ‘salmon.merged.gene_counts.tsv' file) to the 'data’ folder you just created.

The count table looks like this:

Code Block
gene_id	gene_name	SRR20622172	SRR20622173	SRR20622174	SRR20622175	SRR20622176	SRR20622177	SRR20622178	SRR20622179	SRR20622180
ENSMUSG00000000001	Gnai3	7086	4470	2457.002	2389	6398	2744	2681	3961	4399
ENSMUSG00000000003	Pbsn	0	0	0	0	0	0	0	0	0
ENSMUSG00000000028	Cdc45	1232.999	827	42	57	1036	55	78	88	89
ENSMUSG00000000031	H19	200	139	2	0	143.622	1	17.082	24	16.077
ENSMUSG00000000037	Scml2	70	57.001	8	8	66.999	16	23	27.999	29
ENSMUSG00000000049	Apoh	0	0	1	0	2	2	1	3	0
ENSMUSG00000000056	Narf	1933	1480	519	497	1730	539	365	458	536
ENSMUSG00000000058	Cav2	6008	3417	1347.001	1344	5482	1367	2669.001	4358	4365.832
ENSMUSG00000000078	Klf6	3809	2732	4413.001	3483.978	3559	4491	3209	3980	4626

d. In the same W:\training\rnaseq\runs\run3_RNAseq\results\star_salmon directory there will be a file called metadata.xlsx . Copy this file to your ‘data’ folder as well. This file will normally need to be manually created by you to match your sample IDs and treatment groups, but we created this file already for you to use. This samples table needs 3 columns called ‘sample_name’, containing the sample names seen in the count table (column names), ‘sample_ID’, which is the (less messy) names you want to call the samples in this analysis workflow, and ‘group’, which contains the treatment groups each sample belongs to. Sample metadataThe contents of this file look like this:

Code Block
sample_name	sample_ID	group
CD49fmNGFRm_rep1SRR20622174	DC1	 Differentiated_cells
CD49fmNGFRm_rep2SRR20622175	DC2	Differentiated_cells
CD49fmNGFRm_rep3SRR20622177	DC3	Differentiated_cells
CD49fpNGFRp_rep1SRR20622178	BC1	Basal_cells
CD49fpNGFRp_rep2SRR20622179	BC2	Basal_cells
CD49fpNGFRp_rep3SRR20622180	BC3	Basal_cells
MTEC_rep1SRR20622172	mTEC1	Murine_tracheal_epithelial_cell
MTEC_rep2SRR20622173	mTEC2	Murine_tracheal_epithelial_cell
MTEC_rep3SRR20622176	mTEC3	Murine_tracheal_epithelial_cell

...

e. Open RStudio and create a new R script (‘File’ → “New File” → “R script”). Now hit ‘File’ → ‘Save’ and save the script in the analysis workshop folder you created in step a. (NOT IN THE ‘data’ FOLDER). Give the script file a name (e.g. DESEq2.R).

The following analysis contains R code (in the grey text boxes) that you can copy and paste, then run, into the R script you just created.A copy of the full script is at /demo/run3_full_pipeline/

2. Install required R packages

...

We will be using the pretty heatmap package to accomplish this.

Code Block
#### 4b5b. Samples heatmap and dendrogram ####

# This section plots a heatmap and dendrogram of pairwise relationships between samples. In this way you can see if samples cluster by treatment group.

# See here: https://davetang.org/muse/2018/05/15/making-a-heatmap-in-r-with-the-pheatmap-package/

# Define annotation column
annot_columns <- data.frame(meta$group[meta$group %in% plotgroups])
# Make the row names the sample IDs
row.names(annot_columns) <- meta$sample_ID[meta$group %in% plotgroups]
colnames(annot_columns) <- "Treatment groups"
# Need to factorise it
annot_columns[[1]] <- factor(annot_columns[[1]])

# Generate dendrogram and heatmap
pheatmap(distance_matrix, color=colorRampPalette(c("white", "#9999FF", "#990000"))(50), cluster_rows = TRUE, show_rownames = TRUE, treeheight_row = 0, treeheight_col = 70, fontsize_col = 12, annotation_names_col = F, annotation_col = annot_columns, filename = paste0("./figures/Pairwise_sample_heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff"))

# Notes about heatmap colours.
# You can change the colours used in the heatmap itself by changing the colour names (color=colorRampPalette....)
# If you want to change the annotation colours, see here: https://zhiganglu.com/post/pheatmap_change_annotation_colors/

...

This section plots a heatmap and dendrogram of DE gene expression per sample. Expression counts are scaled and centered so that groupwise relationships can be examined.

Code Block
#### 5c6c. DE genes heatmaps and dendrograms ####

# Make the row names gene symbols.
DE_genes <- na.omit(DE_genes)
row.names(DE_genes) <- make.unique(DE_genes$SYMBOL)

# sort by p-value
DE_genes <- DE_genes[order(DE_genes$padj), ]

# Pull out normalised counts only
siggc <- DE_genes[colnames(DE_genes) %in% colnames(expdata)]

# Scale and center each row. This is important to visualise relative differences between groups and not have row-wise colouration dominated by high or low gene expression.
xts <- scale(t(siggc))
xtst <- t(xts)

# Define annotation column
annot_columns <- data.frame(meta$group[meta$group %in% degroups])
# Make the row names the sample IDs
row.names(annot_columns) <- meta$sample_ID[meta$group %in% degroups]
colnames(annot_columns) <- "Treatment groups"
# Need to factorise it
annot_columns[[1]] <- factor(annot_columns[[1]])

# Generate dendrogram and heatmap for ALL DE genes
pheatmap(xtst, color=colorRampPalette(c("#D55E00", "white", "#0072B2"))(100), annotation_col=annot_columns, annotation_names_col = F, fontsize_col = 12, fontsize_row = 7, labels_row = row.names(siggc), show_rownames = F, filename = paste0("./figures/All_DEG_Heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff"))

# Generate dendrogram and heatmap for top 20 DE genes
pheatmap(xtst[1:20,], color=colorRampPalette(c("#D55E00", "white", "#0072B2"))(100), annotation_col=annot_columns, annotation_names_col = F, fontsize_col = 14, fontsize_row = 12, labels_row = row.names(siggc), show_rownames = T, filename = paste0("./figures/Top_DEG_Heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff"))
# NOTE: you can plot more than 20 top genes by adjusting 'xtst[1:20,]'. If you wanted to plot the top 50 genes you'd change this to 'xtst[1:50,]'