eResearch - Session 3 : Differential expression analysis using R

 

Aim:

Identify statistically significant (FDR < 0.05) differentially expressed genes. Visualise results with PCA plots, heatmaps and volcano plots.

Requirements

Run your samples (FASTQ) using the nextflow nf-core/RNA-seq pipeline using ‘star_salmon’ (Task 3 session) or an alternative pipeline that generates feature counts.

Installing R and Rstudio

The analysis scripts in this guide are written in R script. We will be using RStudio, a front-end gui for R, to run the analysis scripts.

You have three main options for running this analysis in RStudio:

  1. Use QUTs rVDI virtual desktop machines

  2. Install R and RStudio on your own PC

  3. Use the provided PCs in the QUT computer labs

 

Option1: Use QUTs rVDI virtual desktop machines

This is the preferred method, as R and RStudio are already installed, as are all the required R packages needed for analysis. Installing all of these can take over 30 minutes on your own PC, so using an rVDI machine saves time.

rVDI provides a virtual Windows desktop that can be run in your web browser.

To access and run an rVDI virtual desktop:

Go to https://rvdi.qut.edu.au/

Click on ‘VMware Horizon HTML Access

Log on with your QUT username and password

*NOTE: you need to be connected to the QUT network first, either being on campus or connecting remotely via VPN.

 

Option2: Install R and RStudio on your own PC

Go to the following page Posit and follow the instructions provided to install first R and then Rstudio.

Download and install R, following the default prompts:

Download R-4.4.2 for Windows. The R-project for statistical computing.

Download and install RStudio, following the default prompts:

https://posit.co/download/rstudio-desktop/

 

Option3: Use the provided PCs in the QUT computer labs

The PCs in the computer labs already have R and RStudio installed. If using this option, you will need to install the required R packages (unlike rVDI). The code for installing these packages is in the analysis section below.

 

Download your count table

The basis for the differential expression analysis is a count table of sequence reads mapped to defined gene regions per sample. There are a variety of methods to generate this count table, but for this exercise, we will be using the output from the Nextflow nfcore/rnaseq analysis you completed in the previous workshop sessions.

To access this count table:

Go to the W:\training\rnaseq\runs\run3_RNAseq\results folder that contains the results from running the nfcore/rnaseq pipeline. The output folders from task 3 look like this:

├── results │   ├── fastqc │   ├── multiqc │   ├── pipeline_info │   ├── star_salmon │   └── trimgalore

The count table can be found in the star_salmon folder. A list of files and folders in the star_salmon folder will look like this:

. ├── bigwig ├── CD49fmNGFRm_rep1 ├── CD49fmNGFRm_rep1.markdup.sorted.bam ├── CD49fmNGFRm_rep1.markdup.sorted.bam.bai ├── CD49fmNGFRm_rep2 ├── CD49fmNGFRm_rep2.markdup.sorted.bam ├── CD49fmNGFRm_rep2.markdup.sorted.bam.bai ├── CD49fmNGFRm_rep3 ├── CD49fmNGFRm_rep3.markdup.sorted.bam ├── CD49fmNGFRm_rep3.markdup.sorted.bam.bai ├── CD49fpNGFRp_rep1 ├── CD49fpNGFRp_rep1.markdup.sorted.bam ├── CD49fpNGFRp_rep1.markdup.sorted.bam.bai ├── CD49fpNGFRp_rep2 ├── CD49fpNGFRp_rep2.markdup.sorted.bam ├── CD49fpNGFRp_rep2.markdup.sorted.bam.bai ├── CD49fpNGFRp_rep3 ├── CD49fpNGFRp_rep3.markdup.sorted.bam ├── CD49fpNGFRp_rep3.markdup.sorted.bam.bai ├── deseq2_qc ├── dupradar ├── featurecounts ├── log ├── MTEC_rep1 ├── MTEC_rep1.markdup.sorted.bam ├── MTEC_rep1.markdup.sorted.bam.bai ├── MTEC_rep2 ├── MTEC_rep2.markdup.sorted.bam ├── MTEC_rep2.markdup.sorted.bam.bai ├── MTEC_rep3 ├── MTEC_rep3.markdup.sorted.bam ├── MTEC_rep3.markdup.sorted.bam.bai ├── picard_metrics ├── qualimap ├── rseqc ├── salmon.merged.gene_counts_length_scaled.rds ├── salmon.merged.gene_counts_length_scaled.tsv ├── salmon.merged.gene_counts.rds ├── salmon.merged.gene_counts_scaled.rds ├── salmon.merged.gene_counts_scaled.tsv ├── salmon.merged.gene_counts.tsv <---- We will use this feature counts file for DESeq2 expression analysis. ├── salmon.merged.gene_tpm.tsv ├── salmon.merged.transcript_counts.rds ├── salmon.merged.transcript_counts.tsv ├── salmon.merged.transcript_tpm.tsv ├── salmon_tx2gene.tsv ├── samtools_stats └── stringtie

The expression count file that we are interested is salmon.merged.gene_counts.tsv

head salmon.merged.gene_counts.tsv

Differential Expression Analysis using DESeq2

We will now perform the following tasks using Rstudio:

  1. Preparing your data. Two data files are needed for this analysis: a samples table and your count table

  2. Install required R packages (only need to run once) - after installation, we only need to load the packages. NOTE: If using an rVDI virtual machine, the R packages are already installed

  3. Load required R packages. Unlike installing the packages, this needs to be done every time you run the analysis

  4. Import your data files (count table and samples table)into R

  5. Checking for outliers and batch effects

    1. PCA plot

    2. Pairwise samples heatmap

  6. Identify differentially expressed (DE) genes using DESeq2

    1. Annotating your DE genes

    2. Volcano plot

    3. DE genes heatmap

 

1. Preparing your data

You will need 2 data files to complete this analysis: your count table (see above) and a samples table.

a. First create a new folder in H:\workshop\RNAseq . Call it something suitable, such as ‘DE_analysis_workshop’

 

b. Create a subfolder here called ‘data’. This is where your two data files will be stored

 

c. Copy the count table (the ‘salmon.merged.gene_counts.tsv' file) to the 'data’ folder you just created.

The count table looks like this:

 

d. In the same W:\training\rnaseq\runs\run3_RNAseq\results\star_salmon directory there will be a file called metadata.xlsx . Copy this file to your ‘data’ folder as well. This file will normally need to be manually created by you to match your sample IDs and treatment groups, but we created this file already for you to use. This samples table needs 3 columns called ‘sample_name’, containing the sample names seen in the count table (column names), ‘sample_ID’, which is the (less messy) names you want to call the samples in this analysis workflow, and ‘group’, which contains the treatment groups each sample belongs to. The contents of this file look like this:

 

e. Open RStudio and create a new R script (‘File’ → “New File” → “R script”). Now hit ‘File’ → ‘Save’ and save the script in the analysis workshop folder you created in step a. (NOT IN THE ‘data’ FOLDER). Give the script file a name (e.g. DESEq2.R).

 

The following analysis contains R code (in the grey text boxes) that you can copy and paste, then run, into the R script you just created.

2. Install required R packages

Copy and paste the following code into the R script you just created, then run the code (highlight all the code in your R script, then press the run button). This will install all the required packages and dependencies and may take 30 minutes or more to complete. It may prompt you occasionally to update packages - select 'a' for all if/when this occurs.

NOTE: you only need to run this section once on any laptop/PC, and you don’t need to run it if you’re using an rVDI machine.

 

 

3. Load required R packages

This section loads the packages you’ve installed in the previous section. Unlike installing packages, this needs to be run every time and should only take a few seconds to run.

 

 

4. Import your data files into R

In this section, we will import your count table and samples table into R.

You’ll need to change the ‘setwd' line to your working directory. Click ‘Session’ → ‘Set working directory’ → ‘Choose working directory’ and then choose the analysis workshop directory you created previously that contains your R script file and the 'Data’ directory.

 

 

5. Checking for outliers and batch effects

In this section, we will create PCA plots and heatmaps to examine the relationships between samples. Outlier samples and batch effects can heavily bias your results and should be addressed (e.g. removal of outlier samples from the dataset) before any differential expression analysis is completed.

First, we need to prepare the data for plotting. Copy, paste, and run the following code in your R script. you will need to input which treatment groups you wish to plot (plotgroups <-) from the set of available treatment groups (which you can find out with unique(meta$group).

 

5a. PCA plot

Now you can run the following code in your R script to generate the PCA plot.

We will be using the factoextra package to generate both the PCA data and the PCA plot.

 

5b. Pairwise samples heatmap

Now generate the heatmap and dendrogram.

We will be using the pretty heatmap package to accomplish this.

 

6. Identify differentially expressed (DE) genes using DESeq2

Now we come to the main analysis section of this workflow, where we will identify differentially expressed genes. This will generate two important datapoints per gene, the log fold change, which shows the change in expression levels between two treatment groups for a specific gene, and the adjusted p value, which shows which genes are significantly differentially expressed (adjusted to remove false positives).

The R package we use to find DE genes is DESeq2, “Differential gene expression analysis based on the negative binomial distribution”.

DESeq2 estimates variance-mean dependence in count data from high-throughput sequencing assays and tests for differential expression based on a model using the negative binomial distribution.

Copy and run the below code into your R script.

You’ll need to choose two treatment groups you want to compare (degroups <-), from the list of treatment groups in your samples table.

 

6a. Annotating your DE genes

The table of DE genes will have gene names according to the reference genome that was used to generate the count table. This is usually an Entrez gene ID if the NCBI genome was used, or an Ensemble gene ID if the Ensemble genome was used.

These Entrez/Ensembl gene IDs are a string of numbers that aren’t particularly informative. They can be individually looked to see what genes they represent, or we can annotate them all at the same time in R to match the Entrez/Ensembl gene IDs to their more commonly known gene symbol and description.

We'll be using bioconductor genome wide annotation packages to provide the annotation data. https://bioconductor.org/packages/3.17/data/annotation/

Annotation packages for human (org.Hs.eg.db), mouse (org.Mm.eg.db), rat (org.Rn.eg.db), E. coli strain K12 (org.EcK12.eg.db), E. coli strain Sakai (org.EcSakai.eg.db), zebrafish (org.Dr.eg.db) and Drosophila (org.Dm.eg.db) were installed with the required R packages. If your species is not in the above list, contact the eResearch team.

This workshop uses mouse data, so we will be using the org.Mm.eg.db annotation package here.

 

 

6b. Volcano plot

Now that we have our list of DE genes, we can visualise them with a volcano plot and heat map.

First, a volcano plot.

We’re using the EnhancedVolcano package ‘Publication-ready volcano plots with enhanced colouring and labeling’

The below plot only labels the top 20 DE genes, using mostly plot defaults. You can modify this code to label whatever genes you like, change plot colours, change axis labels, change log fold change cutoffs, etc, by following the EnhancedVolcano guide here:https://bioconductor.org/packages/release/bioc/html/EnhancedVolcano.html

 

6c. DE genes heatmap

This section plots a heatmap and dendrogram of DE gene expression per sample. Expression counts are scaled and centered so that groupwise relationships can be examined.