Aim:
Identify statistically significant (FDR < 0.05) differentially expressed genes. Visualise results with PCA plots, heatmaps and volcano plots.
Requirements
Process your samples (FASTQ) with the Nextflow nf-core/rnaseq pipeline using the ‘star_salmon’ option (Task 3 session), or with an alternative pipeline that generates feature counts.
Installing R and RStudio
The analysis scripts in this guide are written in R. We will use RStudio, a graphical front end (GUI) for R, to run them.
You have three main options for running this analysis in RStudio:
Use QUT's rVDI virtual desktop machines
Install R and RStudio on your own PC
Use the provided PCs in the QUT computer labs
Option 1: Use QUT's rVDI virtual desktop machines
This is the preferred method, because R, RStudio and all the R packages required for the analysis are already installed. Installing these on your own PC can take over 30 minutes, so using an rVDI machine saves time.
rVDI provides a virtual Windows desktop that can be run in your web browser.
To access and run an rVDI virtual desktop:
Go to https://rvdi.qut.edu.au/
Click on ‘VMware Horizon HTML Access’
Log on with your QUT username and password
*NOTE: you must be connected to the QUT network first, either on campus or remotely via VPN.
Option 2: Install R and RStudio on your own PC
Go to https://posit.co/download/rstudio-desktop/ and follow the instructions provided to install R first and then RStudio.
Download and install R, following the default prompts:
https://cran.r-project.org/bin/windows/base/
Download and install RStudio, following the default prompts:
https://posit.co/download/rstudio-desktop/
Option 3: Use the provided PCs in the QUT computer labs
The PCs in the computer labs already have R and RStudio installed. If using this option, you will need to install the required R packages (unlike rVDI). The code for installing these packages is in the analysis section below.
Download your count table
The basis for the differential expression analysis is a count table of sequence reads mapped to defined gene regions per sample. There are a variety of methods to generate this count table, but for this exercise we will use the output from the Nextflow nf-core/rnaseq analysis you completed in the previous workshop sessions.
To access this count table:
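Go to the W:\training\rnaseq\runs\run3_RNAseq\results folder, which contains the results from running the nf-core/rnaseq pipeline. The output folders from Task 3 look like this: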
Code Block |
---|
├── results
│ ├── fastqc
│ ├── multiqc
│ ├── pipeline_info
│ ├── star_salmon
│ └── trimgalore |
The count table can be found in the /results/star_salmon/ folder. Let’s move into that folder (for example, cd /run3/results/star_salmon). The list of files and folders in the star_salmon folder will look like this:
Code Block |
---|
.
├── bigwig
├── CD49fmNGFRm_rep1
├── CD49fmNGFRm_rep1.markdup.sorted.bam
├── CD49fmNGFRm_rep1.markdup.sorted.bam.bai
├── CD49fmNGFRm_rep2
├── CD49fmNGFRm_rep2.markdup.sorted.bam
├── CD49fmNGFRm_rep2.markdup.sorted.bam.bai
├── CD49fmNGFRm_rep3
├── CD49fmNGFRm_rep3.markdup.sorted.bam
├── CD49fmNGFRm_rep3.markdup.sorted.bam.bai
├── CD49fpNGFRp_rep1
├── CD49fpNGFRp_rep1.markdup.sorted.bam
├── CD49fpNGFRp_rep1.markdup.sorted.bam.bai
├── CD49fpNGFRp_rep2
├── CD49fpNGFRp_rep2.markdup.sorted.bam
├── CD49fpNGFRp_rep2.markdup.sorted.bam.bai
├── CD49fpNGFRp_rep3
├── CD49fpNGFRp_rep3.markdup.sorted.bam
├── CD49fpNGFRp_rep3.markdup.sorted.bam.bai
├── deseq2_qc
├── dupradar
├── featurecounts
├── log
├── MTEC_rep1
├── MTEC_rep1.markdup.sorted.bam
├── MTEC_rep1.markdup.sorted.bam.bai
├── MTEC_rep2
├── MTEC_rep2.markdup.sorted.bam
├── MTEC_rep2.markdup.sorted.bam.bai
├── MTEC_rep3
├── MTEC_rep3.markdup.sorted.bam
├── MTEC_rep3.markdup.sorted.bam.bai
├── picard_metrics
├── qualimap
├── rseqc
├── salmon.merged.gene_counts_length_scaled.rds
├── salmon.merged.gene_counts_length_scaled.tsv
├── salmon.merged.gene_counts.rds
├── salmon.merged.gene_counts_scaled.rds
├── salmon.merged.gene_counts_scaled.tsv
├── salmon.merged.gene_counts.tsv <---- We will use this feature counts file for DESeq2 expression analysis.
├── salmon.merged.gene_tpm.tsv
├── salmon.merged.transcript_counts.rds
├── salmon.merged.transcript_counts.tsv
├── salmon.merged.transcript_tpm.tsv
├── salmon_tx2gene.tsv
├── samtools_stats
└── stringtie |
The expression count file that we are interested in is salmon.merged.gene_counts.tsv
Let's look at the content of the file by printing its first lines with the following command (in PuTTY):
Code Block |
---|
head salmon.merged.gene_counts.tsv |
The above command will print:
Code Block |
---|
gene_id gene_name CD49fmNGFRm_rep1 CD49fmNGFRm_rep2 CD49fmNGFRm_rep3 CD49fpNGFRp_rep1 CD49fpNGFRp_rep2 CD49fpNGFRp_rep3 MTEC_rep1 MTEC_rep2 MTEC_rep3
ENSMUSG00000000001 Gnai3 2460 2395 2749 2686 3972 4419 7095 4484 6414
ENSMUSG00000000003 Pbsn 0 0 0 0 0 0 0 0 0
ENSMUSG00000000028 Cdc45 43 57 55 79 87.999 89 1241 830 1041.999
ENSMUSG00000000031 H19 2 0 1 17.082 24 16.077 200 139 145.604
ENSMUSG00000000037 Scml2 8 8 16 23 29.001 29 69 57 67
ENSMUSG00000000049 Apoh 1 0 2 1 2 0 0 0 2
ENSMUSG00000000056 Narf 522 496 539 368 457 538 1939 1483 1734
ENSMUSG00000000058 Cav2 1352.999 1349 1371.999 2684.001 4370 4386 6018.999 3429 5501
ENSMUSG00000000078 Klf6 4411 3492 4500 3221 3989 4637 3812 2741 3558 |
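Before copying the file, it can help to confirm that the table has the expected shape. The following is a minimal sketch, not part of the workshop scripts: it assumes the file is tab-separated with gene_id and gene_name as the first two columns (as in the output above), and the function names are hypothetical helpers introduced here for illustration. The demo table is a cut-down copy of the first rows shown above, with only two of the sample columns.

```shell
#!/usr/bin/env bash
# Sketch: summarise a tab-separated count table.
# Assumption: columns 1-2 are gene_id and gene_name; all remaining columns are samples.

count_samples() {
    # Number of header columns minus the two annotation columns.
    head -n 1 "$1" | awk -F '\t' '{ print NF - 2 }'
}

count_genes() {
    # Number of rows, not counting the header line.
    awk 'END { print NR - 1 }' "$1"
}

count_all_zero_genes() {
    # Genes whose counts sum to zero across every sample column.
    awk -F '\t' 'NR > 1 { total = 0; for (i = 3; i <= NF; i++) total += $i; if (total == 0) n++ } END { print n + 0 }' "$1"
}

# Cut-down demo table so the sketch runs without the real file.
demo="$(mktemp)"
printf 'gene_id\tgene_name\tCD49fmNGFRm_rep1\tCD49fmNGFRm_rep2\n' > "$demo"
printf 'ENSMUSG00000000001\tGnai3\t2460\t2395\n' >> "$demo"
printf 'ENSMUSG00000000003\tPbsn\t0\t0\n' >> "$demo"
printf 'ENSMUSG00000000028\tCdc45\t43\t57\n' >> "$demo"

echo "samples: $(count_samples "$demo")"
echo "genes: $(count_genes "$demo")"
echo "genes with zero counts in all samples: $(count_all_zero_genes "$demo")"
```

To run this on the real table, replace "$demo" with salmon.merged.gene_counts.tsv from inside the star_salmon folder. For the dataset shown above, the sample count should match the nine sample columns in the header.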
Now let’s copy the ‘salmon.merged.gene_counts.tsv’ file to your laptop/desktop using the file finder.
Now let’s find the full path to the ‘salmon.merged.gene_counts.tsv’ file:
...
Windows:
...
Mac:
cd /folder/that/contains/feature_counts/
pwd
RStudio:
...
Open RStudio, go to the top menu bar and click “Session” → “Set Working Directory” → “Choose Directory…”
...
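For the Mac (or Linux) terminal route, the cd and pwd commands above can be combined into one step that prints the full path of the file itself. This is a sketch only: full_path is a hypothetical helper name, and the temporary folder stands in for wherever you copied the count table.

```shell
#!/usr/bin/env bash
# Sketch: print the absolute path of a file by combining cd and pwd.

full_path() {
    # Change into the file's folder inside a subshell (so the current
    # directory of the calling shell is untouched), then join the
    # folder path from pwd with the file name.
    ( cd "$(dirname "$1")" && printf '%s/%s\n' "$(pwd)" "$(basename "$1")" )
}

# Demo: a temporary folder standing in for the folder holding the count table.
demo_dir="$(mktemp -d)"
touch "$demo_dir/salmon.merged.gene_counts.tsv"

full_path "$demo_dir/salmon.merged.gene_counts.tsv"
```

The printed path can then be pasted wherever the full path to the count table is needed.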
Differential Expression Analysis using DESeq2
...
a. First, create a new folder in H:\workshop\RNAseq. Call it something suitable, such as ‘DE_analysis_workshop’.
b. Create a subfolder here called ‘data’. This is where your two data files will be stored.
...
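If you are working on a Mac or Linux machine rather than on the Windows H: drive, the folder layout from steps a and b can also be created from a terminal. The following is a minimal sketch under that assumption: make_project is a hypothetical helper, and the temporary directory stands in for whichever parent folder you choose.

```shell
#!/usr/bin/env bash
# Sketch: create the analysis folder and its 'data' subfolder in one step.

make_project() {
    # $1 = parent folder, $2 = analysis folder name.
    # -p creates any missing parent folders and does not fail if they exist.
    mkdir -p "$1/$2/data"
}

# Demo: a temporary directory stands in for the chosen parent folder.
base="$(mktemp -d)"
make_project "$base" "DE_analysis_workshop"

# List the analysis folder to confirm the 'data' subfolder exists.
ls "$base/DE_analysis_workshop"
```

Running the function a second time with the same arguments is harmless, because mkdir -p leaves existing folders in place.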