Aim:
Identify statistically significant (FDR < 0.05) differentially expressed genes.
Requirements
Run your samples (FASTQ) using the nextflow nf-core/RNA-seq pipeline using ‘star_salmon’ (Task 3 session) or an alternative pipeline that generates feature counts.
Installing R and Rstudio
You will need to have both R and Studio installed in your laptop/desktop for today’s exercises.
Go to the following page https://posit.co/download/rstudio-desktop/ and follow instructions provided to install first R and then Rstudio.
Install R:
https://cran.r-project.org/bin/windows/base/
install Rstudio:
https://posit.co/download/rstudio-desktop/
RNAseq gene expression feature counts
Go to the /sandpit/demo/run3_full_pipeline/ folder that contains the results for running the complete RNAseq pipeline. The output folders from task 3 look like this:
├── results │ ├── fastqc │ ├── multiqc │ ├── pipeline_info │ ├── star_salmon │ └── trimgalore
The gene expression feature counts can be found in the /results/star_salmon/ folder. Let’s access the folder (i.e., cd /run3/results/star_salmon). A list of files and folders in the star_salmon folder will look like this:
. ├── bigwig ├── CD49fmNGFRm_rep1 ├── CD49fmNGFRm_rep1.markdup.sorted.bam ├── CD49fmNGFRm_rep1.markdup.sorted.bam.bai ├── CD49fmNGFRm_rep2 ├── CD49fmNGFRm_rep2.markdup.sorted.bam ├── CD49fmNGFRm_rep2.markdup.sorted.bam.bai ├── CD49fmNGFRm_rep3 ├── CD49fmNGFRm_rep3.markdup.sorted.bam ├── CD49fmNGFRm_rep3.markdup.sorted.bam.bai ├── CD49fpNGFRp_rep1 ├── CD49fpNGFRp_rep1.markdup.sorted.bam ├── CD49fpNGFRp_rep1.markdup.sorted.bam.bai ├── CD49fpNGFRp_rep2 ├── CD49fpNGFRp_rep2.markdup.sorted.bam ├── CD49fpNGFRp_rep2.markdup.sorted.bam.bai ├── CD49fpNGFRp_rep3 ├── CD49fpNGFRp_rep3.markdup.sorted.bam ├── CD49fpNGFRp_rep3.markdup.sorted.bam.bai ├── deseq2_qc ├── dupradar ├── featurecounts ├── log ├── MTEC_rep1 ├── MTEC_rep1.markdup.sorted.bam ├── MTEC_rep1.markdup.sorted.bam.bai ├── MTEC_rep2 ├── MTEC_rep2.markdup.sorted.bam ├── MTEC_rep2.markdup.sorted.bam.bai ├── MTEC_rep3 ├── MTEC_rep3.markdup.sorted.bam ├── MTEC_rep3.markdup.sorted.bam.bai ├── picard_metrics ├── qualimap ├── rseqc ├── salmon.merged.gene_counts_length_scaled.rds ├── salmon.merged.gene_counts_length_scaled.tsv ├── salmon.merged.gene_counts.rds ├── salmon.merged.gene_counts_scaled.rds ├── salmon.merged.gene_counts_scaled.tsv ├── salmon.merged.gene_counts.tsv <---- We will use this feature counts file for DESeq2 expression analysis. ├── salmon.merged.gene_tpm.tsv ├── salmon.merged.transcript_counts.rds ├── salmon.merged.transcript_counts.tsv ├── salmon.merged.transcript_tpm.tsv ├── salmon_tx2gene.tsv ├── samtools_stats └── stringtie
The expression count file that we are interested is salmon.merged.gene_counts.tsv
Let's see the content of the file by printing the top lines using the following command:
head salmon.merged.gene_counts.tsv
the above command will print:
gene_id gene_name CD49fmNGFRm_rep1 CD49fmNGFRm_rep2 CD49fmNGFRm_rep3 CD49fpNGFRp_rep1 CD49fpNGFRp_rep2 CD49fpNGFRp_rep3 MTEC_rep1 MTEC_rep2 MTEC_rep3 ENSMUSG00000000001 Gnai3 2460 2395 2749 2686 3972 4419 7095 4484 6414 ENSMUSG00000000003 Pbsn 0 0 0 0 0 0 0 0 0 ENSMUSG00000000028 Cdc45 43 57 55 79 87.999 89 1241 830 1041.999 ENSMUSG00000000031 H19 2 0 1 17.082 24 16.077 200 139 145.604 ENSMUSG00000000037 Scml2 8 8 16 23 29.001 29 69 57 67 ENSMUSG00000000049 Apoh 1 0 2 1 2 0 0 0 2 ENSMUSG00000000056 Narf 522 496 539 368 457 538 1939 1483 1734 ENSMUSG00000000058 Cav2 1352.999 1349 1371.999 2684.001 4370 4386 6018.999 3429 5501 ENSMUSG00000000078 Klf6 4411 3492 4500 3221 3989 4637 3812 2741 3558
Now let’s copy the ‘salmon.merged.gene_counts.tsv’ file to your laptop/desktop using the file finder.
|
---|
Now let’s find the full path to the ‘salmon.merged.gene_counts.tsv’ file:
|
---|
Differential Expression Analysis using DESeq2
We will now perform the following tasks using Rstudio:
install R package manager
install R packages (only needed once) - After installation we only need to load the packages
set working directory (directory where the gene counts file is located)
run DESeq2 analysis, export results and generate plots
Open R Studio and if not prompted automatically, create a R script file as follows:
Open Rstudio → click on ‘File’ in the top bar → select “New File” → select “R script”
Install R package manager
We will use Bioconductor to install the “BiocManager”. Copy and paste the following code into the R studio script. Select all three lines in the R script and click on “Run”. Note: version 3.17 requires a R version 4.3 (which.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(version = "3.17")
Install / Load packages:
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("DESeq2") #load packages library("DESeq2") library(ggplot2) library("ggrepel") library("EnhancedVolcano") #install.packages("dendsort") library(dendsort) library("reshape2") library("plyr") library(Cairo) library(pheatmap) library("dendextend") require(reshape2)
Prepare INPUT files
INPUT 1: Create a folder called ‘data’ and copy the gene counts file to a folder. For example, the gene counts table looks like this:
gene_id gene_name CD49fmNGFRm_rep1 CD49fmNGFRm_rep2 CD49fmNGFRm_rep3 CD49fpNGFRp_rep1 CD49fpNGFRp_rep2 CD49fpNGFRp_rep3 MTEC_rep1 MTEC_rep2 MTEC_rep3 ENSMUSG00000000001 Gnai3 2460 2395 2749 2686 3972 4419 7095 4484 6414 ENSMUSG00000000003 Pbsn 0 0 0 0 0 0 0 0 0 ENSMUSG00000000028 Cdc45 43 57 55 79 87.999 89 1241 830 1041.999 ENSMUSG00000000031 H19 2 0 1 17.082 24 16.077 200 139 145.604 ENSMUSG00000000037 Scml2 8 8 16 23 29.001 29 69 57 67 ENSMUSG00000000049 Apoh 1 0 2 1 2 0 0 0 2 ENSMUSG00000000056 Narf 522 496 539 368 457 538 1939 1483 1734 ENSMUSG00000000058 Cav2 1352.999 1349 1371.999 2684.001 4370 4386 6018.999 3429 5501 ENSMUSG00000000078 Klf6 4411 3492 4500 3221 3989 4637 3812 2741 3558 ENSMUSG00000000085 Scmh1 1561.002 1549.002 1694.001 2276.001 2663 3290.999 2437.001 1759.001 2325.002 ENSMUSG00000000088 Cox5a 1136 1425 1691 947 1410 1613.998 3173.999 1873.992 2865.998 ENSMUSG00000000093 Tbx2 74 40 66 354 380 460 444 295 436 ENSMUSG00000000094 Tbx4 1 0 0 0 0 0 1 1 1 ENSMUSG00000000103 Zfy2 0 0 0 0 0 0 0 0 0 ENSMUSG00000000120 Ngfr 379 313 363 6577 7083 8670 4050 3336 3534 ENSMUSG00000000125 Wnt3 11 19 21 1 0 0 3 0 3 ENSMUSG00000000126 Wnt9a 69 104 89 405 480 554 1663 1012 1730 ENSMUSG00000000127 Fer 696 776 898 558 777 958 911 700 892 ENSMUSG00000000131 Xpo6 2206 2172 2408 2402 2866 3327.002 3919 3204 3763
INPUT 2: Sample metadata:
sample_name sample_ID group CD49fmNGFRm_rep1 DC1 Differentiated_cells CD49fmNGFRm_rep2 DC2 Differentiated_cells CD49fmNGFRm_rep3 DC3 Differentiated_cells CD49fpNGFRp_rep1 BC1 Basal_cells CD49fpNGFRp_rep2 BC2 Basal_cells CD49fpNGFRp_rep3 BC3 Basal_cells MTEC_rep1 mTEC1 Murine_tracheal_epithelial_cell MTEC_rep2 mTEC2 Murine_tracheal_epithelial_cell MTEC_rep3 mTEC3 Murine_tracheal_epithelial_cell
R script - run the script using Rstudio
A copy of the script is at /demo/run3_full_pipeline/
RNAseq_DESeq2.R script:
# Differential expression analysis # When you see '## USER INPUT', this means you have to modify the code for your computer or dataset. All other code can be run as-is (i.e. you don't need to understand the code, just run it) #### Installing required packages #### # **NOTE: this section only needs to be run once (or occasionally to update the packages) # Install devtools install.packages("devtools", repos = "http://cran.us.r-project.org") # Install R packages. This only needs to be run once. bioconductor_packages <- c("DESeq2", "EnhancedVolcano") cran_packages <- c("ggrepel", "ggplot2", "plyr", "reshape2", "readxl", "FactoMineR", "factoextra") # Compares installed packages to above packages and returns a vector of missing packages new_packages <- bioconductor_packages[!(bioconductor_packages %in% installed.packages()[,"Package"])] new_cran_packages <- cran_packages[!(cran_packages %in% installed.packages()[,"Package"])] # Install missing bioconductor packages if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(new_packages) # Install missing cran packages if (length(new_cran_packages)) install.packages(new_cran_packages, repos = "http://cran.us.r-project.org") # Update all installed packages are the latest version update.packages(bioconductor_packages, ask = FALSE) update.packages(cran_packages, ask = FALSE, repos = "http://cran.us.r-project.org") #### Loading required packages #### # This section needs to be run very time # Load packages bioconductor_packages <- c("DESeq2", "EnhancedVolcano") cran_packages <- c("ggrepel", "ggplot2", "plyr", "reshape2", "readxl", "FactoMineR", "factoextra") lapply(cran_packages, require, character.only = TRUE) lapply(bioconductor_packages, require, character.only = TRUE) #### Import your count data #### # Make sure you have: a) your count table (salmon.merged.gene_counts.tsv file, if you used Nextflow nfcore/rnaseq to analyse your data). Copy this to a subdirectory called 'data'. b) your metadata file. This should be either an Excel file called 'metadata.xlsx' or a tab-separated text file called 'metadata.txt'. It needs 3 columns called 'sample_name', 'sample_ID' and 'group'. The sample names should be EXACTLY the same as the names in the count table. These names are often uninformative and long, so the 'sample_ID' is the sample labels you want to put on your plots. E.g. if you have a 'high fat' group, you might want to rename the samples HF1, HF2, HF3, etc) ## USER INPUT # Set working directory. # Change this to your working directory (In the RStudio menu: Session -> Set working directory -> Choose working directory) setwd("C:/Users/whatmorp/OneDrive - Queensland University of Technology/Desktop/Projects/RNA-Seq downstream analysis") # Import your count data. make sure you've created a 'data' subdirectory and put the count table file there. metacountdata <- read.table("./data/salmon.merged.gene_counts.tsv", header = TRUE, row.names = 1) # Import metadata (option 1: text file). Again, need a metadata.txt file in the data subdirectory. meta <- read.table("./data/metadata.txt", header = T) # Import metadata (option 1: Excel file)) #meta <- read_excel("./data/metadata.xlsx") # Remove 1st columns of metadata (gene_name) counts <- metacountdata[ ,2:ncol(metacountdata)] # Rename sample names to new sample IDs counts <- counts[as.character(meta$sample_name)] colnames(counts) <- meta$sample_ID # Counts need to be rounded to integers counts <- ceiling(counts) #### Outliers and batch effects #### # This section normalises and transforms the count data so that it can be plotted on a PCA plot and a heatmap # Normalise counts by library size ## # Using DeSeq2's estimateSizeFactors(). Note that DeSeq2 does this internally during DEG calling. The normalisation below is done separately for PCA and density plotting. ## USER INPUT # Choose the groups you want to plot in a PCA/Heatmap. You can select any 2 or more of the groups (or all of the groups) you have in your 'groups' column of your metadata table # To see what groups are present, run the following: unique(meta$group) # Now add which groups you want to plot (i.e. replace the groupnames below, and add more, separated by a comma and in "quotes", as needed). NOTE: R is case-sensitive, so these group names must be named EXACTLY the same as in the metadata table. plotgroups <- c("Differentiated_cells", "Basal_cells") # Pull out only the counts from the above groups groupcounts <- counts[meta$group %in% plotgroups] # Set up the initial DeSeq2 experimental parameters. This is the length of the no' of samples being analysed. In full DeSeq2 analysis this is to separate groups as factors, but here a simple count is fine. condition <- factor(1:length(groupcounts)) # Set up the column data. A data frame of sample ID's and conditions coldata <- data.frame(row.names=colnames(groupcounts), condition) # Set up the DeSeq2 data set structure f <- DESeqDataSetFromMatrix(countData = groupcounts, colData = coldata, design= ~ condition) # Estimate the size factors. See DeSeq2 manual for details f <- estimateSizeFactors(f) # Size factors can be viewed by: sizeFactors(f) # Multiply each row (sample) by the corresponding size factor subcount_norm <- as.matrix(groupcounts) %*% diag(sizeFactors(f)) # Re-add column names colnames(subcount_norm) <- colnames(groupcounts) ## Remove low coverage transcripts (mean count < 10) ## # Find the mean of each row (and output as a data frame object) means <- as.data.frame(rowMeans(subcount_norm)) # Then join the means data with the counts means <- cbind(means, subcount_norm) # Then subset out only genes with mean > 10 data <- subset(means, means[ , 1] > 10) # Remove the means column data <- data[,-1] # Transform data data_log <- vst(round(as.matrix(data))) # Transformation can create some infinite values. Can't generate PCA data on these. Can see how many by: sum(sapply(data_log, is.infinite)) # To remove infinite rows, use 'is.finte' or '!is.infinite' data_log <- data_log[is.finite(rowSums(data_log)),] colnames(data_log) <- colnames(groupcounts) ### Set up the PCA plot base data ### # We're using the FactoMineR package to generate PCA plots (http://factominer.free.fr/index.html) # Need to transpose the data first data_log_t <- t(data_log) # Add the group data data_log_t_vars <- data.frame(meta$group[meta$group %in% plotgroups], data_log_t) # Generate the PCA data using FactoMineR package res.pca <- PCA(data_log_t_vars, quali.sup = 1, graph=FALSE) ## Set up the dendogram/heatmaps base data ## # Calculate the distance matrix: distance_matrix <- as.matrix(dist(t(data_log))) #### PCA plot #### # Generate the PCA plot. Groups are shaded with elipses at 95% confidence level. NOTE: at least 4 replicates need to be in a group for an elipses to be drawn. p <- fviz_pca_ind(res.pca, geom.ind = "point", # show points only (but not "text") col.ind = meta$group[meta$group %in% plotgroups], # color by groups pointsize = 5, title = "", addEllipses = TRUE, ellipse.type = "t", ellipse.level = 0.95) + theme(legend.text = element_text(size = 12), legend.title = element_text(size = 14), axis.title=element_text(size=16), axis.text=element_text(size=14)) p ## USER INPUT. Change these to pdf_filename <- "pca.pdf" tiff_filename <- "pca.tiff" # Output as publication quality (300dpi) tiff and pdf (Note: this will create a 'figures subdirectory where all figures will be output) dir.create("figures", showWarnings = FALSE) ggsave(file = paste0("./figures/", tiff_filename), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p) ggsave(file = paste0("./figures/", pdf_filename), device = "pdf", width = 10, height = 8, plot = p) #### Heatmap and dendrogram ####