Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Table of Contents

Aim:

Identify statistically significant (FDR < 0.05) differentially expressed genes. Visualise results with PCA plots, heatmaps and volcano plots.

...

To access this count table:

Go to the/sandpit/demo/run3_full_pipeline/ folder that contains the results from running the nfcore/rnaseq pipeline. The output folders from task 3 look like this:

...

Now let’s find the full path to the ‘salmon.merged.gene_counts.tsv’ file:

  • Windows:

  • Mac:

    • cd /folder/that/contains/feature_counts/

    • pwd

  • Rstudio:

    • Open Rstudio, go to the top bar a click on “Session” → “Select working directory: → “Choose directory

    • The path to the directory will be printed in Rstudio console, copy and paste in line 1 of the script ‘RNAseq_DESeq2_analysis.R’ (see below)

...

First we need to prepare the data for plotting. Copy, paste and run the following code. you will need to input which treatment groups you wish to plot (plotgroups <-) from the set of available treatment groups.

Info

Paul, the column names of counts is the sample_ID not group, so the groupcounts below doesn’t work (ends up with 0 columns with 51912 rows)

Code Block
#### 5. Outliers and batch effects ####

# This section normalises and transforms the count data so that it can be plotted on a PCA plot and a heatmap

## USER INPUT
# Choose the groups you want to plot in a PCA/Heatmap. You can select any 2 or more of the groups (or all of the groups) you have in your 'groups' column of your metadata table
# To see what groups are present, run the following:
unique(meta$group)
# Now add which groups you want to plot (i.e. replace the groupnames below, and add more, separated by a comma and in "quotes", as needed). NOTE: R is case-sensitive, so these group names must be named EXACTLY the same as in the metadata table.
plotgroups <- c("Differentiated_cells", "Basal_cells")

# Pull out only the counts from the above groups
groupcounts <- counts[meta$group %in% plotgroups]

# Normalise counts by library size, using DeSeq2's estimateSizeFactors() function. Note that DeSeq2 does this internally during DEG calling. The normalisation below is done separately for PCA and density plotting.
# Set up the initial DeSeq2 experimental parameters.
condition <- factor(1:length(groupcounts))
# Set up the column data. A data frame of sample ID's and conditions
coldata <- data.frame(row.names=colnames(groupcounts), condition)
# Set up the DeSeq2 data set structure
f <- DESeqDataSetFromMatrix(countData = groupcounts, colData = coldata, design= ~ condition)
# Estimate the size factors. See DeSeq2 manual for details
f <- estimateSizeFactors(f)
# Size factors can be viewed by: sizeFactors(f)

# Multiply each row (sample) by the corresponding size factor
subcount_norm <- as.matrix(groupcounts) %*% diag(sizeFactors(f))
# Re-add column names
colnames(subcount_norm) <- colnames(groupcounts)

## Remove low coverage transcripts (mean count < 10) ##

# Find the mean of each row (and output as a data frame object)
means <- as.data.frame(rowMeans(subcount_norm))
# Then join the means data with the counts
means <- cbind(means, subcount_norm)
# Then subset out only genes with mean > 10
data <- subset(means, means[ , 1] > 10)
# Remove the means column
data <- data[,-1]

# Transform data
data_log <- vst(round(as.matrix(data)))
# Transformation can create some infinite values. Can't generate PCA data on these. Can see how many by: sum(sapply(data_log, is.infinite))
# To remove infinite rows, use 'is.finte' or '!is.infinite'
data_log <- data_log[is.finite(rowSums(data_log)),]
colnames(data_log) <- colnames(groupcounts)

### Set up the PCA plot base data ###

# We're using the FactoMineR package to generate PCA plots (http://factominer.free.fr/index.html)

# Need to transpose the data first
data_log_t <- t(data_log)
# Add the group data
data_log_t_vars <- data.frame(meta$group[meta$group %in% plotgroups], data_log_t)
# Generate the PCA data using FactoMineR package
res.pca <- PCA(data_log_t_vars, quali.sup = 1, graph=FALSE)

## Set up the dendogram/heatmaps base data ##

# Calculate the distance matrix:
distance_matrix <- as.matrix(dist(t(data_log)))
Warning

5a. PCA plot

Now you can run the following code in your R script to generate the PCA plot.

...