2024-2: 5a.3 Checking for outliers and batch effects

1 Outliers and batch effects
- 1.1 5a. PCA plot
- 1.2 5b. Pairwise samples heatmap

Outliers and batch effects

In this section, we will create PCA plots and heatmaps to examine the relationships between samples. Outlier samples and batch effects can heavily bias your results and should be addressed (e.g. removal of outlier samples from the dataset) before any differential expression analysis is completed.

First, we need to prepare the data for plotting. Copy, paste, and run the following code in your R script. You will need to input which treatment groups you wish to plot (plotgroups <-) from the set of available treatment groups (which you can find out with unique(meta$group).

#### 5. Outliers and batch effects ####
# This section normalises and transforms the count data so that it can be plotted on a PCA plot and a heatmap
## USER INPUT
# Choose the groups you want to plot in a PCA/Heatmap. You can select any 2 or more of the groups (or all of the groups) you have in your 'groups' column of your metadata table
# To see what groups are present, run the following:
unique(meta$group)

# Now add which groups you want to plot (i.e. replace the groupnames below, and add more, separated by a comma and in "quotes", as needed). NOTE: R is case-sensitive, so these group names must be named EXACTLY the same as in the metadata table.
#plotgroups <- c("Murine_tracheal_epithelial_cell", "Basal_cells")
plotgroups <- c("Differentiated_cells", "Basal_cells")

# Pull out only the counts from the above groups
groupcounts <- counts[meta$group %in% plotgroups]

# Normalise counts by library size, using DeSeq2's estimateSizeFactors() function. Note that DeSeq2 does this internally during DEG calling. The normalisation below is done separately for PCA and density plotting.
# Set up the initial DeSeq2 experimental parameters.
condition <- factor(1:length(groupcounts))
# Set up the column data. A data frame of sample ID's and conditions
coldata <- data.frame(row.names=colnames(groupcounts), condition)
# Set up the DeSeq2 data set structure
f <- DESeqDataSetFromMatrix(countData = groupcounts, colData = coldata, design= ~ condition)
# Estimate the size factors. See DeSeq2 manual for details
f <- estimateSizeFactors(f)
# Size factors can be viewed by: sizeFactors(f)
# Multiply each row (sample) by the corresponding size factor
subcount_norm <- as.matrix(groupcounts) %*% diag(sizeFactors(f))
# Re-add column names
colnames(subcount_norm) <- colnames(groupcounts)

## Remove low coverage transcripts (mean count < 10) ##
# Find the mean of each row (and output as a data frame object)
means <- as.data.frame(rowMeans(subcount_norm))
# Then join the means data with the counts
means <- cbind(means, subcount_norm)
# Then subset out only genes with mean > 10
data <- subset(means, means[ , 1] > 10)
# Remove the means column
data <- data[,-1]

# Transform data
data_log <- vst(round(as.matrix(data)))
# Transformation can create some infinite values. Can't generate PCA data on these. Can see how many by: sum(sapply(data_log, is.infinite))
# To remove infinite rows, use 'is.finte' or '!is.infinite'
data_log <- data_log[is.finite(rowSums(data_log)),]
colnames(data_log) <- colnames(groupcounts)

### Set up the PCA plot base data ###
# We're using the FactoMineR package to generate PCA plots (http://factominer.free.fr/index.html)
# Need to transpose the data first
data_log_t <- t(data_log)
# Add the group data
data_log_t_vars <- data.frame(meta$group[meta$group %in% plotgroups], data_log_t)
# Generate the PCA data using FactoMineR package
res.pca <- PCA(data_log_t_vars, quali.sup = 1, graph=FALSE)
## Set up the dendogram/heatmaps base data ##
# Calculate the distance matrix:
distance_matrix <- as.matrix(dist(t(data_log)))
distance_matrix

5a. PCA plot

Now you can run the following code in your R script to generate the PCA plot.

We will be using the factoextra package to generate both the PCA data and the PCA plot.

#### 5a. PCA plot ####
# Generate the PCA plot. Groups are shaded with ellipses at 95% confidence level. NOTE: at least 4 replicates need to be in a group for an ellipses to be drawn.
# NOTE: change the group point colours by changing 'palette = ' below. Use the 'RColourBrewer' colour names (https://r-graph-gallery.com/38-rcolorbrewers-palettes.html). For example, if you are plotting 3 groups and choose palette = "Set1", this will use the first 3 colours from the Set1 colour palette.
p <- fviz_pca_ind(res.pca,
                  geom.ind = "point", # show points only (but not "text")
                  col.ind = meta$group[meta$group %in% plotgroups], # color by groups
                  pointsize = 5, title = "", legend.title = "Treatment groups", palette = "Dark2",
                  addEllipses = TRUE, ellipse.type = "t", ellipse.level = 0.95) + theme(legend.text = element_text(size = 12), legend.title = element_text(size = 14), axis.title=element_text(size=16), axis.text=element_text(size=14))
p
# Output as publication quality (300dpi) tiff and pdf.
# This will name your output files with the treatment groups you selected.
# Create a 'figures' subdirectory where all figures will be output
dir.create("figures", showWarnings = FALSE)
# Create a (300dpi) tiff
ggsave(file = paste0("./figures/PCA_", paste(plotgroups, collapse = "_Vs_"), ".tiff"), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p)
# Create a pdf
ggsave(file = paste0("./figures/PCA_", paste(plotgroups, collapse = "_Vs_"), ".pdf"), device = "pdf", width = 10, height = 8, plot = p)
# Create a png
ggsave(file = paste0("./figures/PCA_", paste(plotgroups, collapse = "_Vs_"), ".png"), device = "png", width = 10, height = 8, plot = p)

Figure 1. PCA showing the differentiation of Basal cells vs Differential cells

5b. Pairwise samples heatmap

Now generate the heatmap and dendrogram by copying and pasting this section of code.

We will be using the pretty heatmap package to accomplish this.

#### 5b. Samples heatmap and dendrogram ####
# This section plots a heatmap and dendrogram of pairwise relationships between samples. In this way you can see if samples cluster by treatment group.
# See here: https://davetang.org/muse/2018/05/15/making-a-heatmap-in-r-with-the-pheatmap-package/
# Define annotation column
annot_columns <- data.frame(meta$group[meta$group %in% plotgroups])
# Make the row names the sample IDs
row.names(annot_columns) <- meta$sample_ID[meta$group %in% plotgroups]
colnames(annot_columns) <- "Treatment groups"
# Need to factorise it
annot_columns[[1]] <- factor(annot_columns[[1]])
# Generate dendrogram and heatmap
pheatmap(distance_matrix, color=colorRampPalette(c("white", "#9999FF", "#990000"))(50), cluster_rows = TRUE, show_rownames = TRUE, treeheight_row = 0, treeheight_col = 70, fontsize_col = 12, annotation_names_col = F, annotation_col = annot_columns, filename = paste0("./figures/Pairwise_sample_heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff"))
pheatmap(distance_matrix, color=colorRampPalette(c("white", "#9999FF", "#990000"))(50), cluster_rows = TRUE, show_rownames = TRUE, treeheight_row = 0, treeheight_col = 70, fontsize_col = 12, annotation_names_col = F, annotation_col = annot_columns, filename = paste0("./figures/Pairwise_sample_heatmap_", paste(plotgroups, collapse = "_Vs_"), ".pdf"))
pheatmap(distance_matrix, color=colorRampPalette(c("white", "#9999FF", "#990000"))(50), cluster_rows = TRUE, show_rownames = TRUE, treeheight_row = 0, treeheight_col = 70, fontsize_col = 12, annotation_names_col = F, annotation_col = annot_columns, filename = paste0("./figures/Pairwise_sample_heatmap_", paste(plotgroups, collapse = "_Vs_"), ".png"))
# Notes about heatmap colours.
# You can change the colours used in the heatmap itself by changing the colour names (color=colorRampPalette....)
# If you want to change the annotation colours, see here: https://zhiganglu.com/post/pheatmap_change_annotation_colors/

Pairwise_sample_heatmap_Differentiated_cells_Vs_Basal_cells.png

Figure 2: Heat map showing how samples cluster by treatment group

R packages - installing, loading and data importation - previous

Identify DE genes using DESeq2 - next