Content Comparison

Table of Contents

...

Open RStudio (you can type it in the Windows search bar)
Create a new R script: ‘File’ → “New File” → “R script”
Save this script where your samples folders are (‘File’ → ‘Save’). These should be on your H or W drive. Save the script file as ‘scrnaseq.R’

In the following sections you will be copying and running the R code into your scrnaseq.R script.

Cell Ranger (and nfcore/scrnaseq) generates a default folder and file output structure. There will be a main folder that contains all the sample subfolders (NOTE: this is where you must save your R script). Each sample folder will have an ‘outs’ subfolder. This ‘outs’ folder contains a ‘filtered_feature_bc_matrix’ folder, which contains the files that Seurat uses in its analysis.

...

You can manually set your working directory in RStudio by selecting ‘Session' -> 'Set working directory' -> 'Choose directory'. Choose the same directory as you saved your scrnaseq.R script, previous section. This will output the setwd(...) command with your working directory into the console window (bottom left panel). Copy this command to replace the default setwd(...) line in your R script.

...

Code Block

#### 3a. Choose a sample to work with and import the data for that sample into R ####

# Give the sample name here that you want to work with.
## **USER
INPUT**
sample <- "Cerebellum"
# To see the available samples (choose a sample name from this list):
list.dirs(full.names = F, recursive = F)
## **USER INPUT**
sample <- "Cerebellum"


# Use Seurat's 'Read10X()' function to read in the full sample database. Cell Ranger creates 3 main database files that need to be combined into a single Seurat object.
# Note: these datasets can be very large and take several minutes to import into R. 
mat <- Read10X(data.dir = paste0(sample, "/outs/filtered_feature_bc_matrix"))

# Have a look at the top 10 rows and columns to see if the data has been imported correctly. You should see gene IDs as rows and barcodes (i.e. cells) as columns
as.matrix(mat[1:10, 1:10])

# Now convert this to a Seurat object. Again, this may take several minutes to load and use a lot of memory
mat2 <- CreateSeuratObject(counts = mat, project = sample)

# You can see a summary of the data by simply running the Seurat object name
mat2

# Set a colour palette that can contrast multiple clusters when you plot them. 
# You can change these colours as you like. 
# You can see what R colours are available here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
c25 <- c("dodgerblue2", "#E31A1C", "green4",  "#6A3D9A", "#FF7F00", "black", "gold1", "skyblue2", "#FB9A99", "palegreen2", "#CAB2D6", "#FDBF6F", "gray70", "khaki2", "maroon", "orchid1", "deeppink1", "blue1", "steelblue4", "darkturquoise", "green1", "yellow4", "yellow3", "darkorange4", "brown")

...

Code Block

#### 4a. Choose samples to aggregate and import data ####

# See which samples you've run through section 3
# This will list every <samplename>_seurat_filtered.rds data file that you've generated
dir(pattern = "seurat_filtered.rds")

# Enter the sample names you wish to aggregate from the list above (just the names, without the '_seurat_filtered.rds').
## **USER INPUT**
samples <- c("Cerebellum", "Cortex")

# Give your aggregate samples a name (for example, "high_low"). This is for naming plots and such.
sample <- "Cerebellum_Cortex"

# Now import the data files for those samples.
samplist <- list()
for (i in 1:length(samples)){
    assign(samples[i], readRDS(paste0(samples[i], "_seurat_filtered.rds")))
    samplist[[i]] <- get(samples[i])
    }

# Then merge these datasets (note that this tags your cells with your sample names)
data <- merge(samplist[[1]], y = samplist[[2:length(samplist)]], add.cell.ids = samples, project = sample)

# See a summary of your data object (note that 'samples' means cells and 'features' means genes)
data

# This should contain the combined number of cells from all your samples. 
# You can see the number of cells in each sample by typing a one of your combined sample namenames below.
## **USER INPUT**
Cerebellum

4b. Processing expression data (dimensionality reduction)

This is essentially a repeat of dimensionality reduction for individual samples in section 3, so see that section for details. The following steps are required to generate dimensionality reduction analysis (PCA, T-SNE and UMAP) for the aggregate data.

NOTE: this regenerates clustering based on the combined data. Any plots and analysis in this section are therefore based on this, rather than the individual sample clustering. If you wish to compare clustering per sample, you'll need to use the results in section 3, where you analysed each sample separately.

Code Block

#### 4b. Processing expression data (dimensionality reduction) ####

# Normalise data
data1 <- NormalizeData(data)
# Identification of variable features
data1 <- FindVariableFeatures(data1, selection.method = "vst", nfeatures = nrow(data1))
# Scaling the data
all.genes <- rownames(data1)
data1 <- ScaleData(data1, features = all.genes)
# Perform linear dimensional reduction (PCA)
data1 <- RunPCA(data1, features = VariableFeatures(object = data1))

# The above normalises and scales the data, finds variable features and does the initial dimensionality reduction. 
# Now we can run t-SNE and UMAP reduction.
data1 <- RunUMAP(data1, dims = 1:3, verbose = F)
data1 <- RunTSNE(data1, dims = 1:3, verbose = F)

# As in section 3, we can view the top variable genes, but this time for the combined samples. 
# See section 3 for details on modifying the following plot:

# Identify the 10 most highly variable genes
top_genes <- head(VariableFeatures(data1), 10)
# plot variable features with labels
p <- VariableFeaturePlot(data1, pt.size = 2, cols = c("black", "firebrick"))
p <- LabelPoints(plot = p, points = top_genes, repel = TRUE) +
theme_bw() +
theme(text = element_text(size = 17))
p

# Export as a pdf and tiff
tiff_exp <- paste0(sample, "_top_genes.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")
pdf_exp <- paste0(sample, "_top_genes.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

# Now we can visualise your expression data using all 3 dimensionality reduction methods (next section)

4c. PCA, UMAP and t-SNE plots (plotting dimensionality reduction data)

Code Block

#### 4c. PCA, UMAP and t-SNE plots (plotting dimensionality reduction data) ####

# PCA plot
p <- DimPlot(data1, reduction = "pca", group.by = 'orig.ident', pt.size = 2, cols = alpha(c25[1:length(samples)],0.5)) + 
  theme_bw() +
  theme(axis.title=element_text(size=16), axis.text=element_text(size=14))
p

# The above generates a combined plot.
# Alternatively, you can separate the samples into 'facets' in a single plot, by running the following:
# NOTE: run either the combined or the split plot. Exporting the plot only exports the last plot you generated.
p <- DimPlot(data1, reduction = "pca", split.by = 'orig.ident', group.by = 'orig.ident', pt.size = 2, cols = alpha(c25[1:length(samples)],0.5)) + 
  theme_bw() +
  theme(axis.title=element_text(size=16), axis.text=element_text(size=14))
p

# Export as a pdf and tiff
tiff_exp <- paste0(sample, "_PCA_pre_filtration.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")
pdf_exp <- paste0(sample, "_PCA_pre_filtration.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")


# UMAP plot, samples combined
p <- DimPlot(data1, reduction = "umap", group.by = 'orig.ident', pt.size = 2, cols = alpha(c25[1:length(samples)], 0.5)) + 
  theme_bw() +
  theme(legend.position="none", axis.title=element_text(size=16), axis.text=element_text(size=14))
p

# Or as faceted plot (as above, with PCA)
p <- DimPlot(data1, reduction = "umap", split.by = 'orig.ident', group.by = 'orig.ident', pt.size = 2, cols = alpha(c25[1:length(samples)],0.5)) + 
  theme_bw() +
  theme(axis.title=element_text(size=16), axis.text=element_text(size=14))
p

# Export as a pdf and tiff
tiff_exp <- paste0(sample, "_umap_pre_filtration.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")
pdf_exp <- paste0(sample, "_umap_pre_filtration.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")


# tSNE plot, samples combined
p <- DimPlot(data1, reduction = "tsne", group.by = 'orig.ident', pt.size = 2, cols = alpha(c25[1:length(samples)], 0.5)) + 
  theme_bw() +
  theme(legend.position="none", axis.title=element_text(size=16), axis.text=element_text(size=14))
p

# Or as faceted plot
p <- DimPlot(data1, reduction = "tsne", split.by = 'orig.ident', group.by = 'orig.ident', pt.size = 2, cols = alpha(c25[1:length(samples)],0.5)) + 
  theme_bw() +
  theme(axis.title=element_text(size=16), axis.text=element_text(size=14))
p

# Export as a pdf and tiff
tiff_exp <- paste0(sample, "_tsne_pre_filtration.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")
pdf_exp <- paste0(sample, "_tsne_pre_filtration.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

4d Visualise gene expression by marker

This section allows you to produce a side-by-side visualisation of gene expression for your aggregated samples, for specific genes.

Code Block

#### 4d Visualise gene expression by marker ####

# Dimensionality plots
# Generate PCA, UMAP or t-SNE plots for selected markers, with side-by-side plots for each sample.

# Enter which markers you want to examine (can be one or many markers):
markers <- c("P2ry12", "Tmem119", "Itgam")

# Select which type of plot you want to generate ("pca", "umap" or "tsne"):
## **USER INPUT**
redplot <- "pca"

# Generate the plot
p <- FeaturePlot(data1, features = markers, reduction = redplot, cols = c("lightgrey", "red"), split.by = 'orig.ident', pt.size = 1)
p

# Export as a pdf and tiff
tiff_exp <- paste0(sample, "_", redplot, "_markers_filtered.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")
pdf_exp <- paste0(sample, "_", redplot, "_markers_filtered.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")


# Violin plot
# Here you can visualise the relative expression of markers on each cluster. 
# NOTE: This plot is specifically for comparing 2 samples. This plot may not be informative if you have more than 2 samples in your aggregte dataset

# Generate the plot
p <- VlnPlot(data1, features = markers, split.plot = TRUE, split.by = 'orig.ident', cols = c25[1:length(samples)])

# Export as a pdf and tiff
tiff_exp <- paste0(sample, "_markers_violin.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")
pdf_exp <- paste0(sample, "_markers_violin.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")


# Heatmap
# You can also visualise differences in marker expression between samples with a heatmap

# Generate the heatmap
p <- DoHeatmap(data1, features = markers, raster = T, group.by = 'orig.ident') + 
scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4")) + 
theme(text = element_text(size = 16)) + labs(color = "Dose (mg)")
p

# Export as a pdf and tiff
tiff_exp <- paste0(sample, "_hmap_markers.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")
pdf_exp <- paste0(sample, "_hmap_markers.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

4e. Differential expression between samples

In the next section (section 5) we examine differential expression between clusters in individual samples. In this section we examine differential expression between samples.

We’ll be using the Seurat function FindMarkers to identify differentially expressed genes. This may require some modification (in particular the logfc.threshold parameter), depending on how many DE genes in your dataset and how significant they are.

By default, we’ve set logfc.threshold = 0.2 . This only tests genes with at least 0.2 log fold difference in expression and speeds up the analysis considerably. But it could also remove some significant DE genes.

To test for this, look at the bottom 6 genes (ordered by p value) by running the tail(DE_genes) command (in the script below). If these genes are all non-significant (i.e. p_val_adj > 0.05) then you have captured all significant genes. If all these genes are significant (p_val_adj < 0.05) then re-run FindMarkers() with logfc.threshold = 0.1. This will take much longer to run, but should then capture all DE genes (if not, reduce logfc.threshold = 0, but will take a very long time to run).

Code Block

#### 4e. Differential expression between samples ####

# You will need to choose two of your samples to compare, based on your aggregated dataset. 
# To see which samples you've aggregated:
samples

# Now select the first sample (which will be the baseline sample):
## **USER INPUT**
desamp1 <- "Cerebellum"

# Then select the second (comparison) sample:
## **USER INPUT**
desamp2 <- "Cortex"

# Then run the Seurat differential expression function:
DE_genes <- FindMarkers(data1, group.by = 'orig.ident', ident.1 = desamp1, ident.2 = desamp2, logfc.threshold = 0.2)

# We can see how many 'differentially expressed' genes this produced by:
nrow(DE_genes)

# As mentioned in the descriptive text for this section, using 'logfc.threshold = 0.2' may not find all DE genes

# Run the following:
tail(DE_genes)
# If all the listed genes are non-significant (i.e. p_val_adj < 0.05) then you have captured all genes
# If not, change to 'logfc.threshold = 0.1' and re-run the 'DE_genes <- FindMarkers(..)' line
# Run 'tail(DE_genes)' to again see if you've captured all DE genes
# If you still have significant genes in your list, re-run with 'logfc.threshold = 0'
# This will take a long time to run, but is guaranteed to capture all DE genes

# You can also view the top 20 DE genes by:
head(DE_genes, 20)

# You can extract just the **significantly** differentially expressed genes (adjusted p < 0.05) like so:
DE_genes_sig <- DE_genes[DE_genes$p_val_adj < 0.05, ]

# See how many significantly DE genes there are:
nrow(DE_genes_sig)

# You can also extract genes based on log fold change as well. 
# Select a log fold change threshold here:
## **USER INPUT**
lfc_threshold <- 0.3

# Then filter your data by this log fold change threshold.
DE_genes_sig_lfc <- DE_genes_sig[DE_genes_sig$avg_log2FC < -lfc_threshold | DE_genes_sig$avg_log2FC > lfc_threshold, ]

# Then see how many sig DE genes remain after lfc filtration:
nrow(DE_genes_sig_lfc)

# **NOTE** You can easily filter out all your results here by using too high a lfc threshold. 
# You can see the maximum log fold change for all DE genes like so:
max(DE_genes$avg_log2FC)

# And you can view the minimum as well. Use this min/max log fold change information to decide on a suitable log fold change threshold. This can vary greatly, depending on your data.

Version	Old Version 49	New Version 50
Changes made by	Paul Whatmore (Deactivated)	Paul Whatmore (Deactivated)
Saved on	20/02/2024	22/02/2024

Versions Compared

Key

4b. Processing expression data (dimensionality reduction)

4c. PCA, UMAP and t-SNE plots (plotting dimensionality reduction data)

4d Visualise gene expression by marker

4e. Differential expression between samples