Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Table of Contents

...

  1. Open RStudio (you can type it in the Windows search bar)

  2. Create a new R script: ‘File’ → “New File” → “R script”

  3. Save this script where your samples folders are (‘File’ → ‘Save’). These should be on your H or W drive. Save the script file as scrnaseq.R

In the following sections you will be copying and running the R code into your scrnaseq.R script.

Cell Ranger (and nfcore/scrnaseq) generates a default folder and file output structure. There will be a main folder that contains all the sample subfolders (NOTE: this is where you must save your R script). Each sample folder will have an ‘outs’ subfolder. This ‘outs’ folder contains a ‘filtered_feature_bc_matrix’ folder, which contains the files that Seurat uses in its analysis.

...

You can manually set your working directory in RStudio by selecting ‘Session' -> 'Set working directory' -> 'Choose directory'. Choose the same directory as you saved your scrnaseq.R script, previous section. This will output the setwd(...) command with your working directory into the console window (bottom left panel). Copy this command to replace the default setwd(...) line in your R script.

...

IMPORTANT: the gene symbols you provide in the following section have to exactly match the gene symbols in your dataset (including capitalisation).
Gene symbols are more like 'common names' and can vary between databases.
Your main gene identifiers are Ensembl IDs and we need to find the gene symbols that match these Ensembl IDs.
For example, the gene P2ry12 is also called ADPG-R, BDPLT8, HORK3 and various other IDs, depending on the database it's listed in.
In the Ensemble database it's listed as P2ry12 (not P2RY12, remember, case matters) and matches Ensembl ID ENSMUSG00000036353.
For this reason it's advisable to first search the Ensembl website for your markers of interest and for your organism, to ensure you are providing gene symbols that match the Ensembl IDs.
https://asia.ensembl.org/Mus_musculus/Info/Index

Code Block
#### 2f. IdenitifyIdentify markers in cells ####

# Create a vector called 'markers' that contains each of the markers you want to examine. 
# These should be gene symbols. Replace the gene symbols below with your target markers.
## **USER INPUT**
markers <- c("P2ry12", "Tmem119", "Itgam")

# You can see if the markers you provided are present:
sum(row.names(mat) %in% markers)
# If you input 3 markers and the output from the above code = 3, then all are present. 
# If the result is 2 then 2 of the 3 markers you provided are found in your data, etc.

## USER INPUT
# You can see if an individual marker is present like so (substitute for a marker of choice):
sum(row.names(mat) == "P2ry12")
# Outputs 1 if the marker is present, 0 if it isn't

# Pull out just the read counts for your defined markers
y <- mat[row.names(mat) %in% markers, ]

# Now we can count the number of cells containing zero transcripts for each of the examined markers. 
# This enables an examination of the number of cells that have zero expression for these markers,
# and therefore the number of cells that can be considered non-target cells.

# First count all cells
# Then make a loop to cycle through all markers (defined in previously created 'markers' vector)
a <- length(colnames(y))
for (i in 1:length(markers)) {
  
  a <- c(a, sum(y[i,] == 0))
  
}
# Do a sum of the columns
y2 <- colSums(y)
# See if any zeros. If so, these cells are not target cells (as determined by absence of any target cell markers)
count <- c(a, sum(y2 == 0))
# Name the vector elements
names(count) <- c("Total_cells", markers, "All_zero")
# Generate the table
as.data.frame(count)

# The abovegenerated table shows:
# 1) the total number of cells for your sample ("Total_cells" row), 
# 2) the number of cells which had 0 expression for each marker, and,
# and3) the number of cell that had zero expression for all of the markers you provided ("All_zero" row).

2g. Processing expression data (dimensionality reduction)

...

Code Block
#### 2g. Processing expression data (dimensionality reduction) ####

# Normalise data
mat3 <- NormalizeData(mat2)
# Identification of variable features
mat3 <- FindVariableFeatures(mat3, selection.method = "vst", nfeatures = nrow(mat3))
# Scaling the data
all.genes <- rownames(mat3)
mat3 <- ScaleData(mat3, features = all.genes)
# Perform linear dimensional reduction (PCA. This may take several minutes to run)
mat3 <- RunPCA(mat3, features = VariableFeatures(object = mat3))

...

Using the FindVariableFeatures results, we can visualise the most highly variable genes, including a count of variable and non variable genes in your dataset. The below code ouputs outputs the top 10 genes, but you can ajust adjust this number as desired (i.e. in top_genes <- head(VariableFeatures(mat3), 10) change 10 to another number).

...

Axis text size: theme(text = element_text(size = 17)). There are a large number of parameters that can be modified with theme(). Here we've just changed the axis text to size 17. See here for other parameters that can be changed with theme(): https://ggplot2.tidyverse.org/reference/theme.html

Surat Seurat plots are based on the ggplot package. There are a multitude of other modifications you can make to a ggplot, too many to describe in this notebook. But there are plenty of online guides on how to modify ggplot plots. Here's an example: http://www.sthda.com/english/wiki/be-awesome-in-ggplot2-a-practical-guide-to-be-highly-effective-r-software-and-data-visualization

Code Block
#### 2h. Plot of highly variable genes ####

# Identify the 10 most highly variable genes
top_genes <- head(VariableFeatures(mat3), 10)

# plot variable features with labels
p <- VariableFeaturePlot(mat3, pt.size = 2, cols = c("black", "firebrick"))
p <- LabelPoints(plot = p, points = top_genes, repel = TRUE) +
theme_bw() +
theme(text = element_text(size = 17))
p

# You can save your plot as a 300dpi (i.e. publication quality) tiff or pdf file. 
# These files can be found in your working directory.
# You can adjust the width and height of the saved images by changing width = and height = in the code below.

# Export as a 300dpi tiff
tiff_exp <- paste0(sample, "_top_genes.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

# Export as a pdf
pdf_exp <- paste0(sample, "_top_genes.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

2i. PCA, UMAP and t-SNE plots (plotting dimensionality reduction data)

In the section 2g we ran dimensionality reduction based on Principal Component Analysis (PCA).

There are other dimensionality reduction methods that are more commonly presented in singe cell papers, particularly Uniform Manifold Approximation and Projection (UMAP and t-distributed stochastic neighbour embedding (t-SNE) dimensionality reduction, so here we will generate and visualise dimensionality reduction from all 3 methods (PCA, UMAP, t-SNE).

First we need to also run UMAP and tSNE dimensionality reduction and add these results to our main Seurat object.

Code Block
mat3 <- RunUMAP(mat3, dims = 1:3, verbose = F)
mat3 <- RunTSNE(mat3, dims = 1:3, verbose = F)

Now we can plot all 3 results. You can modify the code to alter the appearance of these plots as described in section 2h.

PCA

Code Block
# Generate the PCA plot
p <- DimPlot(mat3, reduction = "pca", pt.size = 2, label = TRUE, label.size = 6, label.color = "white", cols = c("firebrick")) +
theme_bw() +
theme(legend.position="none", axis.title=element_text(size=16), axis.text=element_text(size=14))
p

# Export as a TIFF and PDF
tiff_exp <- paste0(sample, "_PCA_pre_filtration.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

pdf_exp <- paste0(sample, "_PCA_pre_filtration.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

UMAP

Code Block
p <- DimPlot(mat3, reduction = "umap", pt.size = 2, label = TRUE, label.size = 8, label.color = "black", cols = c("firebrick")) +
theme_bw() +
theme(legend.position="none", axis.title=element_text(size=16), axis.text=element_text(size=14))
p

tiff_exp <- paste0(sample, "_umap_pre_filtration.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

pdf_exp <- paste0(sample, "_umap_pre_filtration.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

tSNE

Code Block
p <- DimPlot(mat3, reduction = "tsne", pt.size = 2, label = TRUE, label.size = 8, label.color = "black", cols = c("firebrick")) +
theme_bw() +
theme(legend.position="none", axis.title=element_text(size=16), axis.text=element_text(size=14))
p

tiff_exp <- paste0(sample, "_tsne_pre_filtration.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

pdf_exp <- paste0(sample, "_tsne_pre_filtration.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

2j. Remove low quality or outlier cells

From the Seurat website:

Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. A few QC metrics commonly used by the community include

- Low-quality cells or empty droplets will often have very few genes

- Cell doublets or multiplets may exhibit an aberrantly high gene count

So in this section we can filter out cells that have very low and very high gene counts.