Table of Contents

Aim

Analyse 10x Genomics* single cell RNA-Seq data.

...

*Note: this workflow can be adapted to work with scRNA-Seq datasets generated by other sequencing technologies than 10x Genomics.

Requirements

Table of Contents

Aim

Analyse 10x Genomics* single cell RNA-Seq data.

This analysis workflow is split into 2 main sections: 1) ‘Upstream’ analysis on QUT’s HPC (high performance compute cluster) using aNextflow workflow, nfcore/scrnaseq, 2) ‘Downstream’ analysis in R, primarily using the package seurat.

*Note: this workflow can be adapted to work with scRNA-Seq datasets generated by other sequencing technologies than 10x Genomics.

Requirements

A HPC account. If you do not have one, please request one here.
Access to an rVDI virtual desktop machine with 64GB RAM. For information about rVDI and requesting a virtual machine, see here.
Nextflow installed on your HPC home account. If you haven’t already installed Nextflow, do so by following the guide here.
Your scRNA-Seq data (fastq files) are on the HPC. If you are having difficulties transferring them to the HPC, submit a support ticket here.

...

We’ll be using various R packages to analyse our 10x single cell data. In this section we’ll create an R script in RStudio and install the required R packages.

2a. Open RStudio and create a new R script

RStudio is a GUI (graphical user interface) for R. It makes navigating R easier.

Open RStudio (you can type it in the Windows search bar)
Create a new R script: ‘File’ → “New File” → “R script”
Save this script where your samples folders are (‘File’ → ‘Save’). These should be on your H or W drive. Save the script file as ‘scrnaseq.R’

In the following sections you will be copying and running the R code into your scrnaseq.R script.

Cell Ranger (and nfcore/scrnaseq) generates a default folder and file output structure. There will be a main folder that contains all the sample subfolders (NOTE: this is where you must save your R script). Each sample folder will have an ‘outs’ subfolder. This ‘outs’ folder contains a ‘filtered_feature_bc_matrix’ folder, which contains the files that Seurat uses in its analysis.

If you have already created this script in a previous session, you can just re-open it by browsing to the location (somewhere in your mapped H drive, probably) and double clicking on it.

2b. Set your working directory

In R, your working directory is where your data files are read in to R from and where any output files are deposited. For our purposes we need to set the working directory to the location on the HPC where your scRNASeq dataset is.

Most of the scripts can be run without modification, but there are a few lines that you will need to change, such as the working directory (which will differ for each researcher’s dataset).

When you see **USER INPUT** in the script, this means you have to modify the line below this.

You can manually set your working directory in RStudio by selecting ‘Session' -> 'Set working directory' -> 'Choose directory'. Choose the same directory as you saved your scrnaseq.R script, previous section. This will output the setwd(...) command with your working directory into the console window (bottom left panel). Copy this command to replace the default setwd(...) line in your R script.

Copy and paste the following code into the R script you just created, then run the code (highlight all the code in your R script, then press the run button).

...

Code Block

#### 2b. Set your working directory ####

# Change the below to the directory that contains your sample folders (you may have to browse H or W drive to find this)
# **USER INPUT**
setwd("H:/sam_dando/dataset1/count")

# You can see the sample subdirectories by:
list.dirs(full.names = F, recursive = F)
# You should see directories that are names after your samples. 
# If you don't see this, browse through your H or W drives to find the correct path for your sample directories.

2c. Installing packages

This will install all the required packages and dependencies and may take 30 minutes or more to complete. It may prompt you occasionally to update packages - select 'a' for all if/when this occurs.

...

R packages.

2a. Open RStudio and create a new R script

RStudio is a GUI (graphical user interface) for R. It makes navigating R easier.

Open RStudio (you can type it in the Windows search bar)
Create a new R script: ‘File’ → “New File” → “R script”
Save this script where your samples folders are (‘File’ → ‘Save’). These should be on your H or W drive. Save the script file as ‘scrnaseq.R’

In the following sections you will be copying and running the R code into your scrnaseq.R script.

Cell Ranger (and nfcore/scrnaseq) generates a default folder and file output structure. There will be a main folder that contains all the sample subfolders (NOTE: this is where you must save your R script). Each sample folder will have an ‘outs’ subfolder. This ‘outs’ folder contains a ‘filtered_feature_bc_matrix’ folder, which contains the files that Seurat uses in its analysis.

If you have already created this script in a previous session, you can just re-open it by browsing to the location (somewhere in your mapped H drive, probably) and double clicking on it.

2b. Set your working directory

In R, your working directory is the location from which your data files are read in to R and any output files are deposited. For our purposes we need to set the working directory to the location on the HPC where your scRNASeq dataset is.

Most of the scripts can be run without modification, but there are a few lines that you will need to change, such as the working directory (which will differ for each researcher’s dataset).

When you see **USER INPUT** in the script, this means you have to modify the line below this.

You can manually set your working directory in RStudio by selecting ‘Session' -> 'Set working directory' -> 'Choose directory'. Choose the same directory as you saved your scrnaseq.R script, previous section. This will output the setwd(...) command with your working directory into the console window (bottom left panel). Copy this command to replace the default setwd(...) line in your R script.

Copy and paste the following code into the R script you just created, then run the code (highlight all the code in your R script, then press the run button).

...

Code Block

#### 2b. Set your working directory ####

# Change the below to the directory that contains your sample folders (you may have to browse H or W drive to find this)
# **USER INPUT**
setwd("H:/sam_dando/dataset1/count")

# You can see the sample subdirectories by:
list.dirs(full.names = F, recursive = F)
# You should see directories that are names after your samples. 
# If you don't see this, browse through your H or W drives to find the correct path for your sample directories.

2c. Installing packages

This will install all the required packages and dependencies and may take 30 minutes or more to complete. It may prompt you occasionally to update packages - select 'a' for all if/when this occurs.

Code Block

#### 2c. Installing required packages ####

# This section only needs to be run once on a computer. 
# One the packages are installed, they need to be loaded every time they will be used (next section)

# Create vector of required package names
bioconductor_packages <- c("clusterProfiler", "pathview", "AnnotationHub", "org.Mm.eg.db")
cran_packages <- c("Seurat", "patchwork", "ggplot2", "tidyverse", "viridis", "plyr", "readxl", "scales")

# Compares installed packages to above packages and returns a vector of missing packages
new_packages <- bioconductor_packages[!(bioconductor_packages %in% installed.packages()[,"Package"])]
new_cran_packages <- cran_packages[!(cran_packages %in% installed.packages()[,"Package"])]

# Install missing bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(new_packages)

# Install missing cran packages
if (length(new_cran_packages)) install.packages(new_cran_packages, repos = "http://cran.us.r-project.org")

# Update all installed packages to the latest version
update.packages(bioconductor_packages, ask = FALSE)
update.packages(cran_packages, ask = FALSE, repos = "http://cran.us.r-project.org")

2d. Loading packages

Code Block
#### 2d. Loading required packages #### # This section

...

needs

...

 to be run every time

# Load packages
bioconductor_packages <- c("clusterProfiler", "pathview", "AnnotationHub", "org.Mm.eg.db")
cran_packages <- c("Seurat", "patchwork", "ggplot2", "tidyverse", "viridis", "plyr", "readxl", "scales")
lapply(cran_packages, require,

...

character.only

...

2d. Loading packages

Code Block

#### 2d. Loading required packages ####

# This section needs to be run every time

# Load packages
bioconductor_packages <- c("clusterProfiler", "pathview", "AnnotationHub", "org.Mm.eg.db")
cran_packages <- c("Seurat", "patchwork", "ggplot2", "tidyverse", "viridis", "plyr", "readxl", "scales")
lapply(cran_packages, require, character.only = TRUE)
lapply(bioconductor_packages, require, character.only = TRUE)

3. Analysis using the seurat package

Seurat is:

A toolkit for quality control, analysis, and exploration of single cell RNA sequencing data. 'Seurat' aims to enable users to identify and interpret sources of heterogeneity from single cell transcriptomic measurements, and to integrate diverse types of single cell data.

In this section 3 we’ll be using a variety of Seurat functions to analyse our scRNA-Seq data.

3a. Select a sample to work with and import the data into R

In this section we’ll choose one of our samples to analyse, then import that sample’s scRNA-Seq dataset into R. You can re-run this, and following sections, for each sample dataset that you have. Just replace the sample name in sample <- "xxxxxx" below.

Code Block

#### 3a. Choose a sample to work with and import the data for that sample into R ####

# Give the sample name here that you want to work with.
# To see the available samples (choose a sample name from this list):
list.dirs(full.names = F, recursive = F)
## **USER INPUT**
sample <- "Cerebellum"


# Use Seurat's 'Read10X()' function to read in the full sample database. Cell Ranger creates 3 main database files that need to be combined into a single Seurat object.
# Note: these datasets can be very large and take several minutes to import into R. 
mat <- Read10X(data.dir = paste0(sample, "/outs/filtered_feature_bc_matrix"))

# Have a look at the top 10 rows and columns to see if the data has been imported correctly. You should see gene IDs as rows and barcodes (i.e. cells) as columns
as.matrix(mat[1:10, 1:10])

# Now convert this to a Seurat object. Again, this may take several minutes to load and use a lot of memory
mat2 <- CreateSeuratObject(counts = mat, project = sample)

# You can see a summary of the data by simply running the Seurat object name
mat2

# Set a colour palette that can contrast multiple clusters when you plot them. 
# You can change these colours as you like. 
# You can see what R colours are available here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
c25 <- c("dodgerblue2", "#E31A1C", "green4",  "#6A3D9A", "#FF7F00", "black", "gold1", "skyblue2", "#FB9A99", "palegreen2", "#CAB2D6", "#FDBF6F", "gray70", "khaki2", "maroon", "orchid1", "deeppink1", "blue1", "steelblue4", "darkturquoise", "green1", "yellow4", "yellow3", "darkorange4", "brown")

3b. Identify markers in cells

Now we're going to identify individual markers that were present (i.e. were expressed) in our dataset.

IMPORTANT: the gene symbols you provide in the following section have to exactly match the gene symbols in your dataset (including capitalisation).
Gene symbols are more like 'common names' and can vary between databases.
Your main gene identifiers are Ensembl IDs and we need to find the gene symbols that match these Ensembl IDs.
For example, the gene P2ry12 is also called ADPG-R, BDPLT8, HORK3 and various other IDs, depending on the database it's listed in.
In the Ensemble database it's listed as P2ry12 (not P2RY12, remember, case matters) and matches Ensembl ID ENSMUSG00000036353.
For this reason it's advisable to first search the Ensembl website for your markers of interest and for your organism, to ensure you are providing gene symbols that match the Ensembl IDs.
https://asia.ensembl.org/Mus_musculus/Info/Index

Code Block

#### 3b. Identify markers in cells ####

# Create a vector called 'markers' that contains each of the markers you want to examine. 
# These should be gene symbols. Replace the gene symbols below with your target markers.
## **USER INPUT**
markers <- c("P2ry12", "Tmem119", "Itgam")

# You can see if the markers you provided are present:
sum(row.names(mat) %in% markers)
# If you input 3 markers and the output from the above code = 3, then all are present. 
# If the result is 2 then 2 of the 3 markers you provided are found in your data, etc.

## USER INPUT
# You can see if an individual marker is present like so (substitute for a marker of choice):
sum(row.names(mat) == "P2ry12")
# Outputs 1 if the marker is present, 0 if it isn't

# Pull out just the read counts for your defined markers
y <- mat[row.names(mat) %in% markers, ]

# Now we can count the number of cells containing zero transcripts for each of the examined markers. 
# This enables an examination of the number of cells that have zero expression for these markers,
# and therefore the number of cells that can be considered non-target cells.

# First count all cells
# Then make a loop to cycle through all markers (defined in previously created 'markers' vector)
a <- length(colnames(y))
for (i in 1:length(markers)) {
  
  a <- c(a, sum(y[i,] == 0))
  
}
# Do a sum of the columns
y2 <- colSums(y)
# See if any zeros. If so, these cells are not target cells (as determined by absence of any target cell markers)
count <- c(a, sum(y2 == 0))
# Name the vector elements
names(count) <- c("Total_cells", markers, "All_zero")
# Generate the table
as.data.frame(count)

# The generated table shows:
# 1) the total number of cells for your sample ("Total_cells" row), 
# 2) the number of cells which had 0 expression for each marker, and,
# 3) the number of cell that had zero expression for all of the markers you provided ("All_zero" row).

3c. Processing expression data (dimensionality reduction)

There are a variety of methods to visualise expression in single cell data. The most commonly used methods - PCA, t-SNE and UMAP - involve 'dimensionality', i.e. converting expression to x-n dimensions (which can then be plotted) based on gene expression per cell.

Seurat can generate and store PCA, t-SNE and UMAP data in the Seurat object we created previously ('mat2'), but first the raw data needs to be processed in a variety of ways:

Normalise the data by log transformation
Identify genes that exhibit high cell-to-cell variation
Scale the data so that highly expressed genes don't dominate the visual representation of expression
Perform the linear dimensional reduction that converts expression to dimensions
Plot the x-y dimension data (i.e. first 2 dimensions)

The first 4 steps are completed in the code below (this may take a few minutes to run)

Code Block

#### 3c. Processing expression data (dimensionality reduction) ####

# Normalise data
mat3 <- NormalizeData(mat2)

# Identification of variable features
mat3 <- FindVariableFeatures(mat3, selection.method = "vst", nfeatures = nrow(mat3))

# Scaling the data
all.genes <- rownames(mat3)
mat3 <- ScaleData(mat3, features = all.genes)

# Perform linear dimensional reduction (PCA. This may take several minutes to run)
mat3 <- RunPCA(mat3, features = VariableFeatures(object = mat3))

...

TRUE)
lapply(bioconductor_packages, require, character.only = TRUE)

3. Analysis using the seurat package

Seurat is:

A toolkit for quality control, analysis, and exploration of single cell RNA sequencing data. 'Seurat' aims to enable users to identify and interpret sources of heterogeneity from single cell transcriptomic measurements, and to integrate diverse types of single cell data.

In this section 3 we’ll be using a variety of Seurat functions to analyse our scRNA-Seq data.

3a. Select a sample to work with and import the data into R

################################################################

#TO DO: Make sure the folder structure provided follows the Cell ranger output format.

Code Block
├── filtered_feature_bc_matrix │ ├── barcodes.tsv.gz │ ├── features.tsv.gz │ └── matrix.mtx.gz

Clarify how the files will be output from the upstream Nextflow pipeline and update text accordingly to guide to the output folder.

Code Block
mat <- Read10X(data.dir = "H:/nfcore-scrnaseq_R_analysis/test_data/Choroid/filtered_feature_bc_matrix"

################################################################

In this section we’ll choose one of our samples to analyse, then import that sample’s scRNA-Seq dataset into R. You can re-run this, and following sections, for each sample dataset that you have. Just replace the sample name in sample <- "xxxxxx" below.

Code Block

#### 3a. Choose a sample to work with and import the data for that sample into R ####

# Give the sample name here that you want to work with.
# To see the available samples (choose a sample name from this list):
list.dirs(full.names = F, recursive = F)
## **USER INPUT**
sample <- "Cerebellum"


# Use Seurat's 'Read10X()' function to read in the full sample database. Cell Ranger creates 3 main database files that need to be combined into a single Seurat object.
# Note: these datasets can be very large and take several minutes to import into R. 
mat <- Read10X(data.dir = paste0(sample, "/outs/filtered_feature_bc_matrix"))

# Have a look at the top 10 rows and columns to see if the data has been imported correctly. You should see gene IDs as rows and barcodes (i.e. cells) as columns
as.matrix(mat[1:10, 1:10])

# Now convert this to a Seurat object. Again, this may take several minutes to load and use a lot of memory
mat2 <- CreateSeuratObject(counts = mat, project = sample)

# You can see a summary of the data by simply running the Seurat object name
mat2

# Set a colour palette that can contrast multiple clusters when you plot them. 
# You can change these colours as you like. 
# You can see what R colours are available here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
c25 <- c("dodgerblue2", "#E31A1C", "green4",  "#6A3D9A", "#FF7F00", "black", "gold1", "skyblue2", "#FB9A99", "palegreen2", "#CAB2D6", "#FDBF6F", "gray70", "khaki2", "maroon", "orchid1", "deeppink1", "blue1", "steelblue4", "darkturquoise", "green1", "yellow4", "yellow3", "darkorange4", "brown")

3b. Identify markers in cells

Now we're going to identify individual markers that were present (i.e. were expressed) in our dataset.

IMPORTANT: the gene symbols you provide in the following section have to exactly match the gene symbols in your dataset (including capitalisation).
Gene symbols are more like 'common names' and can vary between databases.
Your main gene identifiers are Ensembl IDs and we need to find the gene symbols that match these Ensembl IDs.
For example, the gene P2ry12 is also called ADPG-R, BDPLT8, HORK3 and various other IDs, depending on the database it's listed in.
In the Ensemble database it's listed as P2ry12 (not P2RY12, remember, case matters) and matches Ensembl ID ENSMUSG00000036353.
For this reason it's advisable to first search the Ensembl website for your markers of interest and for your organism, to ensure you are providing gene symbols that match the Ensembl IDs.
https://asia.ensembl.org/Mus_musculus/Info/Index

Code Block

#### 3b. Identify markers in cells ####

# Create a vector called 'markers' that contains each of the markers you want to examine. 
# These should be gene symbols. Replace the gene symbols below with your target markers.
## **USER INPUT**
markers <- c("P2ry12", "Tmem119", "Itgam")

# You can see if the markers you provided are present:
sum(row.names(mat) %in% markers)
# If you input 3 markers and the output from the above code = 3, then all are present. 
# If the result is 2 then 2 of the 3 markers you provided are found in your data, etc.

## USER INPUT
# You can see if an individual marker is present like so (substitute for a marker of choice):
sum(row.names(mat) == "P2ry12")
# Outputs 1 if the marker is present, 0 if it isn't

# Pull out just the read counts for your defined markers
y <- mat[row.names(mat) %in% markers, ]

# Now we can count the number of cells containing zero transcripts for each of the examined markers. 
# This enables an examination of the number of cells that have zero expression for these markers,
# and therefore the number of cells that can be considered non-target cells.

# First count all cells
# Then make a loop to cycle through all markers (defined in previously created 'markers' vector)
a <- length(colnames(y))
for (i in 1:length(markers)) {
  
  a <- c(a, sum(y[i,] == 0))
  
}
# Do a sum of the columns
y2 <- colSums(y)
# See if any zeros. If so, these cells are not target cells (as determined by absence of any target cell markers)
count <- c(a, sum(y2 == 0))
# Name the vector elements
names(count) <- c("Total_cells", markers, "All_zero")
# Generate the table
as.data.frame(count)

# The generated table shows:
# 1) the total number of cells for your sample ("Total_cells" row), 
# 2) the number of cells which had 0 expression for each marker, and,
# 3) the number of cell that had zero expression for all of the markers you provided ("All_zero" row).

3c. Processing expression data (dimensionality reduction)

There are a variety of methods to visualise expression in single cell data. The most commonly used methods - PCA, t-SNE and UMAP - involve 'dimensionality', i.e. converting expression to x-n dimensions (which can then be plotted) based on gene expression per cell.

Seurat can generate and store PCA, t-SNE and UMAP data in the Seurat object we created previously ('mat2'), but first the raw data needs to be processed in a variety of ways:

Normalise the data by log transformation
Identify genes that exhibit high cell-to-cell variation
Scale the data so that highly expressed genes don't dominate the visual representation of expression
Perform the linear dimensional reduction that converts expression to dimensions
Plot the x-y dimension data (i.e. first 2 dimensions)

The first 4 steps are completed in the code below (this may take a few minutes to run)

Code Block

#### 3c. Processing expression data (dimensionality reduction) ####

# Normalise data
mat3 <- NormalizeData(mat2)

# Identification of variable features
mat3 <- FindVariableFeatures(mat3, selection.method = "vst", nfeatures = nrow(mat3))

# Scaling the data
all.genes <- rownames(mat3)
mat3 <- ScaleData(mat3, features = all.genes)

# Perform linear dimensional reduction (PCA. This may take several minutes to run)
mat3 <- RunPCA(mat3, features = VariableFeatures(object = mat3))

3d. Plot of highly variable genes

################################################################

Question for Paul:

The following command returns a warning.

Code Block
p <- LabelPoints(plot = p, points = top_genes, repel = TRUE) + theme_bw() + theme(text = element_text(size = 17)) p

Code Block

Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session

################################################################

Using the FindVariableFeatures results, we can visualise the most highly variable genes, including a count of variable and non variable genes in your dataset. The below code outputs the top 10 genes, but you can adjust this number as desired (i.e. in top_genes <- head(VariableFeatures(mat3), 10) change 10 to another number).

...

Code Block

#### 3d. Plot of highly variable genes ####

# Identify the 10 most highly variable genes
top_genes <- head(VariableFeatures(mat3), 10)

# plot variable features with labels
p <- VariableFeaturePlot(mat3, pt.size = 2, cols = c("black", "firebrick"))
p <- LabelPoints(plot = p, points = top_genes, repel = TRUE) +
theme_bw() +
theme(text = element_text(size = 17))
p

# You can save your plot as a 300dpi (i.e. publication quality) tiff or pdf file. 
# These files can be found in your working directory.
# You can adjust the width and height of the saved images by changing width = and height = in the code below.

# Export as a 300dpi tiff
tiff_exp <- paste0(sample, "_top_genes.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

# Export as a pdf
pdf_exp <- paste0(sample, "_top_genes.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

# Export as a pdf
pdf_exp <- paste0(sample, "_top_genes.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

...

3e. PCA, UMAP and t-SNE plots (plotting dimensionality reduction data)

################################################################

Question for Paul:

one warning comes up when running RunUMAP:

Code Block

Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session

################################################################

In the section 3c we ran dimensionality reduction based on Principal Component Analysis (PCA).

...

Code Block

#### 3e. PCA, UMAP and t-SNE plots ####

# First we need to also run UMAP and tSNE dimensionality reduction and add these results to our main Seurat object.
# As with PCA, these may take several minutes to run
mat3 <- RunUMAP(mat3, dims = 1:3, verbose = F)
mat3 <- RunTSNE(mat3, dims = 1:3, verbose = F)

# Now we can plot all 3 results. You can modify the code to alter the appearance of these plots as described in section 2h.


# PCA plot

p <- DimPlot(mat3, reduction = "pca", pt.size = 2, label = TRUE, label.size = 6, label.color = "white", cols = c("firebrick")) +
theme_bw() +
theme(legend.position="none", axis.title=element_text(size=16), axis.text=element_text(size=14))
p

# Export as a TIFF and PDF
tiff_exp <- paste0(sample, "_PCA_pre_filtration.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")
pdf_exp <- paste0(sample, "_PCA_pre_filtration.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")


# UMAP

p <- DimPlot(mat3, reduction = "umap", pt.size = 2, label = TRUE, label.size = 8, label.color = "black", cols = c("firebrick")) +
theme_bw() +
theme(legend.position="none", axis.title=element_text(size=16), axis.text=element_text(size=14))
p

tiff_exp <- paste0(sample, "_umap_pre_filtration.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")
pdf_exp <- paste0(sample, "_umap_pre_filtration.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")


# t-SNE

p <- DimPlot(mat3, reduction = "tsne", pt.size = 2, label = TRUE, label.size = 8, label.color = "black", cols = c("firebrick")) +
theme_bw() +
theme(legend.position="none", axis.title=element_text(size=16), axis.text=element_text(size=14))
p

tiff_exp <- paste0(sample, "_tsne_pre_filtration.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")
pdf_exp <- paste0(sample, "_tsne_pre_filtration.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

...

 = "cm")

3f. Remove low quality or outlier cells

################################################################

Question for Paul:

Code Block
Warning: Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.

################################################################

From the Seurat website:

Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. A few QC metrics commonly used by the community include

...

Code Block

#### 5d.Differential expression: one cluster vs another cluster ####

# In this section you'll select to clusters to compare. Any upregulated genes (postive log fold change) are more highly expressed in the second cluster.

# First, select your baseline cluster:
## **USER INPUT**
declust <- 0

# Then, the cluster you want to compare it to:
## **USER INPUT**
declust2 <- 1

# Then run the Seurat DE function.
DE_genes <- FindMarkers(data, ident.1 = declust, ident.2 = declust2, logfc.threshold = 0.2)

# As in the previous section, you can run `tail(DE_genes)` to see if you need to reduce `logfc.threshold = 0.2` so as to capture all **significantly** DE genes.
tail(DE_genes)

# And again you can filter by p value and log fold change (i.e. change `p_val_adj` and `lfc_threshold` to suitable numbers), count the number of sig DE genes and view the top few.
DE_genes_sig <- DE_genes[DE_genes$p_val_adj < 0.05, ]
lfc_threshold <- 0.
DE_genes_sig_lfc <- DE_genes_sig[DE_genes_sig$avg_log2FC < -lfc_threshold | DE_genes_sig$avg_log2FC > lfc_threshold, ]
nrow(DE_genes_sig_lfc)
head(DE_genes_sig_lfc)

# Export the table of DE genes as a csv file:
write.csv(DE_genes_sig_lfc, paste0("DE_1vs1_clust", declust, "_vs_clust", declust2, "_", sample, ".csv"))

## Plotting ##

# Pull out the top 20 genes to plot (or change to the range or names of genes you want to plot, as in the previous section.)
plotgenes <- rownames(DE_genes_sig)[1:20]

# Scatter plot
p <- FeaturePlot `logfc.threshold = 0.2` so as to capture all **significantly** DE genes.
tail(DE_genes)

# And again you can filter by p value and log fold change (i.e. change `p_val_adj` and `lfc_threshold` to suitable numbers), count the number of sig DE genes and view the top few.
DE_genes_sig <- DE_genes[DE_genes$p_val_adj < 0.05, ]
lfc_threshold <- 0.
DE_genes_sig_lfc <- DE_genes_sig[DE_genes_sig$avg_log2FC < -lfc_threshold | DE_genes_sig$avg_log2FC > lfc_threshold, ]
nrow(DE_genes_sig_lfc)
head(DE_genes_sig_lfc)

# Export the table of DE genes as a csv file:
write.csv(DE_genes_sig_lfc, paste0("DE_1vs1_clust", declust, "_vs_clust", declust2, "_", sample, ".csv"))

## Plotting ##

# Pull out the top 20 genes to plot (or change to the range or names of genes you want to plot, as in the previous section.)
plotgenes <- rownames(DE_genes_sig)[1:20]

# Scatter plot
p <- FeaturePlot(data, features = plotgenes, cols = c("red", "lightgrey"), reduction = "pca", pt.size = 1)
p

# Export as a pdf and tiff
tiff_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")
pdf_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")


# Dot plot

p <- DotPlot(data, features = plotgenes, dot.scale = 8, cols = c("red", "lightgrey"), reduction = "pca", pt.)) + RotatedAxis() +
ylab("Cluster") + xlab("Genes") +
theme(text = element_text(size = 118))
p

# Export as a pdf and tiff
tiff_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes_dotplot.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")
pdf_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes_dotplot.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")


# DotViolin plotplots

p <- DotPlotVlnPlot(data, features = plotgenes, dot.scale = 8, cols = c("red", "lightgrey")) + RotatedAxis() +
ylab("Cluster") + xlab("Genes") +
theme(text = element_text(c25, pt.size = 180))
p

# Export as a pdf and tiff
tiff_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes_dotplotviolin.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")
pdf_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes_dotplotviolin.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

# ViolinHeatmap
plots

p <- VlnPlotDoHeatmap(data, features = plotgenes, cols = c25, pt. raster = T) + 
scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4")) + 
theme(text = element_text(size = 016))
p

# Export as a pdf and tiff
tiff_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes_violinheatmap.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")
pdf_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes_heatmap.pdf")
ggsave(file = pdf_exp, device = "tiffpdf", plot = p, width = 20, height = 20, units = "cm")
pdf_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes_violin.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

# Heatmap
p <- DoHeatmap(data, features = plotgenes, raster = T) + 
scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4")) + 
theme(text = element_text(size = 16))
p

# Export as a pdf and tiff
tiff_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes_heatmap.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")
pdf_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes_heatmap.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

# You can go back to the start of this section and change declust <- and declust2 <- to compare different clusters.

# You can go back to the start of this section and change declust <- and declust2 <- to compare different clusters.

5e. Differential expression: every cluster vs all other cells

Code Block

#### 5e. Differential expression: every cluster vs all other cells ####

# In section 5c we compared a chosen cluster to all other clusters/cells. In this section we can automate this somewhat and do this for every cluster at the same time.

# Unlike the other DE section, you don't need to choose any clusters to compare, just run the Seurat function:
DE_genes <- FindAllMarkers(data, logfc.threshold = 0.2)

# As with the previous sections, use `tail()` to see if you've captured all sig DE genes:
tail(DE_genes[order(DE_genes$p_val_adj, decreasing = F),])

# Now filter by adjusted p and lfc (enter your own parameters), then re-order by cluster:
DE_genes_sig <- DE_genes[DE_genes$p_val_adj < 0.05, ]
DE_genes_sig <- DE_genes_sig[order(DE_genes_sig$cluster),]
lfc_threshold <- 0.3
DE_genes_sig_lfc <- DE_genes_sig[DE_genes_sig$avg_log2FC < -lfc_threshold | DE_genes_sig$avg_log2FC > lfc_threshold, ]

# You can see how many DE genes per cluster were found:
summary(DE_genes_sig_lfc$cluster)

# Then export this table of DE genes per cluster as a csv file:
write.csv(DE_genes_sig, paste0("DE_eachclustVSall_", sample, ".csv"))

Version	Old Version 54	New Version Current
Changes made by	Paul Whatmore	Marie-Emilie Gauthier
Saved on	Feb 22, 2024	Apr 19, 2024

Versions Compared

Key

Aim

Requirements

Aim

Requirements

2a. Open RStudio and create a new R script

2b. Set your working directory

2c. Installing packages

2a. Open RStudio and create a new R script

2b. Set your working directory

2c. Installing packages

2d. Loading packages

2d. Loading packages

3. Analysis using the seurat package

3a. Select a sample to work with and import the data into R

3b. Identify markers in cells

3c. Processing expression data (dimensionality reduction)

3. Analysis using the seurat package

3a. Select a sample to work with and import the data into R

3b. Identify markers in cells

3c. Processing expression data (dimensionality reduction)

3d. Plot of highly variable genes

3e. PCA, UMAP and t-SNE plots (plotting dimensionality reduction data)

3f. Remove low quality or outlier cells

5e. Differential expression: every cluster vs all other cells

Content Comparison

Versions Compared

Key

Aim

Requirements

Aim

Requirements

2a. Open RStudio and create a new R script

2b. Set your working directory

2c. Installing packages

2a. Open RStudio and create a new R script

2b. Set your working directory

2c. Installing packages

2d. Loading packages

2d. Loading packages

3. Analysis using the seurat package

3a. Select a sample to work with and import the data into R

3b. Identify markers in cells

3c. Processing expression data (dimensionality reduction)

3. Analysis using the seurat package

3a. Select a sample to work with and import the data into R

3b. Identify markers in cells

3c. Processing expression data (dimensionality reduction)

3d. Plot of highly variable genes

3e. PCA, UMAP and t-SNE plots (plotting dimensionality reduction data)

3f. Remove low quality or outlier cells

5e. Differential expression: every cluster vs all other cells