Aim
Analyse 10x Genomics* single cell RNA-Seq data.
This analysis workflow is split into 2 main sections: 1) ‘Upstream’ analysis on QUT’s HPC (high performance compute cluster) using a Nextflow workflow, nfcore/scrnaseq, 2) ‘Downstream’ analysis in R, primarily using the package seurat.
*Note: this workflow can be adapted to work with scRNA-Seq datasets generated by other sequencing technologies than 10x Genomics.
Requirements
A HPC account. If you do not have one, please request one here.
Access to an rVDI virtual desktop machine with 64GB RAM. For information about rVDI and requesting a virtual machine, see here.
Nextflow installed on your HPC home account. If you haven’t already installed Nextflow, do so by following the guide here.
Your scRNA-Seq data (fastq files) are on the HPC. If you are having difficulties transferring them to the HPC, submit a support ticket here.
Connect to an rVDI virtual desktop machine
Ensure you already have access to a 64GB RAM rVDI virtual machine, or request access by following the guide here.
scRNA-Seq datasets are often very large, requiring a lot of memory to run. Downstream analysis is run in R, on a Windows machine. Your PC is unlikely to have enough RAM, thus we’re using virtual machines with 64GB RAM. In addition, you can run the rest of this analysis in the virtual machine.
To access and run your rVDI virtual desktop:
Go to https://rvdi.qut.edu.au/
Click on ‘VMware Horizon HTML Access’
Log on with your QUT username and password
*NOTE: you need to be connected to the QUT network first, either being on campus or connecting remotely via VPN.
Set up your rVDI environment
Now that you’ve connected to an rVDI virtual machine, you’ll need to set it up to:
Connect to your home directory on the HPC, so you can access your data files
Install R and RStudio for running the Seurat analysis
Connect to your HPC home directory
Using the Windows File Explorer, we can map our HPC Home folder and the shared Work folder to drive letters. Here we will map our home drive to ‘H' and our shared work directory to ‘W’.
Open File Explorer (folder icon in the Windows task bar).
Right click “This PC” and choose Map Network Drive.
Home drive: select ‘H' as the drive letter, then copy and paste \\hpc-fs\home into the Folder box. Click 'Finish’.
Work drive: In file Explorer, again right click “This PC” and choose Map Network Drive. select ‘W' as the drive letter, then copy and paste \\hpc-fs\work into the Folder box. Click 'Finish’.
To see this demonstrated watch this video:
https://mediahub.qut.edu.au/media/t/0_ylaejs40
Now you’ll be able to browse and copy files between your virtual Windows machine and the HPC.
Installing R and RStudio
We’ll be analysing our data in an R environment, so first we need to install R and RStudio on the rVDI machine. You will be copying and pasting script from this workflow into RStudio.
Download and install R, following the default prompts:
Download R-4.3.2 for Windows. The R-project for statistical computing.
Download and install RStudio, following the default prompts:
https://posit.co/download/rstudio-desktop/
1. nfcore/scrnaseq
10x scRNA-Seq data is typically processed using various Cell Ranger software tools. These (and other) tools have been combined in an nfcore Nextflow workflow called scrnaseq.
NOTE: sometimes your 10x data has already been processed by your sequencing company, using Cell Ranger. In this case you can skip the nfcore/scrnaseq analysis and go straight to the downstream Seurat analysis.
1a. Workflow overview
As can be seen in the workflow below, there are several workflow options. The one we’ll be using is the complete Cell Ranger workflow, using the tools cellranger mkgtf and cellranger mkref for reference genome preparation and cellranger count for both aligning sequences to the reference genome and quantifying expression per gene per cell, for each sample.
1b. Creating a samplesheet
To run, nfcore/scrnaseq requires: 1) Your data files in gzipped fastq format (*fastq.gz), 2) A samplesheet that lists the sample names and the fastq files associated with each sample.
See this page for the required structure and content of the samplesheet:
https://nf-co.re/scrnaseq/2.4.1/docs/usage
Because sample names are specific to a project, and typically a single samples is associated with multiple fastq files, you will need to manually create this samplesheet. You can create it in Excel, then save it as a comma-separated file called ‘samplesheet.csv’, then copy this up to the HPC. Or you can manually create it on the command-line in the HPC, using a text editor such as nano.
Note: in the samplesheet, you must provide the full path for the fastq files. This is not shown in the nfcore/scrnaseq usage guide.
You can find the full path by typing pwd
in the command-line, while in the directory containing your fastq files.
For example, if your fastq files are in /home/username/mydata
and you have a fastq file called Liver_S2_L001_I1_001.fastq.gz
then the full path for that fastq file would be: /home/username/mydata/Liver_S2_L001_R1.fastq.gz
An example datasheet, with 2 samples (Liver and Kidneys), where each sample has 2 fastq files associated with it, might look something like this:
sample, | fastq_1, | fastq_2, |
---|---|---|
Liver, | /home/username/mydata/Liver_L001_R1.fastq.gz, | /home/username/mydata/Liver_L001_R1.fastq.gz, |
Liver, | /home/username/mydata/Liver_L002_R1.fastq.gz, | /home/username/mydata/Liver_L002_R1.fastq.gz, |
Kidney, | /home/username/mydata/Kidney_L001_R1.fastq.gz, | /home/username/mydata/Kidney_L001_R1.fastq.gz, |
Kidney, | /home/username/mydata/Kidney_L001_R1.fastq.gz, | /home/username/mydata/Kidney_L001_R1.fastq.gz, |
1c. Running nfcore/scrnaseq as a PBS script
2. Downstream analysis in R
We’ll be using various R packages to analyse our 10x single cell data. In this section we’ll create an R script in RStudio and install the required R packages.
2a. Open RStudio and create a new R script
RStudio is a GUI (graphical user interface) for R. It makes navigating R easier.
Open RStudio (you can type it in the Windows search bar)
Create a new R script: ‘File’ → “New File” → “R script”
Save this script where your samples folders are (‘File’ → ‘Save’). These should be on your H or W drive. Save the script file as ‘scrnaseq.R’
In the following sections you will be copying and running the R code into your scrnaseq.R script.
Cell Ranger (and nfcore/scrnaseq) generates a default folder and file output structure. There will be a main folder that contains all the sample subfolders (NOTE: this is where you must save your R script). Each sample folder will have an ‘outs’ subfolder. This ‘outs’ folder contains a ‘filtered_feature_bc_matrix’ folder, which contains the files that Seurat uses in its analysis.
If you have already created this script in a previous session, you can just re-open it by browsing to the location (somewhere in your mapped H drive, probably) and double clicking on it.
2b. Set your working directory
In R, your working directory is the location from which your data files are read in to R and any output files are deposited. For our purposes we need to set the working directory to the location on the HPC where your scRNASeq dataset is.
Most of the scripts can be run without modification, but there are a few lines that you will need to change, such as the working directory (which will differ for each researcher’s dataset).
When you see **USER INPUT** in the script, this means you have to modify the line below this.
You can manually set your working directory in RStudio by selecting ‘Session' -> 'Set working directory' -> 'Choose directory'. Choose the same directory as you saved your scrnaseq.R script, previous section. This will output the setwd(...)
command with your working directory into the console window (bottom left panel). Copy this command to replace the default setwd(...)
line in your R script.
Copy and paste the following code into the R script you just created, then run the code (highlight all the code in your R script, then press the run button).
#### 2b. Set your working directory #### # Change the below to the directory that contains your sample folders (you may have to browse H or W drive to find this) # **USER INPUT** setwd("H:/sam_dando/dataset1/count") # You can see the sample subdirectories by: list.dirs(full.names = F, recursive = F) # You should see directories that are names after your samples. # If you don't see this, browse through your H or W drives to find the correct path for your sample directories.
2c. Installing packages
This will install all the required packages and dependencies and may take 30 minutes or more to complete. It may prompt you occasionally to update packages - select 'a' for all if/when this occurs.
#### 2c. Installing required packages #### # This section only needs to be run once on a computer. # One the packages are installed, they need to be loaded every time they will be used (next section) # Create vector of required package names bioconductor_packages <- c("clusterProfiler", "pathview", "AnnotationHub", "org.Mm.eg.db") cran_packages <- c("Seurat", "patchwork", "ggplot2", "tidyverse", "viridis", "plyr", "readxl", "scales") # Compares installed packages to above packages and returns a vector of missing packages new_packages <- bioconductor_packages[!(bioconductor_packages %in% installed.packages()[,"Package"])] new_cran_packages <- cran_packages[!(cran_packages %in% installed.packages()[,"Package"])] # Install missing bioconductor packages if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(new_packages) # Install missing cran packages if (length(new_cran_packages)) install.packages(new_cran_packages, repos = "http://cran.us.r-project.org") # Update all installed packages to the latest version update.packages(bioconductor_packages, ask = FALSE) update.packages(cran_packages, ask = FALSE, repos = "http://cran.us.r-project.org")
2d. Loading packages
#### 2d. Loading required packages #### # This section needs to be run every time # Load packages bioconductor_packages <- c("clusterProfiler", "pathview", "AnnotationHub", "org.Mm.eg.db") cran_packages <- c("Seurat", "patchwork", "ggplot2", "tidyverse", "viridis", "plyr", "readxl", "scales") lapply(cran_packages, require, character.only = TRUE) lapply(bioconductor_packages, require, character.only = TRUE)
3. Analysis using the seurat package
Seurat is:
A toolkit for quality control, analysis, and exploration of single cell RNA sequencing data. 'Seurat' aims to enable users to identify and interpret sources of heterogeneity from single cell transcriptomic measurements, and to integrate diverse types of single cell data.
In this section 3 we’ll be using a variety of Seurat functions to analyse our scRNA-Seq data.
3a. Select a sample to work with and import the data into R
################################################################
#TO DO: Make sure the folder structure provided follows the Cell ranger output format.
├── filtered_feature_bc_matrix │ ├── barcodes.tsv.gz │ ├── features.tsv.gz │ └── matrix.mtx.gz
Clarify how the files will be output from the upstream Nextflow pipeline and update text accordingly to guide to the output folder.
mat <- Read10X(data.dir = "H:/nfcore-scrnaseq_R_analysis/test_data/Choroid/filtered_feature_bc_matrix"
################################################################
In this section we’ll choose one of our samples to analyse, then import that sample’s scRNA-Seq dataset into R. You can re-run this, and following sections, for each sample dataset that you have. Just replace the sample name in sample <- "xxxxxx"
below.
#### 3a. Choose a sample to work with and import the data for that sample into R #### # Give the sample name here that you want to work with. # To see the available samples (choose a sample name from this list): list.dirs(full.names = F, recursive = F) ## **USER INPUT** sample <- "Cerebellum" # Use Seurat's 'Read10X()' function to read in the full sample database. Cell Ranger creates 3 main database files that need to be combined into a single Seurat object. # Note: these datasets can be very large and take several minutes to import into R. mat <- Read10X(data.dir = paste0(sample, "/outs/filtered_feature_bc_matrix")) # Have a look at the top 10 rows and columns to see if the data has been imported correctly. You should see gene IDs as rows and barcodes (i.e. cells) as columns as.matrix(mat[1:10, 1:10]) # Now convert this to a Seurat object. Again, this may take several minutes to load and use a lot of memory mat2 <- CreateSeuratObject(counts = mat, project = sample) # You can see a summary of the data by simply running the Seurat object name mat2 # Set a colour palette that can contrast multiple clusters when you plot them. # You can change these colours as you like. # You can see what R colours are available here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf c25 <- c("dodgerblue2", "#E31A1C", "green4", "#6A3D9A", "#FF7F00", "black", "gold1", "skyblue2", "#FB9A99", "palegreen2", "#CAB2D6", "#FDBF6F", "gray70", "khaki2", "maroon", "orchid1", "deeppink1", "blue1", "steelblue4", "darkturquoise", "green1", "yellow4", "yellow3", "darkorange4", "brown")
3b. Identify markers in cells
Now we're going to identify individual markers that were present (i.e. were expressed) in our dataset.
IMPORTANT: the gene symbols you provide in the following section have to exactly match the gene symbols in your dataset (including capitalisation).
Gene symbols are more like 'common names' and can vary between databases.
Your main gene identifiers are Ensembl IDs and we need to find the gene symbols that match these Ensembl IDs.
For example, the gene P2ry12 is also called ADPG-R, BDPLT8, HORK3 and various other IDs, depending on the database it's listed in.
In the Ensemble database it's listed as P2ry12 (not P2RY12, remember, case matters) and matches Ensembl ID ENSMUSG00000036353.
For this reason it's advisable to first search the Ensembl website for your markers of interest and for your organism, to ensure you are providing gene symbols that match the Ensembl IDs.
https://asia.ensembl.org/Mus_musculus/Info/Index
#### 3b. Identify markers in cells #### # Create a vector called 'markers' that contains each of the markers you want to examine. # These should be gene symbols. Replace the gene symbols below with your target markers. ## **USER INPUT** markers <- c("P2ry12", "Tmem119", "Itgam") # You can see if the markers you provided are present: sum(row.names(mat) %in% markers) # If you input 3 markers and the output from the above code = 3, then all are present. # If the result is 2 then 2 of the 3 markers you provided are found in your data, etc. ## USER INPUT # You can see if an individual marker is present like so (substitute for a marker of choice): sum(row.names(mat) == "P2ry12") # Outputs 1 if the marker is present, 0 if it isn't # Pull out just the read counts for your defined markers y <- mat[row.names(mat) %in% markers, ] # Now we can count the number of cells containing zero transcripts for each of the examined markers. # This enables an examination of the number of cells that have zero expression for these markers, # and therefore the number of cells that can be considered non-target cells. # First count all cells # Then make a loop to cycle through all markers (defined in previously created 'markers' vector) a <- length(colnames(y)) for (i in 1:length(markers)) { a <- c(a, sum(y[i,] == 0)) } # Do a sum of the columns y2 <- colSums(y) # See if any zeros. If so, these cells are not target cells (as determined by absence of any target cell markers) count <- c(a, sum(y2 == 0)) # Name the vector elements names(count) <- c("Total_cells", markers, "All_zero") # Generate the table as.data.frame(count) # The generated table shows: # 1) the total number of cells for your sample ("Total_cells" row), # 2) the number of cells which had 0 expression for each marker, and, # 3) the number of cell that had zero expression for all of the markers you provided ("All_zero" row).
3c. Processing expression data (dimensionality reduction)
There are a variety of methods to visualise expression in single cell data. The most commonly used methods - PCA, t-SNE and UMAP - involve 'dimensionality', i.e. converting expression to x-n dimensions (which can then be plotted) based on gene expression per cell.
Seurat can generate and store PCA, t-SNE and UMAP data in the Seurat object we created previously ('mat2'), but first the raw data needs to be processed in a variety of ways:
Normalise the data by log transformation
Identify genes that exhibit high cell-to-cell variation
Scale the data so that highly expressed genes don't dominate the visual representation of expression
Perform the linear dimensional reduction that converts expression to dimensions
Plot the x-y dimension data (i.e. first 2 dimensions)
The first 4 steps are completed in the code below (this may take a few minutes to run)
#### 3c. Processing expression data (dimensionality reduction) #### # Normalise data mat3 <- NormalizeData(mat2) # Identification of variable features mat3 <- FindVariableFeatures(mat3, selection.method = "vst", nfeatures = nrow(mat3)) # Scaling the data all.genes <- rownames(mat3) mat3 <- ScaleData(mat3, features = all.genes) # Perform linear dimensional reduction (PCA. This may take several minutes to run) mat3 <- RunPCA(mat3, features = VariableFeatures(object = mat3))
3d. Plot of highly variable genes
################################################################
Question for Paul:
The following command returns a warning.
p <- LabelPoints(plot = p, points = top_genes, repel = TRUE) + theme_bw() + theme(text = element_text(size = 17)) p
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation' This message will be shown once per session
################################################################
Using the FindVariableFeatures
results, we can visualise the most highly variable genes, including a count of variable and non variable genes in your dataset. The below code outputs the top 10 genes, but you can adjust this number as desired (i.e. in top_genes <- head(VariableFeatures(mat3), 10)
change 10
to another number).
NOTE In the below plot you can change a number of parameters to modify the plot to look how you like. This can be done for any of the plots in these notebooks. In the plot below you can change:
Dot size: pt.size = 2
. Increase or decrease the number to increase or decrease dot size.
Dot colours: cols = c("black", "firebrick"))
. Change the colours to whatever you like. A list of R colour names is here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
Theme: theme_bw()
. There are several default plot themes you can choose from, that change a variety of plot parameters. See here: https://ggplot2.tidyverse.org/reference/ggtheme.html
Axis text size: theme(text = element_text(size = 17))
. There are a large number of parameters that can be modified with theme()
. Here we've just changed the axis text to size 17. See here for other parameters that can be changed with theme()
: https://ggplot2.tidyverse.org/reference/theme.html
Seurat plots are based on the ggplot package. There are a multitude of other modifications you can make to a ggplot, too many to describe in this notebook. But there are plenty of online guides on how to modify ggplot plots. Here's an example: http://www.sthda.com/english/wiki/be-awesome-in-ggplot2-a-practical-guide-to-be-highly-effective-r-software-and-data-visualization
#### 3d. Plot of highly variable genes #### # Identify the 10 most highly variable genes top_genes <- head(VariableFeatures(mat3), 10) # plot variable features with labels p <- VariableFeaturePlot(mat3, pt.size = 2, cols = c("black", "firebrick")) p <- LabelPoints(plot = p, points = top_genes, repel = TRUE) + theme_bw() + theme(text = element_text(size = 17)) p # You can save your plot as a 300dpi (i.e. publication quality) tiff or pdf file. # These files can be found in your working directory. # You can adjust the width and height of the saved images by changing width = and height = in the code below. # Export as a 300dpi tiff tiff_exp <- paste0(sample, "_top_genes.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") # Export as a pdf pdf_exp <- paste0(sample, "_top_genes.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")
3e. PCA, UMAP and t-SNE plots (plotting dimensionality reduction data)
################################################################
Question for Paul:
one warning comes up when running RunUMAP
:
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation' This message will be shown once per session
################################################################
In the section 3c we ran dimensionality reduction based on Principal Component Analysis (PCA).
There are other dimensionality reduction methods that are more commonly presented in singe cell papers, particularly Uniform Manifold Approximation and Projection (UMAP and t-distributed stochastic neighbour embedding (t-SNE) dimensionality reduction, so here we will generate and visualise dimensionality reduction from all 3 methods (PCA, UMAP, t-SNE).
#### 3e. PCA, UMAP and t-SNE plots #### # First we need to also run UMAP and tSNE dimensionality reduction and add these results to our main Seurat object. # As with PCA, these may take several minutes to run mat3 <- RunUMAP(mat3, dims = 1:3, verbose = F) mat3 <- RunTSNE(mat3, dims = 1:3, verbose = F) # Now we can plot all 3 results. You can modify the code to alter the appearance of these plots as described in section 2h. # PCA plot p <- DimPlot(mat3, reduction = "pca", pt.size = 2, label = TRUE, label.size = 6, label.color = "white", cols = c("firebrick")) + theme_bw() + theme(legend.position="none", axis.title=element_text(size=16), axis.text=element_text(size=14)) p # Export as a TIFF and PDF tiff_exp <- paste0(sample, "_PCA_pre_filtration.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_PCA_pre_filtration.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # UMAP p <- DimPlot(mat3, reduction = "umap", pt.size = 2, label = TRUE, label.size = 8, label.color = "black", cols = c("firebrick")) + theme_bw() + theme(legend.position="none", axis.title=element_text(size=16), axis.text=element_text(size=14)) p tiff_exp <- paste0(sample, "_umap_pre_filtration.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_umap_pre_filtration.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # t-SNE p <- DimPlot(mat3, reduction = "tsne", pt.size = 2, label = TRUE, label.size = 8, label.color = "black", cols = c("firebrick")) + theme_bw() + theme(legend.position="none", axis.title=element_text(size=16), axis.text=element_text(size=14)) p tiff_exp <- paste0(sample, "_tsne_pre_filtration.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_tsne_pre_filtration.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")
3f. Remove low quality or outlier cells
From the Seurat website:
Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. A few QC metrics commonly used by the community include
- Low-quality cells or empty droplets will often have very few genes
- Cell doublets or multiplets may exhibit an aberrantly high gene count
So in this section we can filter out cells that have very low and very high gene counts.
First you’ll need to visualise the spread of genes and reads using a violin plot and/or scatter plot.
You can then choose to filter out the top and bottom 'outliers' based on these violin and scatter plots, by entering a maximum and minimum nFeature_RNA
below.
This max/min number can vary greatly depending on the sequencing depth of your samples and other factors. Use the violin plot to guide your decision.
For example, if your violin plot looked like the below figure, a reasonable nFeature_RNA
maximum would be 4000 and a minimum of 200. This would remove outliers.
#### 3f. Remove low quality or outlier cells #### # Generate a violin plot. VlnPlot(mat2, features = c("nFeature_RNA", "nCount_RNA"), ncol = 3) # And a scatter plot FeatureScatter(mat2, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") # Use these plot to choose and enter a minimum (nFeature_RNA > xxx) and maxiumum (nFeature_RNA < xxxx) below mat3_filt <- subset(mat3, subset = nFeature_RNA > 200 & nFeature_RNA < 4000)
3g. Visualise gene expression by marker
In the section 3e (plotting dimensionality reduction data) we visualised total expression, i.e. all genes within each cell.
In this section we will visualise expression for specific markers within each cell, using the same dimensionality reduction data (PCA, UMAP and t-SNE) that we generated in the previous section.
This has a variety of uses: to identify patterns of differential expression between cells for specific markers, identify 'non-target' cells - i.e. expression of markers that are known to be not expressed in target cells, marker-based heterogeneity of expression, etc.
In section 2f you selected a set of markers. To confirm which markers they are:
markers
If you wish to plot a different set of markers, you can do so by changing the set of markers in the markers <- c(..)
code and re-running that line. You can choose 1 marker, or as many as you like. Be aware though that there will be a plot generated for every marker provided.
Now generate the plots. You can change the colours in the plots (cols = c("red", "lightgrey")
and the point size (pt.size = 1
). Note that the default colours show the lowest expression in red. This is so you can more easily see which cells don't express the diagnostic markers.
In addition to the dimensionality reduction plots, you can visualise expression for your selected markers with a heatmap (note: this requires at least 2 markers, preferably several, to be visually meaningful). You can change the colour range as you like by providing a high, centre and low colour (scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4"))
)
#### 3g. Visualise gene expression by marker #### # PCA p <- FeaturePlot(mat3_filt, features = markers, reduction = "pca", cols = c("red", "lightgrey"), pt.size = 1) p # Export as a 300dpi tiff tiff_exp <- paste0(sample, "_pca_markers.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") # Export as a pdf pdf_exp <- paste0(sample, "_pca_markers.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # UMAP p <- FeaturePlot(mat3, features = markers, reduction = "umap", cols = c("red", "lightgrey"), pt.size = 1) p tiff_exp <- paste0(sample, "_umap_markers.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_umap_markers.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # t-SNE p <- FeaturePlot(mat3, features = markers, reduction = "tsne", cols = c("red", "lightgrey"), pt.size = 1) p tiff_exp <- paste0(sample, "_tsne_markers.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_tsne_markers.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Heatmap p <- DoHeatmap(mat3, features = markers, group.bar = FALSE) + scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4")) + theme(text = element_text(size = 16)) p tiff_exp <- paste0(sample, "_hmap_markers.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_hmap_markers.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Clustered heatmap # You can also cluster the cells and then plot a heatmap based on this cluster data, to see if expression of your chosen markers is related to clusters # First, generate the clusters for your sample # Choose the `resolution =` score carefully (between 0 and 1). A lower score means fewer clusters. # You can adjust this higher or lower to see how it affects your clusters. mat3 <- FindNeighbors(mat3, dims = 1:10) mat3 <- FindClusters(mat3, resolution = 0.5) # Then generate the heatmap p <- DoHeatmap(mat3, features = markers, raster = T) + scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4")) + theme(text = element_text(size = 16)) + labs(color = "Dose (mg)") p tiff_exp <- paste0(sample, "_hmap_markers_clustered.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_hmap_markers_clustered.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")
3h. Remove non-target cells
In this section we will remove any 'non-target' cells from our dataset. 'Non-target' cells are defined as those with 0 reads (i.e. 0 expression) for our marker(s) of choice.
If you don't want to remove any cells based on expression of specific markers, skip this section
#### 3h. Remove non-target cells #### # View your chosen markers zm <- mat[row.names(mat) %in% markers, ] as.matrix(zm) # Then we can again see a count of cells that have 0 reads (thus 0 expression) for each marker. # The 'All markers' row indicates the number of cells that have zero expression in any of the provided markers. # These are the cells that will be filtered out from your dataset a <- length(colnames(zm)) for (i in 1:length(markers)) {a <- c(a, sum(zm[i,] == 0))} a <- c(a, length(colnames(zm)) - sum(apply(as.matrix(zm) == 0, 2, sum) == 0)) names(a) <- c("Total cells", markers, "All markers") as.data.frame(a) # Enter the markers that you want to use in the filtration. marker_rem <- c("P2ry12", "Tmem119") # Remove cells from main Seurat object that have zero expression for **any** of these markers. zm <- mat[row.names(mat) %in% marker_rem, ] # This line does a sum of every column and then outputs column where this = 0 (if any cell contains reads, this will at least = 1). zm_1 <- as.matrix(zm)[, apply(as.matrix(zm) == 0, 2, sum) == 0] # Then we can filter the Seurat object to contain just these cell (i.e. barcode) IDs mat3_filt <- subset(mat3_filt, cells = colnames(zm_1)) mat3_filt
3i. Clustering by gene expression
This section examines clustering by gene expression similarity for each sample. PCA, t-SNE and UMAP plots are used to visualize the gene expression patterns and clusters.
This section is split into three subsections - “1) Choosing the correct resolution”, where you’ll use the clustree tool to identify the optimal number of clusters, and “2) Calculate the clusters” where you’ll use the clustree results to generate optimised clusters, and “3) Plot the results“, where you’ll generate 'before and after' plots, to visualise how filtration (inc. removing the non-target cells) changed the data structure.
You can use this section to examine if your cell filtration had a meaningful effect on your data structure. If it didn't, you may want to choose a different set of markers or filtering parameters to filter with.
1) Choosing the correct resolution
A cluster represents a unique group of cells, based on gene expression patterns. But what consitutes 'unique'? When you calculate the clustering (using Seurat's FindClusters()
function), it's important to use the correct resolution
score to generate accurate, biologically meaningful clusters. Using a lower resolution
score will generate fewer clusters (but you risk combining two clusters that should be distinct), a higher score will generate more clusters (but you risk falsely splitting a biologically relevant cluster of cells). Every single cell dataset is different (cell population similarity, sequencing depth, etc) and as such the optimal resolution
score needs to be chosen for each dataset.
The package clustree generates a tree based on multiple resolution
scores, which can help you in picking the optimal score.
Read the clustree manual to understand how to interpret the generated tree: https://cran.r-project.org/web/packages/clustree/vignettes/clustree.html
#### 3i(1). Choosing the correct resolution #### # Install and load the clustree package: install.packages("clustree") library(clustree) # Then generate a range of clusters, from 0 to 1, at 0.1 increments ("resolution = seq(0, 1, 0.1)"). mat3_clust <- FindNeighbors(mat3_filt, dims = 1:10) mat3_clust <- FindClusters(mat3_clust, resolution = seq(0, 1, 0.1), verbose = F) # Convert the results into a Seurat object, which can be used as input into "clustree()" clus_seurat <- CreateSeuratObject(counts = mat3_clust@assays$RNA$counts, meta.data = mat3_clust[[]]) clus_seurat[['TSNE']] <- CreateDimReducObject(embeddings = Embeddings(object = mat3_clust, reduction = "pca"), key = "tSNE_") # Generate the tree. # Refer to the clustree manual for tips on how to use this tree to choose the optimal `resolution` # https://cran.r-project.org/web/packages/clustree/vignettes/clustree.html clustree(clus_seurat, prefix = "RNA_snn_res.") + scale_color_manual(values=c25) + scale_edge_color_continuous(low = "blue", high = "red")
2) Calculate the clusters
In this subsection we’ll re-cluster the data based on the clustree results.
Choose the the resolution = xx
score in FindClusters()
to the optimal resolution that we can see in the clustree plot (e.g. 0.5, or 0.6, etc.).
#### 3i(2). Calculate the clusters #### # First you need to re-run the variable feature calculation, scaling, dimensionality reduction (PCA, t-SNE and UMAP). # This may take several minutes to run. mat3_filt <- FindVariableFeatures(mat3_filt, selection.method = "vst", nfeatures = nrow(mat3_filt)) all.genes <- rownames(mat3_filt) mat3_filt <- ScaleData(mat3_filt, features = all.genes) mat3_filt <- RunPCA(mat3_filt, dims = 1:3, verbose = F) mat3_filt <- RunUMAP(mat3_filt, dims = 1:3, verbose = F) mat3_filt <- RunTSNE(mat3_filt, dims = 1:3, verbose = F) # Then you generate a 'nearest neighbor' graph by calculating the neighborhood overlap (Jaccard index) # between every cell and identify clusters of cells based on shared nearest neighbor (SNN). mat3_filt <- FindNeighbors(mat3_filt, dims = 1:10) ## **USER INPUT** Enter a resolution based off the clustree plot mat3_filt <- FindClusters(mat3_filt, resolution = 0.6) # You can see how many cells there are per cluster like so: cellcount <- as.data.frame(table(mat3_filt@meta.data[4])) names(cellcount) <- c("Cluster", "Cell_count") cellcount
3) Plot the results
Here we’ll plot ‘before and after’ filtration plots, to examine how filtration affected the data structure.
We’ll plot all clusters, for PCA, t-SNE and UMAP, and you can also choose an individual cluster to plot for any of these dimensionality reduction plot types.
#### 3i(3). Plot the results #### ## PCA Plot ## # Before filtration: p <- DimPlot(mat3, reduction = "pca", pt.size = 2, cols = c25) + theme_bw() + theme(legend.title=element_text(size=14), legend.text = element_text(size = 14), axis.title=element_text(size=18), axis.text=element_text(size=14)) + labs(color="Cluster") p # After filtration: p <- DimPlot(mat3_filt, reduction = "pca", pt.size = 2, cols = c25) + theme_bw() + theme(legend.title=element_text(size=14), legend.text = element_text(size = 14), axis.title=element_text(size=18), axis.text=element_text(size=14)) + labs(color="Cluster") p # Export as pdf and tiff tiff_exp <- paste0(sample, "_pca_filtered.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_pca_filtered.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") ## PCA: Plot one individual cluster ## # Here you can visualise a single cluster by colouring it a specific colour. # First define the colours, based on the cluster information in your seurat data. # You can change the background colours ("gray70") and the highlighted cluster colour ("red") to whatever you like. newcols <- rep("gray70", length(levels(mat3_filt))-1) newcols <- c(newcols, "red") # Enter which cluster you want to visualise in the `myclust` variable below. # This is based on the cluster numbers in the previous plot. ## **USER INPUT** myclust <- 3 # Plot the cluster, placing this cluster on top (the `order` parameter). p <- DimPlot(mat3_filt, reduction = "pca", pt.size = 2, cols = newcols, order = myclust) + theme_bw() + theme(legend.title=element_text(size=14), legend.text = element_text(size = 14), axis.title=element_text(size=18), axis.text=element_text(size=14)) + labs(color="Cluster") p # Export as a pdf and tiff tiff_exp <- paste0(sample, "_clust_", myclust, "_pca_filtered.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_clust_", myclust, "_pca_filtered.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") ## UMAP ## # Before filtration: p <- DimPlot(mat3, reduction = "umap", pt.size = 2, cols = c25) + theme_bw() + theme(legend.title=element_text(size=14), legend.text = element_text(size = 14), axis.title=element_text(size=16), axis.text=element_text(size=14)) + labs(color="Cluster") p # After filtration: p <- DimPlot(mat3_filt, reduction = "umap", pt.size = 2, cols = c25) + theme_bw() + theme(legend.title=element_text(size=14), legend.text = element_text(size = 14), axis.title=element_text(size=18), axis.text=element_text(size=14)) + labs(color="Cluster") p # Export as a pdf and tiff tiff_exp <- paste0(sample, "_umap_filtered.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_umap_filtered.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") ## UMAP: Plot individual clusters ## # Select your colours newcols <- rep("gray70", length(levels(mat3_filt))-1) newcols <- c(newcols, "red") # Select your cluster and then plot it ## **USER INPUT** myclust <- 3 p <- DimPlot(mat3_filt, reduction = "umap", pt.size = 2, cols = newcols, order = myclust) + theme_bw() + theme(legend.title=element_text(size=14), legend.text = element_text(size = 14), axis.title=element_text(size=18), axis.text=element_text(size=14)) + labs(color="Cluster") p # Export as a pdf and tiff tiff_exp <- paste0(sample, "_clust_", myclust, "_umap_filtered.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_clust_", myclust, "_umap_filtered.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") ## t_SNE ## # Before filtration: p <- DimPlot(mat3, reduction = "tsne", pt.size = 2, cols = c25) + theme_bw() + theme(legend.title=element_text(size=14), legend.text = element_text(size = 14), axis.title=element_text(size=18), axis.text=element_text(size=14)) + labs(color="Cluster") p # After filtration: p <- DimPlot(mat3_filt, reduction = "tsne", pt.size = 2, cols = c25) + theme_bw() + theme(legend.title=element_text(size=14), legend.text = element_text(size = 14), axis.title=element_text(size=18), axis.text=element_text(size=14)) + labs(color="Cluster") p # Export as a pdf and tiff tiff_exp <- paste0(sample, "_tsne_filtered.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_tsne_filtered.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") ## t_SNE: Plot individual clusters ## # Select your colours newcols <- rep("gray70", length(levels(mat3_filt))-1) newcols <- c(newcols, "red") # Select your cluster and then plot it. myclust <- 3 p <- DimPlot(mat3_filt, reduction = "tsne", pt.size = 2, cols = newcols, order = myclust) + theme_bw() + theme(legend.title=element_text(size=14), legend.text = element_text(size = 14), axis.title=element_text(size=18), axis.text=element_text(size=14)) + labs(color="Cluster") p # Export as a pdf and tiff tiff_exp <- paste0(sample, "_clust_", myclust, "_tsne_filtered.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_clust_", myclust, "_tsne_filtered.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") ## Clusters - Heatmap of selected markers ## # Recall which markers you previously defined: markers # Before filtration: p <- DoHeatmap(mat3, features = markers, raster = T) + scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4")) + theme(text = element_text(size = 16)) + labs(color = "Dose (mg)") p # After filtration: p <- DoHeatmap(mat3_filt, features = markers, raster = T) + scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4")) + theme(text = element_text(size = 16)) + labs(color = "Dose (mg)") p # Export as a pdf and tiff tiff_exp <- paste0(sample, "_hmap_filtered.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_hmap_filtered.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")
3j. Plot marker expression before and after filtration
Then we can see how the expression of specific markers changed before and after filtration.
#### 3j. Plot marker expression before and after filtration #### # Select your set of markers (you could re-enter the same ones you used earlier, or use a different set): ## **USER INPUT** markers <- c("P2ry12", "Tmem119") # Select which type of plot you want to generate ("pca", "umap" or "tsne"): ## **USER INPUT** redplot <- "pca" ## Before filtration ## p <- FeaturePlot(mat3, features = markers, reduction = redplot, cols = c("lightgrey", "red"), pt.size = 1) p # Export as a pdf and tiff tiff_exp <- paste0(sample, "_", redplot, "_markers.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_", redplot, "_markers.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") ## After filtration ## p <- FeaturePlot(mat3_filt, features = markers, reduction = redplot, cols = c("lightgrey", "red"), pt.size = 1) p # Export as a pdf and tiff tiff_exp <- paste0(sample, "_", redplot, "_markers_filtered.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_", redplot, "_markers_filtered.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")
3k. Output filtered results
Now will export the filtered dataset (non-target cells removed, other filtration applied), for analysis in the next sections of this analysis workflow.
You need to run this entire '3. Analysis using the seurat package' section once for every sample you have, because the other sections use the output created here for each sample.
This will create a file called <samplename>_seurat_filtered.rds
in your working directory. This contains all the data in your mat3_filt
object, including raw counts, filtered counts, gene and sample IDs, dimensionality reduction data, etc.
#### 3k. Output filtered results #### # Output the filtered Seurat object as a file # This file will be imported in the other sections of this workflow # This is a large amount of data, so may take a few minutes saveRDS(mat3_filt, file = paste0(sample, "_seurat_filtered.rds"))
4. Aggregate clustering
This section of the report involves combining (aggregating) two or more sample datasets. This allows a direct comparison of these samples, rather than analysing them separately.
IMPORTANT: To run this section, you must have processed all the samples you wish to combine in the '3. Analysis using the seurat package' section. This means that outliers and non-target cells have already been identified and removed. Section 3 only has to run once for each sample, as it outputs a datafile for each sample that is imported into this section.
ALSO IMPORTANT: Aggregate datasets can be very large and utilise a lot of memory. Keep an eye on your memory usage (in the Environment tab in RStudio). If you use all available memory your rVDI machine may crash.
4a. Choose samples to aggregate and import data
Running your sample through section 3 created a datafile called <samplename>_seurat_filtered.rds
, so if your sample was called 'liver', the file would be liver_seurat_filtered.rds
.
These datafiles will be in your working directories.
#### 4a. Choose samples to aggregate and import data #### # See which samples you've run through section 3 # This will list every <samplename>_seurat_filtered.rds data file that you've generated dir(pattern = "seurat_filtered.rds") # Enter the sample names you wish to aggregate from the list above (just the names, without the '_seurat_filtered.rds'). ## **USER INPUT** samples <- c("Cerebellum", "Cortex") # Give your aggregate samples a name (for example, "high_low"). This is for naming plots and such. sample <- "Cerebellum_Cortex" # Now import the data files for those samples. samplist <- list() for (i in 1:length(samples)){ assign(samples[i], readRDS(paste0(samples[i], "_seurat_filtered.rds"))) samplist[[i]] <- get(samples[i]) } # Then merge these datasets (note that this tags your cells with your sample names) data <- merge(samplist[[1]], y = samplist[[2:length(samplist)]], add.cell.ids = samples, project = sample) # See a summary of your data object (note that 'samples' means cells and 'features' means genes) data # This should contain the combined number of cells from all your samples. # You can see the number of cells in each sample by typing one of your combined sample names below. ## **USER INPUT** Cerebellum
4b. Processing expression data (dimensionality reduction)
This is essentially a repeat of dimensionality reduction for individual samples in section 3, so see that section for details. The following steps are required to generate dimensionality reduction analysis (PCA, T-SNE and UMAP) for the aggregate data.
NOTE: this regenerates clustering based on the combined data. Any plots and analysis in this section are therefore based on this, rather than the individual sample clustering. If you wish to compare clustering per sample, you'll need to use the results in section 3, where you analysed each sample separately.
#### 4b. Processing expression data (dimensionality reduction) #### # Normalise data data1 <- NormalizeData(data) # Identification of variable features data1 <- FindVariableFeatures(data1, selection.method = "vst", nfeatures = nrow(data1)) # Scaling the data all.genes <- rownames(data1) data1 <- ScaleData(data1, features = all.genes) # Perform linear dimensional reduction (PCA) data1 <- RunPCA(data1, features = VariableFeatures(object = data1)) # The above normalises and scales the data, finds variable features and does the initial dimensionality reduction. # Now we can run t-SNE and UMAP reduction. data1 <- RunUMAP(data1, dims = 1:3, verbose = F) data1 <- RunTSNE(data1, dims = 1:3, verbose = F) # As in section 3, we can view the top variable genes, but this time for the combined samples. # See section 3 for details on modifying the following plot: # Identify the 10 most highly variable genes top_genes <- head(VariableFeatures(data1), 10) # plot variable features with labels p <- VariableFeaturePlot(data1, pt.size = 2, cols = c("black", "firebrick")) p <- LabelPoints(plot = p, points = top_genes, repel = TRUE) + theme_bw() + theme(text = element_text(size = 17)) p # Export as a pdf and tiff tiff_exp <- paste0(sample, "_top_genes.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_top_genes.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Now we can visualise your expression data using all 3 dimensionality reduction methods (next section)
4c. PCA, UMAP and t-SNE plots (plotting dimensionality reduction data)
#### 4c. PCA, UMAP and t-SNE plots (plotting dimensionality reduction data) #### # PCA plot p <- DimPlot(data1, reduction = "pca", group.by = 'orig.ident', pt.size = 2, cols = alpha(c25[1:length(samples)],0.5)) + theme_bw() + theme(axis.title=element_text(size=16), axis.text=element_text(size=14)) p # The above generates a combined plot. # Alternatively, you can separate the samples into 'facets' in a single plot, by running the following: # NOTE: run either the combined or the split plot. Exporting the plot only exports the last plot you generated. p <- DimPlot(data1, reduction = "pca", split.by = 'orig.ident', group.by = 'orig.ident', pt.size = 2, cols = alpha(c25[1:length(samples)],0.5)) + theme_bw() + theme(axis.title=element_text(size=16), axis.text=element_text(size=14)) p # Export as a pdf and tiff tiff_exp <- paste0(sample, "_PCA_pre_filtration.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_PCA_pre_filtration.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # UMAP plot, samples combined p <- DimPlot(data1, reduction = "umap", group.by = 'orig.ident', pt.size = 2, cols = alpha(c25[1:length(samples)], 0.5)) + theme_bw() + theme(legend.position="none", axis.title=element_text(size=16), axis.text=element_text(size=14)) p # Or as faceted plot (as above, with PCA) p <- DimPlot(data1, reduction = "umap", split.by = 'orig.ident', group.by = 'orig.ident', pt.size = 2, cols = alpha(c25[1:length(samples)],0.5)) + theme_bw() + theme(axis.title=element_text(size=16), axis.text=element_text(size=14)) p # Export as a pdf and tiff tiff_exp <- paste0(sample, "_umap_pre_filtration.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_umap_pre_filtration.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # tSNE plot, samples combined p <- DimPlot(data1, reduction = "tsne", group.by = 'orig.ident', pt.size = 2, cols = alpha(c25[1:length(samples)], 0.5)) + theme_bw() + theme(legend.position="none", axis.title=element_text(size=16), axis.text=element_text(size=14)) p # Or as faceted plot p <- DimPlot(data1, reduction = "tsne", split.by = 'orig.ident', group.by = 'orig.ident', pt.size = 2, cols = alpha(c25[1:length(samples)],0.5)) + theme_bw() + theme(axis.title=element_text(size=16), axis.text=element_text(size=14)) p # Export as a pdf and tiff tiff_exp <- paste0(sample, "_tsne_pre_filtration.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_tsne_pre_filtration.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")
4d. Visualise gene expression by marker
This section allows you to produce a side-by-side visualisation of gene expression for your aggregated samples, for specific genes.
#### 4d Visualise gene expression by marker #### # Dimensionality plots # Generate PCA, UMAP or t-SNE plots for selected markers, with side-by-side plots for each sample. # Enter which markers you want to examine (can be one or many markers): markers <- c("P2ry12", "Tmem119", "Itgam") # Select which type of plot you want to generate ("pca", "umap" or "tsne"): ## **USER INPUT** redplot <- "pca" # Generate the plot p <- FeaturePlot(data1, features = markers, reduction = redplot, cols = c("lightgrey", "red"), split.by = 'orig.ident', pt.size = 1) p # Export as a pdf and tiff tiff_exp <- paste0(sample, "_", redplot, "_markers_filtered.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_", redplot, "_markers_filtered.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Violin plot # Here you can visualise the relative expression of markers on each cluster. # NOTE: This plot is specifically for comparing 2 samples. This plot may not be informative if you have more than 2 samples in your aggregte dataset # Generate the plot p <- VlnPlot(data1, features = markers, split.plot = TRUE, split.by = 'orig.ident', cols = c25[1:length(samples)]) # Export as a pdf and tiff tiff_exp <- paste0(sample, "_markers_violin.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_markers_violin.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Heatmap # You can also visualise differences in marker expression between samples with a heatmap # Generate the heatmap p <- DoHeatmap(data1, features = markers, raster = T, group.by = 'orig.ident') + scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4")) + theme(text = element_text(size = 16)) + labs(color = "Dose (mg)") p # Export as a pdf and tiff tiff_exp <- paste0(sample, "_hmap_markers.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "_hmap_markers.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")
4e. Differential expression between samples
In the next section (section 5) we examine differential expression between clusters in individual samples. In this section we examine differential expression between samples.
We’ll be using the Seurat function FindMarkers
to identify differentially expressed genes. This may require some modification (in particular the logfc.threshold
parameter), depending on how many DE genes in your dataset and how significant they are.
By default, we’ve set logfc.threshold = 0.2
. This only tests genes with at least 0.2 log fold difference in expression and speeds up the analysis considerably. But it could also remove some significant DE genes.
To test for this, look at the bottom 6 genes (ordered by p value) by running the tail(DE_genes)
command (in the script below). If these genes are all non-significant (i.e. p_val_adj > 0.05) then you have captured all significant genes. If all these genes are significant (p_val_adj < 0.05) then re-run FindMarkers()
with logfc.threshold = 0.1
. This will take much longer to run, but should then capture all DE genes (if not, reduce logfc.threshold = 0
, but will take a very long time to run).
#### 4e. Differential expression between samples #### # You will need to choose two of your samples to compare, based on your aggregated dataset. # To see which samples you've aggregated: samples # Now select the first sample (which will be the baseline sample): ## **USER INPUT** desamp1 <- "Cerebellum" # Then select the second (comparison) sample: ## **USER INPUT** desamp2 <- "Cortex" # Then run the Seurat differential expression function: DE_genes <- FindMarkers(data1, group.by = 'orig.ident', ident.1 = desamp1, ident.2 = desamp2, logfc.threshold = 0.2) # We can see how many 'differentially expressed' genes this produced by: nrow(DE_genes) # As mentioned in the descriptive text for this section, using 'logfc.threshold = 0.2' may not find all DE genes # Run the following: tail(DE_genes) # If all the listed genes are non-significant (i.e. p_val_adj < 0.05) then you have captured all genes # If not, change to 'logfc.threshold = 0.1' and re-run the 'DE_genes <- FindMarkers(..)' line # Run 'tail(DE_genes)' to again see if you've captured all DE genes # If you still have significant genes in your list, re-run with 'logfc.threshold = 0' # This will take a long time to run, but is guaranteed to capture all DE genes # You can also view the top 20 DE genes by: head(DE_genes, 20) # You can extract just the **significantly** differentially expressed genes (adjusted p < 0.05) like so: DE_genes_sig <- DE_genes[DE_genes$p_val_adj < 0.05, ] # See how many significantly DE genes there are: nrow(DE_genes_sig) # You can also extract genes based on log fold change as well. # Select a log fold change threshold here: ## **USER INPUT** lfc_threshold <- 0.3 # Then filter your data by this log fold change threshold. DE_genes_sig_lfc <- DE_genes_sig[DE_genes_sig$avg_log2FC < -lfc_threshold | DE_genes_sig$avg_log2FC > lfc_threshold, ] # Then see how many sig DE genes remain after lfc filtration: nrow(DE_genes_sig_lfc) # **NOTE** You can easily filter out all your results here by using too high a lfc threshold. # You can see the maximum log fold change for all DE genes like so: max(DE_genes$avg_log2FC) # And you can view the minimum as well. # Use this min/max log fold change information to decide on a suitable log fold change threshold. # This can vary greatly, depending on your data. min(DE_genes$avg_log2FC) # You can export the DE genes table to your working directory as a csv file that can be opened in Excel: write.csv(DE_genes_sig_lfc, paste0("DE_", desamp1, "_vs_", desamp2, ".csv")) ## Visualising DE results ## # There are a variety of ways to visualise your DE results. # Below are a few examples and more can be added to this workflow as needed. # Not all of these methods have the space to plot all DE genes, so we can provide them with a list of DE genes to plot: # The below pulls out the top 20 most significantly DE genes, by adjusted p value. # You can change this number '[1:20]' to whatever number of top genes you prefer # You can also plot specific genes by name (i.e., choose genes of interest from the table of DE genes), # e.g. plotgenes <- c("pDC", "Eryth", "Mk", "DC"). You can include as many genes as you like. ## **USER INPUT** plotgenes <- rownames(DE_genes_sig)[1:20] # Scatter plot of individual genes # You can select `reduction =` to be either `"pca"` for PCA plot, or `"umap"` or `"tsne"`. You can also change the colours or point sizes. p <- FeaturePlot(data1, features = plotgenes[1:10], cols = c("lightgrey", "red"), reduction = "pca", pt.size = 1) p # Export as a pdf and tiff tiff_exp <- paste0(desamp1, "_vs_", desamp2, "_DE_genes.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(desamp1, "_vs_", desamp2, "_DE_genes.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Dot plot p <- DotPlot(data1, features = plotgenes, group.by = 'orig.ident', dot.scale = 12, cols = c("lightgrey", "red")) + RotatedAxis() + ylab("Sample") + xlab("Genes") + theme(text = element_text(size = 18)) p # Export as a pdf and tiff tiff_exp <- paste0(desamp1, "_vs_", desamp2, "_DE_genes_dotplot.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(desamp1, "_vs_", desamp2, "_DE_genes_dotplot.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Violin plots of individual genes p <- VlnPlot(data1, features = plotgenes[1:10], split.plot = TRUE, split.by = 'orig.ident', cols = c25, pt.size = 0) p # Export as a pdf and tiff tiff_exp <- paste0(desamp1, "_vs_", desamp2, "_DE_genes_violin.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(desamp1, "_vs_", desamp2, "_DE_genes_violin.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Heatmap p <- DoHeatmap(data1, features = plotgenes, raster = T, group.by = 'orig.ident') + scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4")) + theme(text = element_text(size = 16)) p # Export as a pdf and tiff tiff_exp <- paste0(desamp1, "_vs_", desamp2, "_DE_genes_heatmap.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(desamp1, "_vs_", desamp2, "_DE_genes_heatmap.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")
5. Differential expression
This section examines significantly differentially expressed genes for each cluster, for a selected sample. Significantly DE genes are those with false-discovery adjusted p values of < 0.05.
IMPORTANT: To run this section, you must have processed your sample(s) by completing the '3.Analysis using the seurat package' section. That section removes outliers and non-target cells and identifies clusters. This differential expression analysis uses these filtered cells and cluster information. Section 3 only has to run once for each sample, as it outputs a datafile for each sample that is imported into this section.
5a. Select a sample to work with and import sample data
#### 5a. Select a sample to work with and import sample data #### # Find any sample datasets you generated by running section 3 dir(pattern = "seurat_filtered.rds") # Choose the sample name you wish to work with from the list above (just the name, without the '_seurat_filtered.rds'): ## **USER INPUT** sample <- "Cerebellum" # If you want to analyse a different sample, simply come back to this section, change the sample name, then re-run the following sections. # Now import the data file for that sample: data <- readRDS(paste0(sample, "_seurat_filtered.rds")) # See a summary of your data object: data
5b. Explanation of differential expression strategies
Bulk RNA-Seq usually examines differentially expressed genes between two treatments or tissues. scRNA-Seq is somewhat different, in that it is based on heterogeneity of gene expression within a group of cells, so differential expression can also be examined between clusters.
Given that there are usually multiple clusters, and differential expression analysis requires a comparison between two groups, there are different strategies used to combine the two group analysis with the multiple cluster analysis. Three strategies included in this workflow include:
Comparing a single cluster to all other clusters/cells (section 5c). This enables one to examine how gene expression in a single cluster differs from the entire dataset.
Comparing one cluster to another cluster (section 5d). This requires two clusters being initially chosen, then examining how gene expression differs between them. Note that to do every possible combination of cluster comparisons may require running this section many times. E.g. if your sample has 8 clusters, this represents 7+6+5+4+3+2+1 = you would have to run that section 28 times to compare every cluster to every other cluster.
Doing a batch comparison of each cluster to all other clusters/cells (section 5e). Seurat has a function (
FindAllMarkers()
) to accomplish this, but this can take some time to run, as multiple clusters are being compared at once, and it also produces a multilevel dataset. This is an alternative to running each cluster one at a time, as in section 5c.
IMPORTANT: the first cluster you choose in each analysis below represents the baseline or control group. Genes are either upregulated (i.e. positive log fold change) or downregulated (-ve lfc) in comparison to this baseline group. This means that any downregulated genes (i.e. negative log fold change) are more highly expressed in this baseline cluster, but any upregulated genes (+ lfc) are more highly expressed in the cluster(s) being compared to this baseline cluster.
5c. Differential expression: one cluster vs all other cells
In this section you'll select a single cluster, then compare gene expression in this cluster to all other combined clusters. Any upregulated genes (positive log fold change) are more highly expressed in the combined clusters.
#### 5c. Differential expression: one cluster vs all other cells #### # First, we should have a look at the clusters we have to work with: levels(x = data) # You'll need to select one of these clusters as the baseline cluster (`declust <-`). # You can re-run the below code with a different baseline cluster if you want to look at other comparisons: ## **USER INPUT** declust <- 0 # Then run the Seurat differential expression function: DE_genes <- FindMarkers(data, ident.1 = declust, logfc.threshold = 0.2) # NOTE: modify the 'logfc.threshold =' parameter so that you've captured all DE genes. # See section '4e. Differential expression between samples' for an explanation # Look at the bottom 6 genes, to see if you've captured all DE genes. If not, lower 'logfc.threshold =' and re-run tail(DE_genes) # We can see how many genes this found by: nrow(DE_genes) # View the top 20 DE genes by: head(DE_genes, 20) # You can extract just the **significantly** differentially expressed genes (adjusted p < 0.05) like so: DE_genes_sig <- DE_genes[DE_genes$p_val_adj < 0.05, ] # See how many significantly DE genes there are nrow(DE_genes_sig) # You can also extract genes based on log fold change as well. Enter a log fold change threshold here (e.g. lfc_threshold <- 1): # By default, this is zero, which will not apply any log fold change filtration lfc_threshold <- 0 # Then filter your data by this log fold change threshold. DE_genes_sig_lfc <- DE_genes_sig[DE_genes_sig$avg_log2FC < -lfc_threshold | DE_genes_sig$avg_log2FC > lfc_threshold, ] # See how many sig DE genes remain after lfc filtration: nrow(DE_genes_sig_lfc) # **NOTE** You can easily filter out all your results here by using too high a lfc threshold. # Export the DE genes table to your working directory as a csv file that can be opened in Excel: write.csv(DE_genes_sig_lfc, paste0("DE_1vsALL_clust_", declust, "_", sample, ".csv")) ## Visualising DE results ## # There are a variety of ways to visualise your DE results. # Below are a few examples and more can be added to this workflow as needed. # Not all of these methods have the space to plot all DE genes, so we can provide them with a list of DE genes to plot: # The below pulls out the top 20 most significantly DE genes, by adjusted p value. # You can change this number '[1:20]' to whatever number of top genes you prefer # You can also plot specific genes by name (i.e., choose genes of interest from the table of DE genes), # e.g. plotgenes <- c("pDC", "Eryth", "Mk", "DC"). You can include as many genes as you like. ## **USER INPUT** plotgenes <- rownames(DE_genes_sig)[1:20] # Scatter plot of individual genes # You can select `reduction =` to be either `"pca"` for PCA plot, or `"umap"` or `"tsne"`. You can also change the colours or point sizes. p <- FeaturePlot(data, features = plotgenes, cols = c("lightgrey", "red"), reduction = "pca", pt.size = 1) p # Export as a pdf and tiff tiff_exp <- paste0(sample, "clust_", declust, "_DE_genes.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "clust_", declust, "_DE_genes.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Dot plot p <- DotPlot(data, features = plotgenes, dot.scale = 8, cols = c("lightgrey", "red")) + RotatedAxis() + ylab("Cluster") + xlab("Genes") + theme(text = element_text(size = 18)) p # Export as a pdf and tiff tiff_exp <- paste0(sample, "clust_", declust, "_DE_genes_dotplot.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "clust_", declust, "_DE_genes_dotplot.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Violin plots of individual genes p <- VlnPlot(data, features = plotgenes, cols = c25, pt.size = 0) p # Export as a pdf and tiff tiff_exp <- paste0(sample, "clust_", declust, "_DE_genes_violin.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "clust_", declust, "_DE_genes_violin.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Heatmap p <- DoHeatmap(data, features = plotgenes, raster = T) + scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4")) + theme(text = element_text(size = 16)) p # Export as a pdf and tiff tiff_exp <- paste0(sample, "clust_", declust, "_DE_genes_heatmap.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "clust_", declust, "_DE_genes_heatmap.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # You can go back to the start of this section and change 'declust <- ' to another cluster to examine DE for that cluster.
5d. Differential expression: one cluster vs another cluster
#### 5d.Differential expression: one cluster vs another cluster #### # In this section you'll select to clusters to compare. Any upregulated genes (postive log fold change) are more highly expressed in the second cluster. # First, select your baseline cluster: ## **USER INPUT** declust <- 0 # Then, the cluster you want to compare it to: ## **USER INPUT** declust2 <- 1 # Then run the Seurat DE function. DE_genes <- FindMarkers(data, ident.1 = declust, ident.2 = declust2, logfc.threshold = 0.2) # As in the previous section, you can run `tail(DE_genes)` to see if you need to reduce `logfc.threshold = 0.2` so as to capture all **significantly** DE genes. tail(DE_genes) # And again you can filter by p value and log fold change (i.e. change `p_val_adj` and `lfc_threshold` to suitable numbers), count the number of sig DE genes and view the top few. DE_genes_sig <- DE_genes[DE_genes$p_val_adj < 0.05, ] lfc_threshold <- 0. DE_genes_sig_lfc <- DE_genes_sig[DE_genes_sig$avg_log2FC < -lfc_threshold | DE_genes_sig$avg_log2FC > lfc_threshold, ] nrow(DE_genes_sig_lfc) head(DE_genes_sig_lfc) # Export the table of DE genes as a csv file: write.csv(DE_genes_sig_lfc, paste0("DE_1vs1_clust", declust, "_vs_clust", declust2, "_", sample, ".csv")) ## Plotting ## # Pull out the top 20 genes to plot (or change to the range or names of genes you want to plot, as in the previous section.) plotgenes <- rownames(DE_genes_sig)[1:20] # Scatter plot p <- FeaturePlot(data, features = plotgenes, cols = c("red", "lightgrey"), reduction = "pca", pt.size = 1) p # Export as a pdf and tiff tiff_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Dot plot p <- DotPlot(data, features = plotgenes, dot.scale = 8, cols = c("red", "lightgrey")) + RotatedAxis() + ylab("Cluster") + xlab("Genes") + theme(text = element_text(size = 18)) p # Export as a pdf and tiff tiff_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes_dotplot.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes_dotplot.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Violin plots p <- VlnPlot(data, features = plotgenes, cols = c25, pt.size = 0) p # Export as a pdf and tiff tiff_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes_violin.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes_violin.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Heatmap p <- DoHeatmap(data, features = plotgenes, raster = T) + scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4")) + theme(text = element_text(size = 16)) p # Export as a pdf and tiff tiff_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes_heatmap.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") pdf_exp <- paste0(sample, "clust_", declust, "clust_", declust2, "_DE_genes_heatmap.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # You can go back to the start of this section and change declust <- and declust2 <- to compare different clusters.
5e. Differential expression: every cluster vs all other cells
#### 5e. Differential expression: every cluster vs all other cells #### # In section 5c we compared a chosen cluster to all other clusters/cells. In this section we can automate this somewhat and do this for every cluster at the same time. # Unlike the other DE section, you don't need to choose any clusters to compare, just run the Seurat function: DE_genes <- FindAllMarkers(data, logfc.threshold = 0.2) # As with the previous sections, use `tail()` to see if you've captured all sig DE genes: tail(DE_genes[order(DE_genes$p_val_adj, decreasing = F),]) # Now filter by adjusted p and lfc (enter your own parameters), then re-order by cluster: DE_genes_sig <- DE_genes[DE_genes$p_val_adj < 0.05, ] DE_genes_sig <- DE_genes_sig[order(DE_genes_sig$cluster),] lfc_threshold <- 0.3 DE_genes_sig_lfc <- DE_genes_sig[DE_genes_sig$avg_log2FC < -lfc_threshold | DE_genes_sig$avg_log2FC > lfc_threshold, ] # You can see how many DE genes per cluster were found: summary(DE_genes_sig_lfc$cluster) # Then export this table of DE genes per cluster as a csv file: write.csv(DE_genes_sig, paste0("DE_eachclustVSall_", sample, ".csv"))