Aim
Analyse 10x Genomics* single cell RNA-Seq data.
This analysis workflow is split into 2 main sections: 1) ‘Upstream’ analysis on QUT’s HPC (high performance computing cluster) using a Nextflow workflow, nfcore/scrnaseq; 2) ‘Downstream’ analysis in R, primarily using the Seurat package.
*Note: this workflow can be adapted to work with scRNA-Seq datasets generated by sequencing technologies other than 10x Genomics.
Requirements
An HPC account. If you do not have one, please request one here.
Access to an rVDI virtual desktop machine with 64GB RAM. For information about rVDI and requesting a virtual machine, see here.
Nextflow installed on your HPC home account. If you haven’t already installed Nextflow, do so by following the guide here.
Your scRNA-Seq data (fastq files) are on the HPC. If you are having difficulties transferring them to the HPC, submit a support ticket here.
Connect to an rVDI virtual desktop machine
Ensure you already have access to a 64GB RAM rVDI virtual machine, or request access by following the guide here.
scRNA-Seq datasets are often very large and require a lot of memory to analyse. The downstream analysis is run in R on a Windows machine, and your own PC is unlikely to have enough RAM, so we use virtual machines with 64GB RAM. You can also run the rest of this analysis from within the virtual machine.
To access and run your rVDI virtual desktop:
Go to https://rvdi.qut.edu.au/
Click on ‘VMware Horizon HTML Access’
Log on with your QUT username and password
*NOTE: you need to be connected to the QUT network first, either being on campus or connecting remotely via VPN.
Set up your rVDI environment
Now that you’ve connected to an rVDI virtual machine, you’ll need to set it up to:
Connect to your home directory on the HPC, so you can access your data files
Install R and RStudio for running the Seurat analysis
Connect to your HPC home directory
Using the Windows File Explorer, we can map our HPC home folder and the shared work folder to drive letters. Here we will map our home directory to ‘H’ and our shared work directory to ‘W’.
Open File Explorer (folder icon in the Windows task bar).
Right click “This PC” and choose Map Network Drive.
Home drive: select ‘H’ as the drive letter, then copy and paste \\hpc-fs\home into the Folder box. Click ‘Finish’.
Work drive: in File Explorer, again right click “This PC” and choose Map Network Drive. Select ‘W’ as the drive letter, then copy and paste \\hpc-fs\work into the Folder box. Click ‘Finish’.
To see this demonstrated watch this video:
https://mediahub.qut.edu.au/media/t/0_ylaejs40
Now you’ll be able to browse and copy files between your virtual Windows machine and the HPC.
Installing R and RStudio
Seurat is an R package, so first we need to install R and RStudio on the rVDI machine. You will be copying and pasting script from this workflow into RStudio.
Download and install R, following the default prompts:
Download R-4.3.2 for Windows from The R Project for Statistical Computing.
Download and install RStudio, following the default prompts:
https://posit.co/download/rstudio-desktop/
1. nfcore/scrnaseq
10x scRNA-Seq data is typically processed using various Cell Ranger software tools. These (and other) tools have been combined in an nfcore Nextflow workflow called scrnaseq.
NOTE: sometimes your 10x data has already been processed by your sequencing company, using Cell Ranger. In this case you can skip the nfcore/scrnaseq analysis and go straight to the downstream Seurat analysis.
1a. Workflow overview
As can be seen in the workflow diagram below, there are several analysis options. The one we’ll be using is the complete Cell Ranger workflow, using the tools cellranger mkgtf and cellranger mkref for reference genome preparation, and cellranger count for both aligning sequences to the reference genome and quantifying expression per gene per cell, for each sample.
1b. Creating a samplesheet
To run, nfcore/scrnaseq requires: 1) Your data files in gzipped fastq format (*fastq.gz), 2) A samplesheet that lists the sample names and the fastq files associated with each sample.
See this page for the required structure and content of the samplesheet:
https://nf-co.re/scrnaseq/2.4.1/docs/usage
Because sample names are specific to a project, and typically a single sample is associated with multiple fastq files, you will need to manually create this samplesheet. You can create it in Excel, save it as a comma-separated file called ‘samplesheet.csv’, then copy it up to the HPC. Alternatively, you can create it directly on the HPC command line using a text editor such as nano.
Note: in the samplesheet, you must provide the full path for the fastq files. This is not shown in the nfcore/scrnaseq usage guide.
You can find the full path by typing pwd on the command line while in the directory containing your fastq files. For example, if your fastq files are in /home/username/mydata and you have a fastq file called Liver_S2_L001_I1_001.fastq.gz, then the full path for that fastq file would be: /home/username/mydata/Liver_S2_L001_I1_001.fastq.gz
An example samplesheet, with 2 samples (Liver and Kidney), where each sample has 2 pairs of fastq files associated with it, might look something like this:

| sample | fastq_1 | fastq_2 |
|---|---|---|
| Liver | /home/username/mydata/Liver_L001_R1.fastq.gz | /home/username/mydata/Liver_L001_R2.fastq.gz |
| Liver | /home/username/mydata/Liver_L002_R1.fastq.gz | /home/username/mydata/Liver_L002_R2.fastq.gz |
| Kidney | /home/username/mydata/Kidney_L001_R1.fastq.gz | /home/username/mydata/Kidney_L001_R2.fastq.gz |
| Kidney | /home/username/mydata/Kidney_L002_R1.fastq.gz | /home/username/mydata/Kidney_L002_R2.fastq.gz |
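If you prefer to build the samplesheet on the HPC command line rather than in Excel, a heredoc is a quick alternative to nano. This is a minimal sketch: the sample names and fastq paths are placeholders, so substitute your own before running.

```shell
# Create samplesheet.csv directly on the HPC command line.
# Sample names and paths below are placeholders - replace with your own.
cat > samplesheet.csv << 'EOF'
sample,fastq_1,fastq_2
Liver,/home/username/mydata/Liver_L001_R1.fastq.gz,/home/username/mydata/Liver_L001_R2.fastq.gz
Liver,/home/username/mydata/Liver_L002_R1.fastq.gz,/home/username/mydata/Liver_L002_R2.fastq.gz
Kidney,/home/username/mydata/Kidney_L001_R1.fastq.gz,/home/username/mydata/Kidney_L001_R2.fastq.gz
Kidney,/home/username/mydata/Kidney_L002_R1.fastq.gz,/home/username/mydata/Kidney_L002_R2.fastq.gz
EOF

# Check the result
cat samplesheet.csv
```

The quoted 'EOF' prevents the shell from expanding anything inside the heredoc, so the paths are written exactly as typed.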
1c. Running nfcore/scrnaseq as a PBS script
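A PBS submission script for launching nfcore/scrnaseq might look like the sketch below. The job name, resource requests, pipeline version, profile and reference files are all assumptions, not fixed values: adjust them to your project and your HPC's queue limits before submitting.

```shell
#!/bin/bash -l
#PBS -N scrnaseq                    # job name (assumption - choose your own)
#PBS -l select=1:ncpus=2:mem=8gb    # resources for the Nextflow head job (assumption)
#PBS -l walltime=48:00:00           # maximum run time (assumption)

# Run from the directory the job was submitted from
# (this should contain your samplesheet.csv)
cd $PBS_O_WORKDIR

# Launch the pipeline. The version, profile and reference fasta/gtf shown
# here are placeholders - point these at the genome of your organism.
nextflow run nf-core/scrnaseq \
    -r 2.4.1 \
    -profile singularity \
    --aligner cellranger \
    --input samplesheet.csv \
    --fasta genome.fa \
    --gtf genes.gtf \
    --outdir results
```

Save this as, for example, launch_scrnaseq.pbs, submit it with qsub launch_scrnaseq.pbs, and monitor it with qstat -u $USER.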
2. Downstream analysis with Seurat
Seurat is:
A toolkit for quality control, analysis, and exploration of single cell RNA sequencing data. 'Seurat' aims to enable users to identify and interpret sources of heterogeneity from single cell transcriptomic measurements, and to integrate diverse types of single cell data.
It is also an R package, so we will be using RStudio (which you installed earlier) to run the analysis script.
2a. Open RStudio and create a new R script
RStudio is a GUI (graphical user interface) for R. It makes navigating R easier.
Open RStudio (you can type it in the Windows search bar)
Create a new R script: ‘File’ → ‘New File’ → ‘R Script’
Save this script in the directory where your sample folders are (‘File’ → ‘Save’). These should be on your H or W drive. Save the script file as ‘scrnaseq.R’
In the following sections you will be copying the R code into your scrnaseq.R script and running it from there.
Cell Ranger (and nfcore/scrnaseq) generates a default folder and file output structure. There will be a main folder that contains all the sample subfolders (NOTE: this is where you must save your R script). Each sample folder will have an ‘outs’ subfolder. This ‘outs’ folder contains a ‘filtered_feature_bc_matrix’ folder, which contains the files that Seurat uses in its analysis.
2b. Set your working directory
In R, your working directory is where R reads your data files from and where any output files are written. For our purposes we need to set the working directory to the location on the HPC where your scRNA-Seq dataset is.
Most of the scripts can be run without modification, but there are a few lines that you will need to change, such as the working directory (which will differ for each researcher’s dataset).
When you see **USER INPUT** in the script, this means you have to modify the line below it.
You can manually set your working directory in RStudio by selecting ‘Session’ -> ‘Set working directory’ -> ‘Choose directory’. Choose the same directory as you saved your scrnaseq.R script in the previous section. This will output the setwd(...) command with your working directory into the console window (bottom left panel). Copy this command to replace the default setwd(...) line in your R script.
Copy and paste the following code into the R script you just created, then run the code (highlight all the code in your R script, then press the run button).
```r
#### 2b. Set your working directory ####

# Change the below to the directory that contains your sample folders
# (you may have to browse H or W drive to find this)
# **USER INPUT**
setwd("H:/sam_dando/dataset1/count")

# You can see the sample subdirectories by:
list.dirs(full.names = F, recursive = F)

# You should see directories that are named after your samples.
# If you don't see this, browse through your H or W drives to find
# the correct path for your sample directories.
```
2c. Installing packages
This will install all the required packages and dependencies and may take 30 minutes or more to complete. It may prompt you occasionally to update packages - select 'a' for all if/when this occurs.
```r
#### 2c. Installing required packages ####

# This section only needs to be run once on a computer.
# Once the packages are installed, they need to be loaded
# every time they will be used (next section).

# Create vectors of required package names
bioconductor_packages <- c("clusterProfiler", "pathview", "AnnotationHub", "org.Mm.eg.db")
cran_packages <- c("Seurat", "patchwork", "ggplot2", "tidyverse", "viridis", "plyr", "readxl", "scales")

# Compare installed packages to the above and return a vector of missing packages
new_packages <- bioconductor_packages[!(bioconductor_packages %in% installed.packages()[,"Package"])]
new_cran_packages <- cran_packages[!(cran_packages %in% installed.packages()[,"Package"])]

# Install missing Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(new_packages)

# Install missing CRAN packages
if (length(new_cran_packages)) install.packages(new_cran_packages, repos = "http://cran.us.r-project.org")

# Update all installed packages to the latest version
update.packages(bioconductor_packages, ask = FALSE)
update.packages(cran_packages, ask = FALSE, repos = "http://cran.us.r-project.org")
```
2d. Loading packages
```r
#### 2d. Loading required packages ####

# This section needs to be run every time

# Load packages
bioconductor_packages <- c("clusterProfiler", "pathview", "AnnotationHub", "org.Mm.eg.db")
cran_packages <- c("Seurat", "patchwork", "ggplot2", "tidyverse", "viridis", "plyr", "readxl", "scales")
lapply(cran_packages, require, character.only = TRUE)
lapply(bioconductor_packages, require, character.only = TRUE)
```
2e. Select a sample to work with and import the data into R
```r
#### 2e. Choose a sample to work with and import the data for that sample into R ####

# Give the sample name here that you want to work with.
## **USER INPUT**
sample <- "Cerebellum"

# To see the available samples (choose a sample name from this list):
list.dirs(full.names = F, recursive = F)

# Use Seurat's 'Read10X()' function to read in the full sample dataset.
# Cell Ranger creates 3 main output files that need to be combined into a single Seurat object.
# Note: these datasets can be very large and take several minutes to import into R.
mat <- Read10X(data.dir = paste0(sample, "/outs/filtered_feature_bc_matrix"))

# Have a look at the top 10 rows and columns to see if the data has been imported correctly.
# You should see gene IDs as rows and barcodes (i.e. cells) as columns.
as.matrix(mat[1:10, 1:10])

# Now convert this to a Seurat object. Again, this may take several minutes and use a lot of memory.
mat2 <- CreateSeuratObject(counts = mat, project = sample)

# You can see a summary of the data by simply running the Seurat object name
mat2

# Set a colour palette that can contrast multiple clusters when you plot them.
# You can change these colours as you like.
# You can see what R colours are available here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
c25 <- c("dodgerblue2", "#E31A1C", "green4", "#6A3D9A", "#FF7F00",
         "black", "gold1", "skyblue2", "#FB9A99", "palegreen2",
         "#CAB2D6", "#FDBF6F", "gray70", "khaki2", "maroon",
         "orchid1", "deeppink1", "blue1", "steelblue4", "darkturquoise",
         "green1", "yellow4", "yellow3", "darkorange4", "brown")
```
2f. Identify markers in cells
Now we're going to identify individual markers that were present (i.e. were expressed) in our dataset.
IMPORTANT: the gene symbols you provide in the following section have to exactly match the gene symbols in your dataset (including capitalisation).
Gene symbols are more like 'common names' and can vary between databases.
Your main gene identifiers are Ensembl IDs and we need to find the gene symbols that match these Ensembl IDs.
For example, the gene P2ry12 is also called ADPG-R, BDPLT8, HORK3 and various other IDs, depending on the database it's listed in.
In the Ensembl database it's listed as P2ry12 (not P2RY12; remember, case matters) and matches Ensembl ID ENSMUSG00000036353.
For this reason it's advisable to first search the Ensembl website for your markers of interest and for your organism, to ensure you are providing gene symbols that match the Ensembl IDs.
https://asia.ensembl.org/Mus_musculus/Info/Index
```r
#### 2f. Identify markers in cells ####

# Create a vector called 'markers' that contains each of the markers you want to examine.
# These should be gene symbols. Replace the gene symbols below with your target markers.
## **USER INPUT**
markers <- c("P2ry12", "Tmem119", "Itgam")

# You can see if the markers you provided are present:
sum(row.names(mat) %in% markers)
# If you input 3 markers and the output from the above code = 3, then all are present.
# If the result is 2, then 2 of the 3 markers you provided are found in your data, etc.

## **USER INPUT**
# You can see if an individual marker is present like so (substitute for a marker of choice):
sum(row.names(mat) == "P2ry12")
# Outputs 1 if the marker is present, 0 if it isn't

# Pull out just the read counts for your defined markers
y <- mat[row.names(mat) %in% markers, ]

# Now we can count the number of cells containing zero transcripts for each of the examined markers.
# This enables an examination of the number of cells that have zero expression for these markers,
# and therefore the number of cells that can be considered non-target cells.

# First count all cells,
# then make a loop to cycle through all markers (defined in the previously created 'markers' vector)
a <- length(colnames(y))
for (i in 1:length(markers)) {
  a <- c(a, sum(y[i, ] == 0))
}

# Do a sum of the columns
y2 <- colSums(y)

# See if any are zero. If so, these cells are not target cells
# (as determined by the absence of any target cell markers)
count <- c(a, sum(y2 == 0))

# Name the vector elements
names(count) <- c("Total_cells", markers, "All_zero")

# Generate the table
as.data.frame(count)

# The above table shows the total number of cells for your sample,
# the number of cells that had zero expression for each marker,
# and the number of cells that had zero expression for all of the markers you provided.
```
2g. Processing expression data (dimensionality reduction)
There are a variety of methods to visualise expression in single cell data. The most commonly used methods (PCA, t-SNE and UMAP) involve dimensionality reduction, i.e. converting the gene expression per cell into a small number of dimensions, which can then be plotted.
Seurat can generate and store PCA, t-SNE and UMAP data in the Seurat object we created previously ('mat2'), but first the raw data needs to be processed in a variety of ways:
1. Normalise the data by log transformation
2. Identify genes that exhibit high cell-to-cell variation
3. Scale the data so that highly expressed genes don't dominate the visual representation of expression
4. Perform the linear dimensional reduction that converts expression to dimensions
5. Plot the x-y dimension data (i.e. first 2 dimensions)
The first 4 steps are completed in the code below (this may take a few minutes to run).
```r
#### 2g. Processing expression data (dimensionality reduction) ####

# Normalise data
mat3 <- NormalizeData(mat2)

# Identification of variable features
mat3 <- FindVariableFeatures(mat3, selection.method = "vst", nfeatures = nrow(mat3))

# Scaling the data
all.genes <- rownames(mat3)
mat3 <- ScaleData(mat3, features = all.genes)

# Perform linear dimensional reduction (PCA)
mat3 <- RunPCA(mat3, features = VariableFeatures(object = mat3))
```
2h. Plot of highly variable genes
Using the FindVariableFeatures results, we can visualise the most highly variable genes, including a count of variable and non-variable genes in your dataset. The below code outputs the top 10 genes, but you can adjust this number as desired (i.e. in top_genes <- head(VariableFeatures(mat3), 10), change 10 to another number).
NOTE: In the below plot you can change a number of parameters to modify the plot to look how you like. This can be done for any of the plots in these notebooks. In the plot below you can change:

- Dot size: pt.size = 2. Increase or decrease the number to increase or decrease dot size.
- Dot colours: cols = c("black", "firebrick"). Change the colours to whatever you like. A list of R colour names is here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
- Theme: theme_bw(). There are several default plot themes you can choose from, which change a variety of plot parameters. See here: https://ggplot2.tidyverse.org/reference/ggtheme.html
- Axis text size: theme(text = element_text(size = 17)). There are a large number of parameters that can be modified with theme(). Here we've just changed the axis text to size 17. See here for other parameters that can be changed with theme(): https://ggplot2.tidyverse.org/reference/theme.html
Seurat plots are based on the ggplot2 package. There are a multitude of other modifications you can make to a ggplot, too many to describe in this notebook, but there are plenty of online guides on how to modify ggplot plots. Here's an example: http://www.sthda.com/english/wiki/be-awesome-in-ggplot2-a-practical-guide-to-be-highly-effective-r-software-and-data-visualization
```r
#### 2h. Plot of highly variable genes ####

# Identify the 10 most highly variable genes
top_genes <- head(VariableFeatures(mat3), 10)

# Plot variable features with labels
p <- VariableFeaturePlot(mat3, pt.size = 2, cols = c("black", "firebrick"))
p <- LabelPoints(plot = p, points = top_genes, repel = TRUE) +
  theme_bw() + theme(text = element_text(size = 17))
p

# You can save your plot as a 300dpi (i.e. publication quality) tiff or pdf file.
# These files can be found in your working directory.
# You can adjust the width and height of the saved images by changing
# 'width =' and 'height =' in the code below.

# Export as a 300dpi tiff
tiff_exp <- paste0(sample, "_top_genes.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff",
       plot = p, width = 20, height = 20, units = "cm")

# Export as a pdf
pdf_exp <- paste0(sample, "_top_genes.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")
```