Aim
Analyse 10x Genomics* single cell RNA-Seq data.
This analysis workflow is split into 2 main sections: 1) ‘Upstream’ analysis on QUT’s HPC (high performance compute cluster) using the nf-core/scrnaseq Nextflow workflow, and 2) ‘Downstream’ analysis in R, primarily using the Seurat package.
*Note: this workflow can be adapted to work with scRNA-Seq datasets generated by sequencing technologies other than 10x Genomics.
Requirements
An HPC account. If you do not have one, please request one here.
Access to an rVDI virtual desktop machine with 64GB RAM. For information about rVDI and requesting a virtual machine, see here.
Nextflow installed on your HPC home account. If you haven’t already installed Nextflow, do so by following the guide here.
Your scRNA-Seq data (fastq files) are on the HPC. If you are having difficulties transferring them to the HPC, submit a support ticket here.
Connect to an rVDI virtual desktop machine
Ensure you already have access to a 64GB RAM rVDI virtual machine, or request access by following the guide here.
scRNA-Seq datasets are often very large and need a lot of memory to analyse. The downstream analysis is run in R on a Windows machine, and your own PC is unlikely to have enough RAM, so we use virtual machines with 64GB RAM. You can also run the rest of this analysis from the virtual machine.
To access and run your rVDI virtual desktop:
Go to https://rvdi.qut.edu.au/
Click on ‘VMware Horizon HTML Access’
Log on with your QUT username and password
*NOTE: you need to be connected to the QUT network first, either being on campus or connecting remotely via VPN.
Set up your rVDI environment
Now that you’ve connected to an rVDI virtual machine, you’ll need to set it up to:
Connect to your home directory on the HPC, so you can access your data files
Install R and RStudio for running the Seurat analysis
Connect to your HPC home directory
Using Windows File Explorer, we can map our HPC Home folder and the shared Work folder to drive letters. Here we will map our home drive to ‘H’ and our shared work directory to ‘W’.
Open File Explorer (folder icon in the Windows task bar).
Right click “This PC” and choose Map Network Drive.
Home drive: select ‘H’ as the drive letter, then copy and paste \\hpc-fs\home into the Folder box. Click ‘Finish’.
Work drive: in File Explorer, again right click “This PC” and choose Map Network Drive. Select ‘W’ as the drive letter, then copy and paste \\hpc-fs\work into the Folder box. Click ‘Finish’.
To see this demonstrated, watch this video:
https://mediahub.qut.edu.au/media/t/0_ylaejs40
Now you’ll be able to browse and copy files between your virtual Windows machine and the HPC.
Installing R and RStudio
Seurat is an R package, so first we need to install R and RStudio on the rVDI machine. You will be copying and pasting code from this workflow into RStudio.
Download and install R, following the default prompts:
Download R-4.3.2 for Windows. The R-project for statistical computing.
Download and install RStudio, following the default prompts:
https://posit.co/download/rstudio-desktop/
1. nf-core/scrnaseq
10x scRNA-Seq data is typically processed using various Cell Ranger software tools. These (and other) tools have been combined in an nf-core Nextflow workflow called scrnaseq.
NOTE: sometimes your 10x data has already been processed by your sequencing company, using Cell Ranger. In this case you can skip the nf-core/scrnaseq analysis and go straight to the downstream Seurat analysis.
Workflow overview
As can be seen in the workflow below, there are several workflow options. The one we’ll be using is the complete Cell Ranger workflow, using the tools cellranger mkgtf and cellranger mkref for reference genome preparation and cellranger count for both aligning sequences to the reference genome and quantifying expression per gene per cell, for each sample.
2. Seurat
Seurat is:
A toolkit for quality control, analysis, and exploration of single cell RNA sequencing data. 'Seurat' aims to enable users to identify and interpret sources of heterogeneity from single cell transcriptomic measurements, and to integrate diverse types of single cell data.
It is also an R package, so we will be using RStudio (which you installed earlier) to run the analysis script.
Set your working directory
In R, your working directory is where data files are read into R from and where any output files are written. For our purposes, we need to set the working directory to the location on the HPC where your scRNA-Seq dataset is.
Most of the scripts can be run without modification, but there are a few lines that you will need to change, such as the working directory (which will differ for each researcher’s dataset).
When you see **USER INPUT** in the script, this means you have to modify the line below this.
You can manually set your working directory in RStudio by selecting ‘Session’ -> ‘Set working directory’ -> ‘Choose directory’ and then pointing it to the directory that contains your dataset. This will output the ‘setwd()’ command with your working directory into the console window (bottom left panel). Copy this command to replace the default one in the code below.
Copy and paste this code into your R script, then run it (do the same with the code in all following sections).
#### Set your working directory ####
# **USER INPUT**
setwd("C:/Users/whatmorp/OneDrive - Queensland University of Technology/Desktop/Manuscripts in progress/Fazeleh scRNA-Seq/New_analysis_script")
# You can see the sample subdirectories by:
list.dirs(full.names = FALSE, recursive = FALSE)
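A mistyped path makes setwd() stop with an error that is easy to overlook when a whole script is run at once. The following is a minimal sketch, not part of the original workflow, that checks the directory exists before switching to it. The helper name `set_analysis_dir` and the example path in the final comment are hypothetical.

```r
# Hypothetical helper (not part of the original script): set the working
# directory only if it exists, then return the immediate subdirectories.
set_analysis_dir <- function(path) {
  if (!dir.exists(path)) {
    stop("Directory not found: ", path, call. = FALSE)
  }
  setwd(path)
  # Immediate subdirectories of the new working directory (one per sample)
  list.dirs(full.names = FALSE, recursive = FALSE)
}

# Example call; the path is a placeholder, replace it with your own location
# samples <- set_analysis_dir("W:/my_project/scrnaseq_results")
```

If the returned vector does not contain your sample names, the working directory is probably one level too high or too low.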
Installing packages
Note: this section installs several R packages and their dependencies. It will take several minutes to run.
#### Installing required packages ####
# Create vectors of required package names
bioconductor_packages <- c("clusterProfiler", "pathview", "AnnotationHub", "org.Mm.eg.db")
cran_packages <- c("Seurat", "patchwork", "ggplot2", "tidyverse", "viridis", "plyr", "readxl", "scales")
# Compare installed packages to the vectors above and return the missing packages
new_packages <- bioconductor_packages[!(bioconductor_packages %in% installed.packages()[,"Package"])]
new_cran_packages <- cran_packages[!(cran_packages %in% installed.packages()[,"Package"])]
# Install missing Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
if (length(new_packages)) BiocManager::install(new_packages)
# Install missing CRAN packages
if (length(new_cran_packages)) install.packages(new_cran_packages, repos = "http://cran.us.r-project.org")
# Update the required packages to the latest version
# (package names go in oldPkgs; the first argument of update.packages() is a library path)
update.packages(oldPkgs = bioconductor_packages, ask = FALSE, repos = BiocManager::repositories())
update.packages(oldPkgs = cran_packages, ask = FALSE, repos = "http://cran.us.r-project.org")
Loading packages
#### Loading required packages ####
# This section needs to be run every time
# Load packages
bioconductor_packages <- c("clusterProfiler", "pathview", "AnnotationHub", "org.Mm.eg.db")
cran_packages <- c("Seurat", "patchwork", "ggplot2", "tidyverse", "viridis", "plyr", "readxl", "scales")
lapply(cran_packages, require, character.only = TRUE)
lapply(bioconductor_packages, require, character.only = TRUE)
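Note that require() returns FALSE with only a warning when a package cannot be loaded, so a failed installation may go unnoticed until a later function call fails. As a hedged alternative, the sketch below loads each package and stops with a clear message if any are missing. The helper name `load_or_stop` is hypothetical and not part of the original script.

```r
# Hypothetical helper (not part of the original script): attach each package
# and stop with a clear message naming any that could not be loaded.
load_or_stop <- function(packages) {
  loaded <- vapply(
    packages,
    function(pkg) {
      suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
    },
    logical(1)
  )
  if (!all(loaded)) {
    stop("Could not load: ", paste(packages[!loaded], collapse = ", "),
         call. = FALSE)
  }
  invisible(loaded)
}

# Example call, using the vectors defined in the loading section:
# load_or_stop(c(cran_packages, bioconductor_packages))
```

If this stops with an error, re-run the installation section for the packages it names before continuing.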
Cell Ranger (and nf-core/scrnaseq) generates a default directory and file output structure for each sample, which we’ll use in R to complete our analysis. Each sample has a directory named after the sample, with an ‘outs’ subdirectory under it. This ‘outs’ directory contains various files and subdirectories. The subdirectory that contains the count matrix data we need for the Seurat analysis is called ‘filtered_feature_bc_matrix’.
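The directory layout described above can be checked from R before any data is loaded. The following is a minimal sketch that assumes the working directory contains one subdirectory per sample, each laid out as <sample>/outs/filtered_feature_bc_matrix. The helper name `find_matrix_dirs` is hypothetical. The commented lines at the end use Seurat's Read10X() and CreateSeuratObject() as one possible next step, not necessarily the one this workflow takes.

```r
# Hypothetical helper (not part of the original script): find each sample's
# filtered count matrix directory under `root`, assuming the layout
# <sample>/outs/filtered_feature_bc_matrix.
find_matrix_dirs <- function(root = ".") {
  samples <- list.dirs(root, full.names = FALSE, recursive = FALSE)
  matrix_dirs <- file.path(root, samples, "outs", "filtered_feature_bc_matrix")
  names(matrix_dirs) <- samples
  # Keep only samples where the matrix directory actually exists
  matrix_dirs[dir.exists(matrix_dirs)]
}

matrix_dirs <- find_matrix_dirs(".")
print(matrix_dirs)

# With Seurat loaded, one sample's counts could then be read like this:
# counts <- Read10X(data.dir = matrix_dirs[[1]])
# seurat_obj <- CreateSeuratObject(counts = counts, project = names(matrix_dirs)[1])
```

Samples missing from the printed vector either did not finish processing or are stored with a different directory structure.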