2024 eResearch - Session 6 - Hands-on smRNAseq training

Public small RNA-seq data

Species: Human

ENA link: https://www.ebi.ac.uk/ena/browser/view/PRJEB5212?show=publications

Description: RNA-seq of micro RNAs (miRNAs) in human prefrontal cortex, used to identify differentially expressed miRNAs between Huntington's Disease and control brain samples

1. Connect to an rVDI virtual desktop machine

To access and run an rVDI virtual desktop:

Go to https://rvdi.qut.edu.au/

Click on ‘VMware Horizon HTML Access’

Log on with your QUT username and password

*NOTE: you need to be connected to the QUT network first, either by being on campus or by connecting remotely via VPN.

2. Open PuTTY terminal

  • Click on the PuTTY icon

  • Double click on “Lyra”

  • Enter your password and connect to the HPC

Copying data for hands-on exercises

Before we start using the HPC, let’s start an interactive session:

qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=1:mem=4gb
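Once the interactive session starts, you can optionally confirm that you are now on a compute node rather than the login node:

# The hostname should change from the login node to a compute node
hostname
# List your current jobs; the interactive job should be in the R (running) state
qstat -u $USER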

Get a copy of the scripts to be used in this module

Use the terminal to log into the HPC and create a small_RNAseq folder structure for running the nf-core/smrnaseq pipeline. For example:

mkdir -p $HOME/workshop/small_RNAseq/scripts
cp /work/training/smallRNAseq/scripts/* $HOME/workshop/small_RNAseq/scripts/
ls -l $HOME/workshop/small_RNAseq/scripts/
  • Line 1: The -p flag tells mkdir to create parent directories as required, so this single command creates $HOME/workshop/small_RNAseq/ and its scripts/ subfolder

  • Line 2: Copies all files (the asterisk wildcard) from /work/training/smallRNAseq/scripts/ to the newly created folder $HOME/workshop/small_RNAseq/scripts/

  • Line 3: Lists the files in the scripts folder

Copy multiple subdirectories and files using rsync

mkdir -p $HOME/workshop/small_RNAseq/data/
rsync -rv /work/training/smallRNAseq/data/ $HOME/workshop/small_RNAseq/data/
  • Line 1: Creates the destination folder $HOME/workshop/small_RNAseq/data/

  • Line 2: rsync copies all subfolders and files from the specified source folder to the selected destination folder. The -r flag copies directories and files recursively; -v prints verbose messages about the transfer (an optional dry-run check is shown below)
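If you want to preview what rsync would transfer before copying anything, you can add the standard -n (--dry-run) flag:

# Dry run: list what would be copied, without transferring any files
rsync -rvn /work/training/smallRNAseq/data/ $HOME/workshop/small_RNAseq/data/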

Create a folder for running the nf-core small RNA-seq pipeline

Let’s create a “run” folder for each nf-core/smrnaseq exercise:

mkdir -p $HOME/workshop/small_RNAseq
mkdir $HOME/workshop/small_RNAseq/run1_test
mkdir $HOME/workshop/small_RNAseq/run2_smallRNAseq_human
cd $HOME/workshop/small_RNAseq/
  • Lines 1-3: create the parent folder and a sub-folder for each exercise

  • Line 4: change into the “small_RNAseq” folder (a quick check of the resulting layout is shown below)
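A quick way to check the folder layout created above:

ls -l $HOME/workshop/small_RNAseq/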

Exercise 1: Running a test with nf-core sample data

First, let’s assess the execution of the nf-core/smrnaseq pipeline by running a test using the pipeline’s bundled sample data.

Copy the launch_nf-core_smallRNAseq_test.pbs script to the working directory:

cd $HOME/workshop/small_RNAseq/run1_test
cp $HOME/workshop/small_RNAseq/scripts/launch_nf-core_smallRNAseq_test.pbs .

View the content of the script as follows:

cat launch_nf-core_smallRNAseq_test.pbs

#!/bin/bash -l

#PBS -N nfsmrnaseq

#PBS -l select=1:ncpus=2:mem=4gb

#PBS -l walltime=24:00:00

#work on current directory (folder)

cd $PBS_O_WORKDIR

#load java and set up memory settings to run nextflow

module load java

export NXF_OPTS='-Xms1g -Xmx4g'

# run the test

nextflow run nf-core/smrnaseq -profile test,singularity --outdir results -r 2.1.0

where:

  • nextflow command: nextflow run

  • pipeline name: nf-core/smrnaseq

  • pipeline version: -r 2.1.0

  • container type and sample data: -profile test,singularity

  • output directory: --outdir results
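If you would like to pre-fetch the pipeline code before submitting the job (optional, and not required for this exercise), Nextflow can pull a specific release in advance:

module load java
nextflow pull nf-core/smrnaseq -r 2.1.0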

Submitting the job

Now we can submit the small RNAseq test job to the HPC scheduler:

qsub launch_nf-core_smallRNAseq_test.pbs

Monitoring the Run

qjobs
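While the job is running, you can also follow Nextflow’s own log file, which is written to the directory the job was launched from (press Ctrl+C to stop following):

tail -f .nextflow.log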

Exercise 2: Running the small RNA pipeline using public human data

The pipeline requires preparing at least 2 files:

  • Metadata file (samplesheet.csv) that specifies the sample name and the location of each FASTQ file ('Read 1'); see the example layout after this list.

  • PBS Pro script (launch_nf-core_smallRNAseq_human.pbs) with instructions to run the pipeline
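For reference, the finished samplesheet is a plain-text CSV with a header row and one line per sample, in this general form (the sample name and path below are illustrative):

sample,fastq_1
SAMPLE_A,/path/to/SAMPLE_A.fastq.gz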

Create the metadata file (samplesheet.csv):

Change to the data folder directory:

cd $HOME/workshop/small_RNAseq/data/human
pwd

Copy the bash script to the working folder

cp /work/training/smallRNAseq/scripts/create_nf-core_smallRNAseq_samplesheet.sh $HOME/workshop/small_RNAseq/data/human
  • Note: you could replace ‘$HOME/workshop/small_RNAseq/data/human’ with “.”. A dot indicates the current directory, so the file will be copied to the directory where you are currently located

View the content of the script:

cat create_nf-core_smallRNAseq_samplesheet.sh

#!/bin/bash -l

#User defined variables.

##########################################################

DIR="$HOME/workshop/small_RNAseq/data/human"

INDEX='samplesheet.csv'

##########################################################

#load python module

module load python/3.10.8-gcccore-12.2.0

#fetch the script to create the sample metadata table

wget -L https://raw.githubusercontent.com/nf-core/rnaseq/master/bin/fastq_dir_to_samplesheet.py

chmod +x fastq_dir_to_samplesheet.py

#generate initial sample metadata file

./fastq_dir_to_samplesheet.py  $DIR index.csv \

        --strandedness auto \

        --read1_extension .fastq.gz

#format index file

cat index.csv | awk -F "," '{print $1 "," $2}' > ${INDEX}

#Remove intermediate files:

rm index.csv fastq_dir_to_samplesheet.py
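The awk step near the end of the script keeps only the first two comma-separated fields of the intermediate index.csv (sample and fastq_1) and drops the remaining columns (e.g. strandedness), since the smrnaseq samplesheet only needs the sample name and the Read 1 file. For example (illustrative input line):

echo "SAMPLE_A,/path/to/SAMPLE_A.fastq.gz,auto" | awk -F "," '{print $1 "," $2}'
# prints: SAMPLE_A,/path/to/SAMPLE_A.fastq.gz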

Let’s generate the metadata file by running the following command:

sh create_nf-core_smallRNAseq_samplesheet.sh

Check the newly created samplesheet.csv file:

ls -l
cat samplesheet.csv

sample,fastq_1

SRR20753704,/work/training/smallRNAseq/data/SRR20753704.fastq.gz

SRR20753705,/work/training/smallRNAseq/data/SRR20753705.fastq.gz

SRR20753706,/work/training/smallRNAseq/data/SRR20753706.fastq.gz

SRR20753707,/work/training/smallRNAseq/data/SRR20753707.fastq.gz

SRR20753708,/work/training/smallRNAseq/data/SRR20753708.fastq.gz

SRR20753709,/work/training/smallRNAseq/data/SRR20753709.fastq.gz

SRR20753716,/work/training/smallRNAseq/data/SRR20753716.fastq.gz

SRR20753717,/work/training/smallRNAseq/data/SRR20753717.fastq.gz

SRR20753718,/work/training/smallRNAseq/data/SRR20753718.fastq.gz

SRR20753719,/work/training/smallRNAseq/data/SRR20753719.fastq.gz

SRR20753720,/work/training/smallRNAseq/data/SRR20753720.fastq.gz

SRR20753721,/work/training/smallRNAseq/data/SRR20753721.fastq.gz


Copy the PBS Pro script for running the full small RNAseq pipeline (launch_nf-core_smallRNAseq_human.pbs)

Copy and paste the code below to the terminal:

cp $HOME/workshop/small_RNAseq/data/human/samplesheet.csv $HOME/workshop/small_RNAseq/run2_smallRNAseq_human
cp $HOME/workshop/small_RNAseq/scripts/launch_nf-core_smallRNAseq_human.pbs $HOME/workshop/small_RNAseq/run2_smallRNAseq_human
cd $HOME/workshop/small_RNAseq/run2_smallRNAseq_human
  • Line 1: Copy the samplesheet.csv file to the working directory

  • Line 2: copy the launch_nf-core_smallRNAseq_human.pbs submission script to the working directory

  • Line 3: move to the working directory

View the content of the launch_nf-core_smallRNAseq_human.pbs script:

cat launch_nf-core_smallRNAseq_human.pbs

#!/bin/bash -l

#PBS -N nfsmallRNAseq

#PBS -l select=1:ncpus=2:mem=4gb

#PBS -l walltime=24:00:00

#PBS -m abe

 

#run the tasks in the current working directory

cd $PBS_O_WORKDIR

#load java and assign up to 4GB RAM memory for nextflow to use

module load java

export NXF_OPTS='-Xms1g -Xmx4g'

 

#run the small RNAseq pipeline

nextflow run nf-core/smrnaseq -r 2.1.0 \

        -profile singularity \

        --outdir results \

        --input samplesheet.csv \

        --genome GRCh38-local \

        --mirtrace_species hsa \

        --three_prime_adapter 'TGGAATTCTCGGGTGCCAAGG' \

        --fastp_min_length 18 \

        --fastp_max_length 30 \

        --hairpin /work/training/smallRNAseq/data/mirbase/hairpin.fa \

        --mature /work/training/smallRNAseq/data/mirbase/mature.fa \

        --mirna_gtf /work/training/smallRNAseq/data/mirbase/hsa.gff3 \

        -resume
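To see the full list of parameters accepted by this pipeline release (including those used above), you can print the pipeline’s built-in help from a terminal where Nextflow is available; this is optional and for reference only:

module load java
nextflow run nf-core/smrnaseq -r 2.1.0 --help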

Submit the job to the HPC cluster:

qsub launch_nf-core_smallRNAseq_human.pbs

Monitor the progress:

qjobs

The job will take several hours to run, hence we will use precomputed results for the statistical analysis in the next section.

Precomputed results:

We ran the small RNA seq samples and the results can be found at:

/work/training/smallRNAseq/runs/run2_smallRNAseq_human
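You can browse these precomputed results directly from the terminal, for example:

ls /work/training/smallRNAseq/runs/run2_smallRNAseq_human/results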

The results of the miRNA profiling can be found in the folder called “edger”:

results/
├── edger
├── fastp
├── fastqc
├── genome
├── index
├── mirdeep
├── mirdeep2
├── mirtop
├── mirtrace
├── multiqc
├── pipeline_info
├── samtools
└── unmapped

Inside the “edger” folder, find the “mature_counts.csv” file:

hairpin_counts.csv
hairpin_CPM_heatmap.pdf
hairpin_edgeR_MDS_distance_matrix.txt
hairpin_edgeR_MDS_plot_coordinates.txt
hairpin_edgeR_MDS_plot.pdf
hairpin_log2CPM_sample_distances_dendrogram.pdf
hairpin_log2CPM_sample_distances_heatmap.pdf
hairpin_log2CPM_sample_distances.txt
hairpin_logtpm.csv
hairpin_logtpm.txt
hairpin_normalized_CPM.txt
hairpin_unmapped_read_counts.txt
mature_counts.csv   <-- we will use this file for the statistical analysis in the next section
mature_counts.txt
mature_CPM_heatmap.pdf
mature_edgeR_MDS_distance_matrix.txt
mature_edgeR_MDS_plot_coordinates.txt
mature_edgeR_MDS_plot.pdf
mature_log2CPM_sample_distances_dendrogram.pdf
mature_log2CPM_sample_distances_heatmap.pdf
mature_log2CPM_sample_distances.txt
mature_logtpm.csv
mature_logtpm.txt
mature_normalized_CPM.txt
mature_unmapped_read_counts.txt

Note: the “mature_counts.csv” file needs to be transposed prior to running the statistical analysis. This can be done either using the R script or using a script called “transpose_csv.py”.

Let’s initially create a “DESeq2” folder and copy the files needed for the statistical analysis:

mkdir -p $HOME/workshop/small_RNAseq/DESeq2
cp $HOME/workshop/small_RNAseq/scripts/transpose_csv.py $HOME/workshop/small_RNAseq/DESeq2
cp $HOME/workshop/small_RNAseq/data/human/metadata_microRNA.txt $HOME/workshop/small_RNAseq/DESeq2
cp /work/training/smallRNAseq/runs/run2_smallRNAseq_human/results/edger/mature_counts.csv $HOME/workshop/small_RNAseq/DESeq2
cd $HOME/workshop/small_RNAseq/DESeq2

To transpose the initial “mature_counts.csv” file, do the following:

python transpose_csv.py --input mature_counts.csv --out mature_counts.txt
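As a quick, optional sanity check, the first lines of the transposed file should now show miRNA identifiers as rows, with one column per sample:

head -3 mature_counts.txt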

 

Exercise 3: Running the small RNA pipeline using MirGeneDB

TBA

 

Differential expression analysis using RStudio

Differential expression analysis for smRNA-Seq is similar to regular RNA-Seq. Since you have already done the step-wise analysis in session 3, in this session we will streamline the analysis by running a single R script.

As with the previous RNA-Seq, we will also be running this smRNA-Seq differential expression analysis in RStudio on an rVDI virtual machine. The reason is the same as before - to save time as the required R packages are pre-installed on these virtual machines. And, as before, you can also copy and paste this script to RStudio on your local computer and adapt it to your own dataset.

1. Connect to an rVDI virtual desktop machine

To access and run an rVDI virtual desktop:

Go to https://rvdi.qut.edu.au/

Click on ‘VMware Horizon HTML Access’

Log on with your QUT username and password

*NOTE: you need to be connected to the QUT network first, either by being on campus or by connecting remotely via VPN.

 

2. Create a working directory

As we discussed in section 3, R requires that you set a working directory, where it automatically looks for input files/data and outputs figures, tables, etc. We’ll need to first create this directory.

a. Open Windows Explorer.

b. Go to: H:\workshop\small_RNAseq

c. Create a new folder here called ‘DESeq2’ (NOTE: R is case-sensitive, so it must be named exactly like this)

 

3. Run analysis script in RStudio

 

a. Open RStudio

b. Create a new R script ('File'->'New File'-> ‘R script’)

c. Hit the save button and save this file in the working directory you created above (H:\workshop\small_RNAseq\DESeq2). Name the R script ‘DESeq2.R’.

d. Copy and paste the entire script from the code window below into your R script.

e. Run the entire script ('Code' -> 'Run Region' -> 'Run All')

 

#### 4. Loading required packages ####
# This section needs to be run every time

# Load packages
bioconductor_packages <- c("DESeq2", "EnhancedVolcano")
cran_packages <- c("ggrepel", "ggplot2", "plyr", "reshape2", "FactoMineR", "factoextra", "pheatmap")
lapply(cran_packages, require, character.only = TRUE)
lapply(bioconductor_packages, require, character.only = TRUE)

#### 5. Import your count data ####
# Set the working directory. Change this to your own working directory.
setwd("H:/workshop/small_RNAseq/DESeq2")

# Import your count data (the mature miRNA count table from the pipeline's 'edger' results)
metacounts <- read.csv("W:/training/smallRNAseq/runs/run2_smallRNAseq_human/results/edger/mature_counts.csv", header = TRUE, row.names = 1)

# The count table needs to be transposed and converted to a data frame
metacounts <- as.data.frame(t(metacounts))

# Import the metadata (metadata_microRNA.txt)
meta <- read.table("W:/training/smallRNAseq/runs/run2_smallRNAseq_human/results/edger/metadata_microRNA.txt", header = TRUE)

# Rename sample names to new sample IDs
counts <- metacounts[as.character(meta$sample_name)]
colnames(counts) <- meta$sample_ID

#### 6. Outliers and batch effects ####
# This section normalises and transforms the count data so that it can be plotted on a PCA plot and a heatmap

## USER INPUT
# Choose the groups you want to plot in a PCA/heatmap. You can select any 2 or more of the groups
# (or all of the groups) you have in the 'group' column of your metadata table.
# To see what groups are present, run the following:
unique(meta$group)

# Now add which groups you want to plot (i.e. replace the group names below, and add more, separated by a comma
# and in "quotes", as needed). NOTE: R is case-sensitive, so these group names must be spelled EXACTLY as in the metadata table.
plotgroups <- c("normal", "Huntingtons_disease")

# Pull out only the counts from the above groups
groupcounts <- counts[meta$group %in% plotgroups]

# Normalise counts by library size, using DESeq2's estimateSizeFactors() function. Note that DESeq2 does this
# internally during DEG calling; the normalisation below is done separately for PCA and density plotting.
# Set up the initial DESeq2 experimental parameters
condition <- factor(1:length(groupcounts))

# Set up the column data: a data frame of sample IDs and conditions
coldata <- data.frame(row.names=colnames(groupcounts), condition)

# Set up the DESeq2 data set structure
f <- DESeqDataSetFromMatrix(countData = groupcounts, colData = coldata, design= ~ condition)

# Estimate the size factors. See the DESeq2 manual for details
f <- estimateSizeFactors(f)

# Size factors can be viewed by:
sizeFactors(f)

# Multiply each row (sample) by the corresponding size factor
subcount_norm <- as.matrix(groupcounts) %*% diag(sizeFactors(f))

# Re-add column names
colnames(subcount_norm) <- colnames(groupcounts)

## Remove low coverage transcripts (mean count < 10) ##
# Find the mean of each row (and output as a data frame object)
means <- as.data.frame(rowMeans(subcount_norm))

# Then join the means data with the counts
means <- cbind(means, subcount_norm)

# Then subset out only genes with mean > 10
data <- subset(means, means[ , 1] > 10)

# Remove the means column
data <- data[,-1]

# Transform data
data_log <- vst(round(as.matrix(data)), nsub = nrow(data)-20)

# Transformation can create some infinite values. Can't generate PCA data on these. Can see how many by:
sum(sapply(data_log, is.infinite))

# To remove infinite rows, use 'is.finite' or '!is.infinite'
data_log <- data_log[is.finite(rowSums(data_log)),]
colnames(data_log) <- colnames(groupcounts)

### Set up the PCA plot base data ###
# We're using the FactoMineR package to generate PCA plots (http://factominer.free.fr/index.html)
# Need to transpose the data first
data_log_t <- t(data_log)

# Add the group data
data_log_t_vars <- data.frame(meta$group[meta$group %in% plotgroups], data_log_t)

# Generate the PCA data using the FactoMineR package
res.pca <- PCA(data_log_t_vars, quali.sup = 1, graph=FALSE)

## Set up the dendrogram/heatmap base data ##
# Calculate the distance matrix:
distance_matrix <- as.matrix(dist(t(data_log)))

#### 6a. PCA plot ####
# Generate the PCA plot. Groups are shaded with ellipses at the 95% confidence level.
# NOTE: at least 4 replicates need to be in a group for an ellipse to be drawn.
# NOTE: change the group point colours by changing 'palette = ' below. Use the 'RColorBrewer' colour names
# (https://r-graph-gallery.com/38-rcolorbrewers-palettes.html). For example, if you are plotting 3 groups and
# choose palette = "Set1", this will use the first 3 colours from the Set1 colour palette.
p <- fviz_pca_ind(res.pca,
        geom.ind = c("point", "text"), # show points and labels
        col.ind = meta$group[meta$group %in% plotgroups], # colour by groups
        pointsize = 5, label = "all", title = "",
        legend.title = "Treatment groups", palette = "Dark2",
        addEllipses = TRUE, ellipse.type = "t", ellipse.level = 0.95) +
     theme(legend.text = element_text(size = 12), legend.title = element_text(size = 14),
        axis.title=element_text(size=16), axis.text=element_text(size=14))
p

# Output as publication quality (300dpi) tiff and pdf.
# This will name your output files with the treatment groups you selected.
# Create a 'results_outliers_included' subdirectory where all results will be output
dir.create("results_outliers_included", showWarnings = FALSE)

# Create a (300dpi) tiff
ggsave(file = paste0("./results_outliers_included/PCA_", paste(plotgroups, collapse = "_Vs_"), ".tiff"), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p)

# Create a pdf
ggsave(file = paste0("./results_outliers_included/PCA_", paste(plotgroups, collapse = "_Vs_"), ".pdf"), device = "pdf", width = 10, height = 8, plot = p)

#### 6b. Samples heatmap and dendrogram ####
# This section plots a heatmap and dendrogram of pairwise relationships between samples,
# so you can see whether samples cluster by treatment group.
# See here: https://davetang.org/muse/2018/05/15/making-a-heatmap-in-r-with-the-pheatmap-package/

# Define annotation column
annot_columns <- data.frame(meta$group[meta$group %in% plotgroups])

# Make the row names the sample IDs
row.names(annot_columns) <- meta$sample_ID[meta$group %in% plotgroups]
colnames(annot_columns) <- "Treatment groups"

# Need to factorise it
annot_columns[[1]] <- factor(annot_columns[[1]])

# Generate dendrogram and heatmap
pheatmap(distance_matrix,
        color=colorRampPalette(c("white", "#9999FF", "#990000"))(50),
        cluster_rows = TRUE, show_rownames = TRUE,
        treeheight_row = 0, treeheight_col = 70, fontsize_col = 12,
        annotation_names_col = F, annotation_col = annot_columns,
        filename = paste0("./results_outliers_included/Pairwise_sample_heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff"))

# Notes about heatmap colours:
# You can change the colours used in the heatmap itself by changing the colour names (color=colorRampPalette....)
# If you want to change the annotation colours, see here: https://zhiganglu.com/post/pheatmap_change_annotation_colors/

#### 7. Differential expression analysis ####
# In this section we use the DESeq2 package to identify differentially expressed genes.

## USER INPUT
# Choose the treatment groups you want to compare.
# To see what groups are present, run the following:
unique(meta$group)

# Enter which groups you want to compare (two groups only). BASELINE OR CONTROL GROUP SHOULD BE LISTED FIRST.
degroups <- c("normal", "Huntingtons_disease")

# From the count table, pull out only the counts from the above groups
expdata <- as.matrix(counts[,meta$group %in% degroups])

# Set up the experimental condition.
# 'factor' sets up the reference level, i.e. which is the baseline group (otherwise the default baseline level is in alphabetic order)
condition <- factor(meta$group[meta$group %in% degroups], levels = degroups)
# Type 'condition' in the console to see if the levels are set correctly

# Set up column data (treatment groups and sample ID)
coldata <- data.frame(row.names=colnames(expdata), condition)

# Create the DESeq2 dataset (dds)
dds <- DESeq2::DESeqDataSetFromMatrix(countData=expdata, colData=coldata, design=~condition)
dds$condition <- factor(dds$condition, levels = degroups)

# Run DESeq2 to identify differentially expressed genes
deseq <- DESeq(dds)

# Extract a results table from the DESeq analysis
res <- results(deseq)

# Reorder results by adjusted p value, so that the most significantly DE genes are at the top
res <- res[order(res$padj), ]

# You can do a summary of the results to see how many significantly (alpha=0.05, adjust to 0.01 if needed)
# upregulated and downregulated DE genes were found
summary(res, alpha=0.05)

# Convert from DESeq object to a data frame
res <- data.frame(res)

# Look at the top 6 DE genes
head(res)

# Add normalised counts to the output table, so you can later plot expression trends for individual genes in R, Excel, etc.
# Need to normalise the counts first, using the size factors calculated by DESeq2 (in the 'deseq' object)
expdata_norm <- as.matrix(expdata) %*% diag(deseq$sizeFactor)
colnames(expdata_norm) <- colnames(expdata)
annot_counts <- merge(x = res, y = expdata_norm, by = 0, all = TRUE)

# Pull out just the significant genes (change from 0.05 to 0.01 if needed)
DE_genes <- subset(annot_counts, padj < 0.05, select=colnames(annot_counts))

# Export as a csv table
write.csv(DE_genes, file=paste0("./results_outliers_included/DE_genes_", paste(degroups, collapse = "_Vs_"), ".csv"), row.names = FALSE)

#### 7b. Volcano plot ####
p <- EnhancedVolcano(res, lab = row.names(res), selectLab = row.names(res)[1:20], drawConnectors = TRUE, title = NULL, subtitle = NULL, x = 'log2FoldChange', y = 'pvalue')
p <- EnhancedVolcano(res, lab = rownames(res), pointSize = 3, drawConnectors = TRUE, title = NULL, subtitle = NULL, x = 'log2FoldChange', y = 'pvalue')
p
# NOTE: the above plot shows labels for the top significantly DE (i.e. lowest adjusted p value) genes.

# Output as publication quality (300dpi) tiff and pdf.
# Create a (300dpi) tiff
ggsave(file = paste0("./results_outliers_included/volcano_", paste(degroups, collapse = "_Vs_"), ".tiff"), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p)

# Create a pdf
ggsave(file = paste0("./results_outliers_included/volcano_", paste(degroups, collapse = "_Vs_"), ".pdf"), device = "pdf", width = 10, height = 8, plot = p)

#### 7c. DE genes heatmaps and dendrograms ####
# Sort by p-value
DE_genes <- DE_genes[order(DE_genes$padj), ]
row.names(DE_genes) <- DE_genes$Row.names

# Pull out normalised counts only
siggc <- DE_genes[colnames(DE_genes) %in% colnames(expdata)]

# Scale and center each row. This is important to visualise relative differences between groups and
# not have row-wise colouration dominated by high or low gene expression.
xts <- scale(t(siggc))
xtst <- t(xts)

# Define annotation column
annot_columns <- data.frame(meta$group[meta$group %in% degroups])

# Make the row names the sample IDs
row.names(annot_columns) <- meta$sample_ID[meta$group %in% degroups]
colnames(annot_columns) <- "Treatment groups"

# Need to factorise it
annot_columns[[1]] <- factor(annot_columns[[1]])

# Generate dendrogram and heatmap for ALL DE genes
pheatmap(xtst,
        color=colorRampPalette(c("#D55E00", "white", "#0072B2"))(100),
        annotation_col=annot_columns, annotation_names_col = F,
        fontsize_col = 12, fontsize_row = 7,
        labels_row = row.names(siggc), show_rownames = T,
        filename = paste0("./results_outliers_included/All_DEG_Heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff"))

#### OUTLIER REMOVAL ####
# This section repeats the above, but removes outliers first.
# REMOVE OUTLIERS FROM METADATA TABLE AND COUNT TABLE
# ("WT4" below is an example sample ID; replace it with the ID of the outlier sample you want to remove)
meta <- meta[- grep(c("WT4"), meta$sample_ID),]
counts <- counts[- grep(c("WT4"), colnames(counts))]

#### 8. Outliers and batch effects ####
# This section normalises and transforms the count data so that it can be plotted on a PCA plot and a heatmap

## USER INPUT
# Choose the groups you want to plot in a PCA/heatmap. You can select any 2 or more of the groups
# (or all of the groups) you have in the 'group' column of your metadata table.
# To see what groups are present, run the following:
unique(meta$group)

# Now add which groups you want to plot (i.e. replace the group names below, and add more, separated by a comma
# and in "quotes", as needed). NOTE: R is case-sensitive, so these group names must be spelled EXACTLY as in the metadata table.
plotgroups <- c("normal", "Huntingtons_disease")

# Pull out only the counts from the above groups
groupcounts <- counts[meta$group %in% plotgroups]

# Normalise counts by library size, using DESeq2's estimateSizeFactors() function, as in section 6.
# Set up the initial DESeq2 experimental parameters
condition <- factor(1:length(groupcounts))

# Set up the column data: a data frame of sample IDs and conditions
coldata <- data.frame(row.names=colnames(groupcounts), condition)

# Set up the DESeq2 data set structure
f <- DESeqDataSetFromMatrix(countData = groupcounts, colData = coldata, design= ~ condition)

# Estimate the size factors. See the DESeq2 manual for details
f <- estimateSizeFactors(f)

# Size factors can be viewed by:
sizeFactors(f)

# Multiply each row (sample) by the corresponding size factor
subcount_norm <- as.matrix(groupcounts) %*% diag(sizeFactors(f))

# Re-add column names
colnames(subcount_norm) <- colnames(groupcounts)

## Remove low coverage transcripts (mean count < 10) ##
# Find the mean of each row (and output as a data frame object)
means <- as.data.frame(rowMeans(subcount_norm))

# Then join the means data with the counts
means <- cbind(means, subcount_norm)

# Then subset out only genes with mean > 10
data <- subset(means, means[ , 1] > 10)

# Remove the means column
data <- data[,-1]

# Transform data
data_log <- vst(round(as.matrix(data)), nsub = nrow(data)-20)

# Transformation can create some infinite values. Can't generate PCA data on these. Can see how many by:
sum(sapply(data_log, is.infinite))

# To remove infinite rows, use 'is.finite' or '!is.infinite'
data_log <- data_log[is.finite(rowSums(data_log)),]
colnames(data_log) <- colnames(groupcounts)

### Set up the PCA plot base data ###
# We're using the FactoMineR package to generate PCA plots (http://factominer.free.fr/index.html)
# Need to transpose the data first
data_log_t <- t(data_log)

# Add the group data
data_log_t_vars <- data.frame(meta$group[meta$group %in% plotgroups], data_log_t)

# Generate the PCA data using the FactoMineR package
res.pca <- PCA(data_log_t_vars, quali.sup = 1, graph=FALSE)

## Set up the dendrogram/heatmap base data ##
# Calculate the distance matrix:
distance_matrix <- as.matrix(dist(t(data_log)))

#### 8a. PCA plot ####
# Generate the PCA plot. Groups are shaded with ellipses at the 95% confidence level.
# NOTE: at least 4 replicates need to be in a group for an ellipse to be drawn.
# NOTE: change the group point colours by changing 'palette = ' below, using the 'RColorBrewer' colour names
# (https://r-graph-gallery.com/38-rcolorbrewers-palettes.html).
p <- fviz_pca_ind(res.pca,
        geom.ind = c("point", "text"), # show points and labels
        col.ind = meta$group[meta$group %in% plotgroups], # colour by groups
        pointsize = 5, label = "all", title = "",
        legend.title = "Treatment groups", palette = "Dark2",
        addEllipses = TRUE, ellipse.type = "t", ellipse.level = 0.95) +
     theme(legend.text = element_text(size = 12), legend.title = element_text(size = 14),
        axis.title=element_text(size=16), axis.text=element_text(size=14))
p

# Output as publication quality (300dpi) tiff and pdf.
# This will name your output files with the treatment groups you selected.
# Create a 'results_outliers_removed' subdirectory where all results will be output
dir.create("results_outliers_removed", showWarnings = FALSE)

# Create a (300dpi) tiff
ggsave(file = paste0("./results_outliers_removed/PCA_", paste(plotgroups, collapse = "_Vs_"), ".tiff"), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p)

# Create a pdf
ggsave(file = paste0("./results_outliers_removed/PCA_", paste(plotgroups, collapse = "_Vs_"), ".pdf"), device = "pdf", width = 10, height = 8, plot = p)

#### 8b. Samples heatmap and dendrogram ####
# This section plots a heatmap and dendrogram of pairwise relationships between samples,
# so you can see whether samples cluster by treatment group.
# See here: https://davetang.org/muse/2018/05/15/making-a-heatmap-in-r-with-the-pheatmap-package/

# Define annotation column
annot_columns <- data.frame(meta$group[meta$group %in% plotgroups])

# Make the row names the sample IDs
row.names(annot_columns) <- meta$sample_ID[meta$group %in% plotgroups]
colnames(annot_columns) <- "Treatment groups"

# Need to factorise it
annot_columns[[1]] <- factor(annot_columns[[1]])

# Generate dendrogram and heatmap
pheatmap(distance_matrix,
        color=colorRampPalette(c("white", "#9999FF", "#990000"))(50),
        cluster_rows = TRUE, show_rownames = TRUE,
        treeheight_row = 0, treeheight_col = 70, fontsize_col = 12,
        annotation_names_col = F, annotation_col = annot_columns,
        filename = paste0("./results_outliers_removed/Pairwise_sample_heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff"))

# Notes about heatmap colours:
# You can change the colours used in the heatmap itself by changing the colour names (color=colorRampPalette....)
# If you want to change the annotation colours, see here: https://zhiganglu.com/post/pheatmap_change_annotation_colors/

#### 9. Differential expression analysis ####
# In this section we use the DESeq2 package to identify differentially expressed genes.

## USER INPUT
# Choose the treatment groups you want to compare.
# To see what groups are present, run the following:
unique(meta$group)

# Enter which groups you want to compare (two groups only). BASELINE OR CONTROL GROUP SHOULD BE LISTED FIRST.
degroups <- c("normal", "Huntingtons_disease")

# From the count table, pull out only the counts from the above groups
expdata <- as.matrix(counts[,meta$group %in% degroups])

# Set up the experimental condition.
# 'factor' sets up the reference level, i.e. which is the baseline group (otherwise the default baseline level is in alphabetic order)
condition <- factor(meta$group[meta$group %in% degroups], levels = degroups)
# Type 'condition' in the console to see if the levels are set correctly

# Set up column data (treatment groups and sample ID)
coldata <- data.frame(row.names=colnames(expdata), condition)

# Create the DESeq2 dataset (dds)
dds <- DESeq2::DESeqDataSetFromMatrix(countData=expdata, colData=coldata, design=~condition)
dds$condition <- factor(dds$condition, levels = degroups)

# Run DESeq2 to identify differentially expressed genes
deseq <- DESeq(dds)

# Extract a results table from the DESeq analysis
res <- results(deseq)

# Reorder results by adjusted p value, so that the most significantly DE genes are at the top
res <- res[order(res$padj), ]

# You can do a summary of the results to see how many significantly (alpha=0.05, adjust to 0.01 if needed)
# upregulated and downregulated DE genes were found
summary(res, alpha=0.05)

# Convert from DESeq object to a data frame
res <- data.frame(res)

# Look at the top 6 DE genes
head(res)

# Add normalised counts to the output table, so you can later plot expression trends for individual genes in R, Excel, etc.
# Need to normalise the counts first, using the size factors calculated by DESeq2 (in the 'deseq' object)
expdata_norm <- as.matrix(expdata) %*% diag(deseq$sizeFactor)
colnames(expdata_norm) <- colnames(expdata)
annot_counts <- merge(x = res, y = expdata_norm, by = 0, all = TRUE)

# Pull out just the significant genes (change from 0.05 to 0.01 if needed)
DE_genes <- subset(annot_counts, padj < 0.05, select=colnames(annot_counts))

# Export as a csv table
write.csv(DE_genes, file=paste0("./results_outliers_removed/DE_genes_", paste(degroups, collapse = "_Vs_"), ".csv"), row.names = FALSE)

#### 9b. Volcano plot ####
p <- EnhancedVolcano(res, lab = row.names(res), selectLab = row.names(res)[1:20], drawConnectors = TRUE, title = NULL, subtitle = NULL, x = 'log2FoldChange', y = 'pvalue')
p <- EnhancedVolcano(res, lab = rownames(res), pointSize = 3, drawConnectors = TRUE, title = NULL, subtitle = NULL, x = 'log2FoldChange', y = 'pvalue')
p
# NOTE: the above plot shows labels for the top significantly DE (i.e. lowest adjusted p value) genes.

# Output as publication quality (300dpi) tiff and pdf.
# Create a (300dpi) tiff
ggsave(file = paste0("./results_outliers_removed/volcano_", paste(degroups, collapse = "_Vs_"), ".tiff"), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p)

# Create a pdf
ggsave(file = paste0("./results_outliers_removed/volcano_", paste(degroups, collapse = "_Vs_"), ".pdf"), device = "pdf", width = 10, height = 8, plot = p)

#### 9c. DE genes heatmaps and dendrograms ####
# Sort by p-value
DE_genes <- DE_genes[order(DE_genes$padj), ]
row.names(DE_genes) <- DE_genes$Row.names

# Pull out normalised counts only
siggc <- DE_genes[colnames(DE_genes) %in% colnames(expdata)]

# Scale and center each row. This is important to visualise relative differences between groups and
# not have row-wise colouration dominated by high or low gene expression.
xts <- scale(t(siggc))
xtst <- t(xts)

# Define annotation column
annot_columns <- data.frame(meta$group[meta$group %in% degroups])

# Make the row names the sample IDs
row.names(annot_columns) <- meta$sample_ID[meta$group %in% degroups]
colnames(annot_columns) <- "Treatment groups"

# Need to factorise it
annot_columns[[1]] <- factor(annot_columns[[1]])

# Generate dendrogram and heatmap for ALL DE genes
pheatmap(xtst,
        color=colorRampPalette(c("#D55E00", "white", "#0072B2"))(100),
        annotation_col=annot_columns, annotation_names_col = F,
        fontsize_col = 12, fontsize_row = 7,
        labels_row = row.names(siggc), show_rownames = T,
        filename = paste0("./results_outliers_removed/All_DEG_Heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff"))

Running R Scripts on the HPC

If all your data is on the HPC, or your analysis is too large or takes too long on your desktop/laptop, it is possible to run the R scripts on the HPC.

 

Preparing your R script for the HPC

QUT’s HPC is based on Linux, so the paths to your files are likely different on the HPC; we must update them to the corresponding HPC paths.

Using RStudio, you can adjust the paths in your script. In the DESeq2.R script, there are a number of places that need to be changed for it to work on the HPC:

# Change the setwd() line (near the top of section 5) to:
setwd("~/workshop/small_RNAseq/DESeq2")

# Change the metacounts import line to:
metacounts <- read.table("/work/training/smallRNAseq/runs/run2_smallRNAseq_human/results/edger/mature_counts.txt", header = TRUE, row.names = 1)

# Change the metadata import line to:
meta <- read.table("/work/training/smallRNAseq/runs/run2_smallRNAseq_human/results/edger/metadata_microRNA.txt", header = TRUE)

The H: and W: drives do not exist on the HPC; the folders are there, just under different paths (H: corresponds to your HPC home directory and W: to /work).
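Before submitting the job, you can confirm from a PuTTY session that the input files are visible under the Linux paths used above:

ls -l /work/training/smallRNAseq/runs/run2_smallRNAseq_human/results/edger/mature_counts.txt
ls -l ~/workshop/small_RNAseq/DESeq2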

 

Preparing a Script to run the R script on the HPC

A job script needs to be built to request resources and run the script. This one works well for the DESeq2.R script:

#!/bin/bash -l
#PBS -N R_analysis
#PBS -l select=1:ncpus=1:mem=4gb
#PBS -l walltime=00:10:00
#PBS -m abe

module load r/4.2.1-foss-2022a
export R_LIBS_USER='/work/training/smallRNAseq/r_library'

cd $PBS_O_WORKDIR

Rscript DESeq2.R

Using R Studio, create a Text File and paste in the contents of this script.

Save it as launch_R.pbs in H:\workshop\small_RNAseq\DESeq2 (the same folder as DESeq2.R). Remember, H: points to your HPC home folder.

 

Running the Script on the HPC

Now that the script is on the HPC, we can run it, but we have to convert it first. RStudio on Windows saves the text file as a “Windows” format file. The HPC has trouble reading this format, so we convert it to a “Linux” format file. Once the file has been converted, we can submit the script to the scheduler and wait for it to run.

# Convert launch_R.pbs to Linux format. Once this has been run, you do not need to run it again unless you edit the file in RStudio again
dos2unix launch_R.pbs

# Submit the job to the HPC
qsub launch_R.pbs

# Check the status of the job
qjobs

Installing R packages on the HPC (Not Needed Today)

Just like RStudio on a Windows computer, before you can run your R script you need to install the packages your script needs. We have done this for you for this training session, but to install your own packages you can follow a procedure like this:

bioconductor_packages <- c("clusterProfiler", "pathview", "AnnotationHub", "org.Hs.eg.db")
cran_packages <- c("tidyverse", "ggplot2", "plyr", "readxl", "scales")

# Compares installed packages to the above packages and returns a vector of missing packages
new_packages <- bioconductor_packages[!(bioconductor_packages %in% installed.packages()[,"Package"])]
new_cran_packages <- cran_packages[!(cran_packages %in% installed.packages()[,"Package"])]

# Install missing bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(new_packages)

# Install missing cran packages
if (length(new_cran_packages)) install.packages(new_cran_packages, repos = "http://cran.us.r-project.org")

# Update all installed packages to the latest version
update.packages(bioconductor_packages, ask = FALSE)
update.packages(cran_packages, ask = FALSE, repos = "http://cran.us.r-project.org")

Save this as install.R

You can then run this on the HPC like before with this submission script:

#!/bin/bash -l
#PBS -N R_install
#PBS -l select=1:ncpus=1:mem=4gb
#PBS -l walltime=2:00:00
#PBS -m abe

module load r/4.2.1-foss-2022a

cd $PBS_O_WORKDIR

Rscript install.R

Save this as install.pbs

Run this on the HPC:

# Submit the job to the HPC
qsub install.pbs

# Check the status of the job
qjobs