/
Copy of Copy of Hands-on smRNAseq training

Copy of Copy of Hands-on smRNAseq training

Public small RNA-seq data

Public human small RNAseq data:

https://www.ebi.ac.uk/ena/browser/view/PRJNA861019 Integrative analysis of renal microRNA and mRNA to identify hub genes and pivotal pathways associated with Cyclosporine-induced acute kidney injury in mice

Work in the HPC

Work in the HPC

Before we start using the HPC, let’s start an interactive session:

qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=1:mem=4gb

Get a copy of the scripts to be used in this module

Use the terminal to log into the HPC and create a /RNAseq/ folder to run the nf-core/rnaseq pipeline. For example:

mkdir -p $HOME/workshop/small_RNAseq/scripts cp /work/training/small_rnaseq/scripts/* $HOME/workshop/small_RNAseq/scripts/ ls -l $HOME/workshop/small_RNAseq/scripts/
  • Line 1: The -p indicates create 'parental directories as required. Thus the line 1 command creates both /workshop/ and the subfolder /workshop/scripts/

  • Line 2: Copies all files from /work/datasets/workshop/scripts/ as noted by an asterisk to the newly created folder $HOME/workshop/scripts/

Copy public data to your $HOME

mkdir -p $HOME/workshop/small_RNAseq/data cp /work/training/smallRNAseq/data/* $HOME/workshop/small_RNAseq/data/ # list the content of the $HOME/workshop/small_RNAseq/data/
  • Line 1: The first command creates the folder /scripts/

  • Line 2: Copies all files from /work/datasets/workshop/scripts/ folder as noted by an asterisk to newly created $HOME/workshop/scripts/ folder

  • Line 3: a quick challenge - see the previous section for hints

Create a folder for running the nf-core small RNA-seq pipeline

Let’s create a “runs” folder to run the nf-core/rnaseq pipeline:

mkdir -p $HOME/workshop/small_RNAseq mkdir $HOME/workshop/small_RNAseq/run1_test mkdir $HOME/workshop/small_RNAseq/run2_smallRNAseq cd $HOME/workshop/
  • Lines 1-4: create sub-folders for each exercise

  • Line 5: change the directory to the folder “run1_test”

  • Line 6: print the current working directory

Exercise 1: Running a test with nf-core sample data

First, let’s assess the execution of the nf-core/rnaseq pipeline by running a test using sample data.

Copy the launch_nf-core_RNAseq_test.pbs to the working directory

cd $HOME/workshop/RNAseq/run1_test cp $HOME/workshop/scripts/launch_nf-core_RNAseq_test.pbs .

View the content of the script as follows:

cat launch_nf-core_RNAseq_test.pbs

Download Reference microRNA sequences from miRBase

First, let’s download a copy of miRBAse reference sequences, including hairpin and mature microRNA sequences.

microRNA mature sequences:

wget https://mirbase.org/download/mature.fa

Hairpin sequences:

wget https://mirbase.org/download/hairpin.fa

Fetch the genomic coordinated for precursors and mature sequences:

wget https://mirbase.org/download/hsa.gff3

Alternatively, submit the following PBS Pro script to the cluster. Before running the script, create a ‘reference’ folder (i.e., /myteam/data/reference/ ).

#!/bin/bash -l #PBS -N nfsmrnaseq #PBS -l select=1:ncpus=2:mem=4gb #PBS -l walltime=2:00:00 cd $PBS_O_WORKDIR wget https://www.mirbase.org/download/hairpin.fa wget https://www.mirbase.org/download/mature.fa wget https://www.mirbase.org/download/hsa.gff3

Run a test

Before running the pipeline with real data, run the following test:

nextflow run nf-core/smrnaseq -profile test,singularity --outdir results -r 2.1.0

To submit the above command to the HPC cluster, prepare the following script:

#!/bin/bash -l #PBS -N nfsmrnaseq #PBS -l select=1:ncpus=2:mem=4gb #PBS -l walltime=24:00:00 #work on current directory (folder) cd $PBS_O_WORKDIR #load java and set up memory settings to run nextflow module load java NXF_OPTS='-Xms1g -Xmx4g' nextflow run nf-core/smrnaseq -profile test,singularity --outdir results -r 2.1.0

Submitting the job

Once you have created the samplesheet.csv file and have a copy of the launch_nf-core_smallRNAseq_test.pbs script, submit the job to the HPC as follows:

qsub launch.pbs

Monitoring the Run

Use the command

qjobs

to check on the job that you are running. Note, that Nextflow will launch additional jobs during the run.

You can also check the .nextflow.log file for details on what is going on.

Preparing a sample metadata file

Now let’s prepare a samplesheet.csv file that specifies the name of your samples and the location of the raw FASTQ files

sample,fastq_1 SRR20753704,/work/training/smallRNAseq/data/SRR20753704.fastq.gz SRR20753705,/work/training/smallRNAseq/data/SRR20753705.fastq.gz SRR20753706,/work/training/smallRNAseq/data/SRR20753706.fastq.gz SRR20753707,/work/training/smallRNAseq/data/SRR20753707.fastq.gz SRR20753708,/work/training/smallRNAseq/data/SRR20753708.fastq.gz SRR20753709,/work/training/smallRNAseq/data/SRR20753709.fastq.gz SRR20753716,/work/training/smallRNAseq/data/SRR20753716.fastq.gz SRR20753717,/work/training/smallRNAseq/data/SRR20753717.fastq.gz SRR20753718,/work/training/smallRNAseq/data/SRR20753718.fastq.gz SRR20753719,/work/training/smallRNAseq/data/SRR20753719.fastq.gz SRR20753720,/work/training/smallRNAseq/data/SRR20753720.fastq.gz SRR20753721,/work/training/smallRNAseq/data/SRR20753721.fastq.gz

To generate the above file, let’s use the following shell script (i.e., called “create_nf-core_smallRNAseq_samplesheet.sh”)

#!/bin/bash -l #User defined variables. ########################################################## #DIR='/path/to/the/FASTQ/files' DIR=$1 INDEX='samplesheet.csv' ########################################################## #load python module module load python/3.10.8-gcccore-12.2.0 #fetch the script to create the sample metadata table wget -L https://raw.githubusercontent.com/nf-core/rnaseq/master/bin/fastq_dir_to_samplesheet.py chmod +x fastq_dir_to_samplesheet.py #generate initial sample metadata file ./fastq_dir_to_samplesheet.py $DIR index.csv \ --strandedness auto \ --read1_extension .fastq.gz #format index file cat index.csv | awk -F "," '{print $1 "," $2}' > ${INDEX} #Remove intermediate files: rm index.csv fastq_dir_to_samplesheet.py

Assign to the “DIR” variable above the path where the raw FASTQ files are located. For example:

pwd

Copy and paste the path to the above script using VI or VIM (check prerequisites above).

Run the nextflow nf-core/smRNAseq pipeline.

Create a launch_nfsmRNAseq.pbs file that has the following information:

#!/bin/bash -l #PBS -N nfsmrnaseq #PBS -l select=1:ncpus=2:mem=4gb #PBS -l walltime=24:00:00 cd $PBS_O_WORKDIR module load java NXF_OPTS='-Xms1g -Xmx4g' nextflow run nf-core/smrnaseq -r 2.1.0 \ -profile singularity \ --outdir outdir \ --input samplesheet.csv \ --genome GRCh38 \ --three_prime_adapter 'AACTGTAGGCACCATCAAT'\ --fastp_min_length 18 \ --fastp_max_length 30 \ --hairpin /work/trtp/data/mirbase/hairpin.fa \ --mature /work/trtp/data/mirbase/mature.fa \ --mirna_gtf /work/trtp/data/mirbase/hsa.gff3

Submit the job to the HPC cluster:

qsub launch.pbs

Monitor the progress:

qjobs

 

R differential expression script

 

#### 3. Loading required packages #### # This section needs to be run every time # Load packages bioconductor_packages <- c("DESeq2", "EnhancedVolcano") cran_packages <- c("ggrepel", "ggplot2", "plyr", "reshape2", "FactoMineR", "factoextra", "pheatmap") lapply(cran_packages, require, character.only = TRUE) lapply(bioconductor_packages, require, character.only = TRUE) #### 4. Import your count data #### # Set working directory. # Change this to your working directory # Your home wd setwd("C:/Users/whatmorp/OneDrive - Queensland University of Technology/Desktop/Teaching/HPC_training_workshops/smRNA_seq") # HPC wd #setwd("/work/training/smallRNAseq/") # Import your count data. make sure you've created a 'data' subdirectory and put the count table file there. metacounts <- read.table("./data/mature_counts.txt", header = TRUE, row.names = 1) # Import metadata. Again, need a metadata_microRNA.txt file in the data subdirectory. meta <- read.table("./data/metadata_microRNA.txt", header = TRUE) # Rename sample names to new sample IDs counts <- metacounts[as.character(meta$sample_name)] colnames(counts) <- meta$sample_ID #### 5. Outliers and batch effects #### # This section normalises and transforms the count data so that it can be plotted on a PCA plot and a heatmap ## USER INPUT # Choose the groups you want to plot in a PCA/Heatmap. You can select any 2 or more of the groups (or all of the groups) you have in your 'groups' column of your metadata table # To see what groups are present, run the following: unique(meta$group) # Now add which groups you want to plot (i.e. replace the groupnames below, and add more, separated by a comma and in "quotes", as needed). NOTE: R is case-sensitive, so these group names must be named EXACTLY the same as in the metadata table. plotgroups <- c("normal", "Huntingtons_disease") # Pull out only the counts from the above groups groupcounts <- counts[meta$group %in% plotgroups] # Normalise counts by library size, using DeSeq2's estimateSizeFactors() function. Note that DeSeq2 does this internally during DEG calling. The normalisation below is done separately for PCA and density plotting. # Set up the initial DeSeq2 experimental parameters. condition <- factor(1:length(groupcounts)) # Set up the column data. A data frame of sample ID's and conditions coldata <- data.frame(row.names=colnames(groupcounts), condition) # Set up the DeSeq2 data set structure f <- DESeqDataSetFromMatrix(countData = groupcounts, colData = coldata, design= ~ condition) # Estimate the size factors. See DeSeq2 manual for details f <- estimateSizeFactors(f) # Size factors can be viewed by: sizeFactors(f) # Multiply each row (sample) by the corresponding size factor subcount_norm <- as.matrix(groupcounts) %*% diag(sizeFactors(f)) # Re-add column names colnames(subcount_norm) <- colnames(groupcounts) ## Remove low coverage transcripts (mean count < 10) ## # Find the mean of each row (and output as a data frame object) means <- as.data.frame(rowMeans(subcount_norm)) # Then join the means data with the counts means <- cbind(means, subcount_norm) # Then subset out only genes with mean > 10 data <- subset(means, means[ , 1] > 10) # Remove the means column data <- data[,-1] # Transform data data_log <- vst(round(as.matrix(data)), nsub = nrow(data)-20) # Transformation can create some infinite values. Can't generate PCA data on these. Can see how many by: sum(sapply(data_log, is.infinite)) # To remove infinite rows, use 'is.finte' or '!is.infinite' data_log <- data_log[is.finite(rowSums(data_log)),] colnames(data_log) <- colnames(groupcounts) ### Set up the PCA plot base data ### # We're using the FactoMineR package to generate PCA plots (http://factominer.free.fr/index.html) # Need to transpose the data first data_log_t <- t(data_log) # Add the group data data_log_t_vars <- data.frame(meta$group[meta$group %in% plotgroups], data_log_t) # Generate the PCA data using FactoMineR package res.pca <- PCA(data_log_t_vars, quali.sup = 1, graph=FALSE) ## Set up the dendogram/heatmaps base data ## # Calculate the distance matrix: distance_matrix <- as.matrix(dist(t(data_log))) #### 5a. PCA plot #### # Generate the PCA plot. Groups are shaded with ellipses at 95% confidence level. NOTE: at least 4 replicates need to be in a group for an ellipses to be drawn. # NOTE: change the group point colours by changing 'palette = ' below. Use the 'RColourBrewer' colour names (https://r-graph-gallery.com/38-rcolorbrewers-palettes.html). For example, if you are plotting 3 groups and choose palette = "Set1", this will use the first 3 colours from the Set1 colour palette. p <- fviz_pca_ind(res.pca, geom.ind = c("point", "text"), # show points only (but not "text") col.ind = meta$group[meta$group %in% plotgroups], # color by groups pointsize = 5, label = "all", title = "", legend.title = "Treatment groups", palette = "Dark2", addEllipses = TRUE, ellipse.type = "t", ellipse.level = 0.95) + theme(legend.text = element_text(size = 12), legend.title = element_text(size = 14), axis.title=element_text(size=16), axis.text=element_text(size=14)) p # Output as publication quality (300dpi) tiff and pdf. # This will name your output files with the treatment groups you selected. # Create a 'results_outliers_included' subdirectory where all results_outliers_included will be output dir.create("results_outliers_included", showWarnings = FALSE) # Create a (300dpi) tiff ggsave(file = paste0("./results_outliers_included/PCA_", paste(plotgroups, collapse = "_Vs_"), ".tiff"), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p) # Create a pdf ggsave(file = paste0("./results_outliers_included/PCA_", paste(plotgroups, collapse = "_Vs_"), ".pdf"), device = "pdf", width = 10, height = 8, plot = p) #### 5b. Samples heatmap and dendrogram #### # This section plots a heatmap and dendrogram of pairwise relationships between samples. In this way you can see if samples cluster by treatment group. # See here: https://davetang.org/muse/2018/05/15/making-a-heatmap-in-r-with-the-pheatmap-package/ # Define annotation column annot_columns <- data.frame(meta$group[meta$group %in% plotgroups]) # Make the row names the sample IDs row.names(annot_columns) <- meta$sample_ID[meta$group %in% plotgroups] colnames(annot_columns) <- "Treatment groups" # Need to factorise it annot_columns[[1]] <- factor(annot_columns[[1]]) # Generate dendrogram and heatmap pheatmap(distance_matrix, color=colorRampPalette(c("white", "#9999FF", "#990000"))(50), cluster_rows = TRUE, show_rownames = TRUE, treeheight_row = 0, treeheight_col = 70, fontsize_col = 12, annotation_names_col = F, annotation_col = annot_columns, filename = paste0("./results_outliers_included/Pairwise_sample_heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff")) # Notes about heatmap colours. # You can change the colours used in the heatmap itself by changing the colour names (color=colorRampPalette....) # If you want to change the annotation colours, see here: https://zhiganglu.com/post/pheatmap_change_annotation_colors/ #### 6. Differential expression analysis #### # In this section we use the Deseq2 package to identify differentially expressed genes. ## USER INPUT # Choose the treatment groups you want to compare. # To see what groups are present, run the following: unique(meta$group) # Enter which groups you want to compare (two groups only). BASELINE OR CONTROL GROUP SHOULD BE LISTED FIRST. degroups <- c("normal", "Huntingtons_disease") # From the count table, pull out only the counts from the above groups expdata <- as.matrix(counts[,meta$group %in% degroups]) # Set up the experimental condition # 'factor' sets up the reference level, i.e. which is the baseline group (otherwise the default baseline level is in alphabetic order) condition <- factor(meta$group[meta$group %in% degroups], levels = degroups) # Type 'condition' in the console to see is the levels are set correctly # Set up column data (treatment groups and sample ID) coldata <- data.frame(row.names=colnames(expdata), condition) # Create the DESeq2 dataset (dds) dds <- DESeq2::DESeqDataSetFromMatrix(countData=expdata, colData=coldata, design=~condition) dds$condition <- factor(dds$condition, levels = degroups) # Run DESeq2 to identify differentially expressed genes deseq <- DESeq(dds) # Extract a results table from the DESeq analysis res <- results(deseq) # Reorder results by adjusted p vales, so that the most signififcantly DE genes are at the top res <- res[order(res$padj), ] # You can do a summary of the results to see how many significantly (alpha=0.05, adjust to 0.01 if needed) upregulated and downregulated DE genes were found summary(res, alpha=0.05) # Convert from DESeq object to a data frame. res <- data.frame(res) # Look at the top 6 DE genes head(res) # Add normalised counts to the output table. This is so you can later plot expression trends for individual genes in R, Excel, etc. # Need to normalise the counts first, using the size factors calculated by DESeq2 (in the 'deseq' object) expdata_norm <- as.matrix(expdata) %*% diag(deseq$sizeFactor) colnames(expdata_norm) <- colnames(expdata) annot_counts <- merge(x = res, y = expdata_norm, by = 0, all = TRUE) # Pull out just significant genes (change from 0.05 to 0.01 if needed) DE_genes <- subset(annot_counts, padj < 0.05, select=colnames(annot_counts)) # Export as a csv table write.csv(DE_genes, file=paste0("./results_outliers_included/DE_genes_", paste(degroups, collapse = "_Vs_"), ".csv"), row.names = FALSE) #### 6b. Volcano plot #### p <- EnhancedVolcano(res, lab = row.names(res), selectLab = row.names(res)[1:20], drawConnectors = TRUE, title = NULL, subtitle = NULL, x = 'log2FoldChange', y = 'pvalue') p <- EnhancedVolcano(res, lab = rownames(res), pointSize = 3, drawConnectors = TRUE, title = NULL, subtitle = NULL, x = 'log2FoldChange', y = 'pvalue') p # NOTE: the above plot shows labels for the top significantly DE (i.e. by lowest adjusted p value) genes. # Output as publication quality (300dpi) tiff and pdf. # Create a (300dpi) tiff ggsave(file = paste0("./results_outliers_included/volcano_", paste(degroups, collapse = "_Vs_"), ".tiff"), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p) # Create a pdf ggsave(file = paste0("./results_outliers_included/volcano_", paste(degroups, collapse = "_Vs_"), ".pdf"), device = "pdf", width = 10, height = 8, plot = p) #### 6c. DE genes heatmaps and dendrograms #### # sort by p-value DE_genes <- DE_genes[order(DE_genes$padj), ] row.names(DE_genes) <- DE_genes$Row.names # Pull out normalised counts only siggc <- DE_genes[colnames(DE_genes) %in% colnames(expdata)] # Scale and center each row. This is important to visualise relative differences between groups and not have row-wise colouration dominated by high or low gene expression. xts <- scale(t(siggc)) xtst <- t(xts) # Define annotation column annot_columns <- data.frame(meta$group[meta$group %in% degroups]) # Make the row names the sample IDs row.names(annot_columns) <- meta$sample_ID[meta$group %in% degroups] colnames(annot_columns) <- "Treatment groups" # Need to factorise it annot_columns[[1]] <- factor(annot_columns[[1]]) # Generate dendrogram and heatmap for ALL DE genes pheatmap(xtst, color=colorRampPalette(c("#D55E00", "white", "#0072B2"))(100), annotation_col=annot_columns, annotation_names_col = F, fontsize_col = 12, fontsize_row = 7, labels_row = row.names(siggc), show_rownames = T, filename = paste0("./results_outliers_included/All_DEG_Heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff")) #### OUTLIER REMOVAL #### # This section repeats the above, but removes outliers first # REMOVE OUTLIERS FROM METADATA TABLE AN COUNT TABLE. meta <- meta[- grep(c("WT4"), meta$sample_ID),] counts <- counts[- grep(c("WT4"), colnames(counts))] #### 5. Outliers and batch effects #### # This section normalises and transforms the count data so that it can be plotted on a PCA plot and a heatmap ## USER INPUT # Choose the groups you want to plot in a PCA/Heatmap. You can select any 2 or more of the groups (or all of the groups) you have in your 'groups' column of your metadata table # To see what groups are present, run the following: unique(meta$group) # Now add which groups you want to plot (i.e. replace the groupnames below, and add more, separated by a comma and in "quotes", as needed). NOTE: R is case-sensitive, so these group names must be named EXACTLY the same as in the metadata table. plotgroups <- c("normal", "Huntingtons_disease") # Pull out only the counts from the above groups groupcounts <- counts[meta$group %in% plotgroups] # Normalise counts by library size, using DeSeq2's estimateSizeFactors() function. Note that DeSeq2 does this internally during DEG calling. The normalisation below is done separately for PCA and density plotting. # Set up the initial DeSeq2 experimental parameters. condition <- factor(1:length(groupcounts)) # Set up the column data. A data frame of sample ID's and conditions coldata <- data.frame(row.names=colnames(groupcounts), condition) # Set up the DeSeq2 data set structure f <- DESeqDataSetFromMatrix(countData = groupcounts, colData = coldata, design= ~ condition) # Estimate the size factors. See DeSeq2 manual for details f <- estimateSizeFactors(f) # Size factors can be viewed by: sizeFactors(f) # Multiply each row (sample) by the corresponding size factor subcount_norm <- as.matrix(groupcounts) %*% diag(sizeFactors(f)) # Re-add column names colnames(subcount_norm) <- colnames(groupcounts) ## Remove low coverage transcripts (mean count < 10) ## # Find the mean of each row (and output as a data frame object) means <- as.data.frame(rowMeans(subcount_norm)) # Then join the means data with the counts means <- cbind(means, subcount_norm) # Then subset out only genes with mean > 10 data <- subset(means, means[ , 1] > 10) # Remove the means column data <- data[,-1] # Transform data data_log <- vst(round(as.matrix(data)), nsub = nrow(data)-20) # Transformation can create some infinite values. Can't generate PCA data on these. Can see how many by: sum(sapply(data_log, is.infinite)) # To remove infinite rows, use 'is.finte' or '!is.infinite' data_log <- data_log[is.finite(rowSums(data_log)),] colnames(data_log) <- colnames(groupcounts) ### Set up the PCA plot base data ### # We're using the FactoMineR package to generate PCA plots (http://factominer.free.fr/index.html) # Need to transpose the data first data_log_t <- t(data_log) # Add the group data data_log_t_vars <- data.frame(meta$group[meta$group %in% plotgroups], data_log_t) # Generate the PCA data using FactoMineR package res.pca <- PCA(data_log_t_vars, quali.sup = 1, graph=FALSE) ## Set up the dendogram/heatmaps base data ## # Calculate the distance matrix: distance_matrix <- as.matrix(dist(t(data_log))) #### 5a. PCA plot #### # Generate the PCA plot. Groups are shaded with ellipses at 95% confidence level. NOTE: at least 4 replicates need to be in a group for an ellipses to be drawn. # NOTE: change the group point colours by changing 'palette = ' below. Use the 'RColourBrewer' colour names (https://r-graph-gallery.com/38-rcolorbrewers-palettes.html). For example, if you are plotting 3 groups and choose palette = "Set1", this will use the first 3 colours from the Set1 colour palette. p <- fviz_pca_ind(res.pca, geom.ind = c("point", "text"), # show points only (but not "text") col.ind = meta$group[meta$group %in% plotgroups], # color by groups pointsize = 5, label = "all", title = "", legend.title = "Treatment groups", palette = "Dark2", addEllipses = TRUE, ellipse.type = "t", ellipse.level = 0.95) + theme(legend.text = element_text(size = 12), legend.title = element_text(size = 14), axis.title=element_text(size=16), axis.text=element_text(size=14)) p # Output as publication quality (300dpi) tiff and pdf. # This will name your output files with the treatment groups you selected. # Create a 'results_outliers_removed' subdirectory where all results_outliers_removed will be output dir.create("results_outliers_removed", showWarnings = FALSE) # Create a (300dpi) tiff ggsave(file = paste0("./results_outliers_removed/PCA_", paste(plotgroups, collapse = "_Vs_"), ".tiff"), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p) # Create a pdf ggsave(file = paste0("./results_outliers_removed/PCA_", paste(plotgroups, collapse = "_Vs_"), ".pdf"), device = "pdf", width = 10, height = 8, plot = p) #### 5b. Samples heatmap and dendrogram #### # This section plots a heatmap and dendrogram of pairwise relationships between samples. In this way you can see if samples cluster by treatment group. # See here: https://davetang.org/muse/2018/05/15/making-a-heatmap-in-r-with-the-pheatmap-package/ # Define annotation column annot_columns <- data.frame(meta$group[meta$group %in% plotgroups]) # Make the row names the sample IDs row.names(annot_columns) <- meta$sample_ID[meta$group %in% plotgroups] colnames(annot_columns) <- "Treatment groups" # Need to factorise it annot_columns[[1]] <- factor(annot_columns[[1]]) # Generate dendrogram and heatmap pheatmap(distance_matrix, color=colorRampPalette(c("white", "#9999FF", "#990000"))(50), cluster_rows = TRUE, show_rownames = TRUE, treeheight_row = 0, treeheight_col = 70, fontsize_col = 12, annotation_names_col = F, annotation_col = annot_columns, filename = paste0("./results_outliers_removed/Pairwise_sample_heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff")) # Notes about heatmap colours. # You can change the colours used in the heatmap itself by changing the colour names (color=colorRampPalette....) # If you want to change the annotation colours, see here: https://zhiganglu.com/post/pheatmap_change_annotation_colors/ #### 6. Differential expression analysis #### # In this section we use the Deseq2 package to identify differentially expressed genes. ## USER INPUT # Choose the treatment groups you want to compare. # To see what groups are present, run the following: unique(meta$group) # Enter which groups you want to compare (two groups only). BASELINE OR CONTROL GROUP SHOULD BE LISTED FIRST. degroups <- c("normal", "Huntingtons_disease") # From the count table, pull out only the counts from the above groups expdata <- as.matrix(counts[,meta$group %in% degroups]) # Set up the experimental condition # 'factor' sets up the reference level, i.e. which is the baseline group (otherwise the default baseline level is in alphabetic order) condition <- factor(meta$group[meta$group %in% degroups], levels = degroups) # Type 'condition' in the console to see is the levels are set correctly # Set up column data (treatment groups and sample ID) coldata <- data.frame(row.names=colnames(expdata), condition) # Create the DESeq2 dataset (dds) dds <- DESeq2::DESeqDataSetFromMatrix(countData=expdata, colData=coldata, design=~condition) dds$condition <- factor(dds$condition, levels = degroups) # Run DESeq2 to identify differentially expressed genes deseq <- DESeq(dds) # Extract a results table from the DESeq analysis res <- results(deseq) # Reorder results by adjusted p vales, so that the most signififcantly DE genes are at the top res <- res[order(res$padj), ] # You can do a summary of the results to see how many significantly (alpha=0.05, adjust to 0.01 if needed) upregulated and downregulated DE genes were found summary(res, alpha=0.05) # Convert from DESeq object to a data frame. res <- data.frame(res) # Look at the top 6 DE genes head(res) # Add normalised counts to the output table. This is so you can later plot expression trends for individual genes in R, Excel, etc. # Need to normalise the counts first, using the size factors calculated by DESeq2 (in the 'deseq' object) expdata_norm <- as.matrix(expdata) %*% diag(deseq$sizeFactor) colnames(expdata_norm) <- colnames(expdata) annot_counts <- merge(x = res, y = expdata_norm, by = 0, all = TRUE) # Pull out just significant genes (change from 0.05 to 0.01 if needed) DE_genes <- subset(annot_counts, padj < 0.05, select=colnames(annot_counts)) # Export as a csv table write.csv(DE_genes, file=paste0("./results_outliers_removed/DE_genes_", paste(degroups, collapse = "_Vs_"), ".csv"), row.names = FALSE) #### 6b. Volcano plot #### p <- EnhancedVolcano(res, lab = row.names(res), selectLab = row.names(res)[1:20], drawConnectors = TRUE, title = NULL, subtitle = NULL, x = 'log2FoldChange', y = 'pvalue') p <- EnhancedVolcano(res, lab = rownames(res), pointSize = 3, drawConnectors = TRUE, title = NULL, subtitle = NULL, x = 'log2FoldChange', y = 'pvalue') p # NOTE: the above plot shows labels for the top significantly DE (i.e. by lowest adjusted p value) genes. # Output as publication quality (300dpi) tiff and pdf. # Create a (300dpi) tiff ggsave(file = paste0("./results_outliers_removed/volcano_", paste(degroups, collapse = "_Vs_"), ".tiff"), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p) # Create a pdf ggsave(file = paste0("./results_outliers_removed/volcano_", paste(degroups, collapse = "_Vs_"), ".pdf"), device = "pdf", width = 10, height = 8, plot = p) #### 6c. DE genes heatmaps and dendrograms #### # sort by p-value DE_genes <- DE_genes[order(DE_genes$padj), ] row.names(DE_genes) <- DE_genes$Row.names # Pull out normalised counts only siggc <- DE_genes[colnames(DE_genes) %in% colnames(expdata)] # Scale and center each row. This is important to visualise relative differences between groups and not have row-wise colouration dominated by high or low gene expression. xts <- scale(t(siggc)) xtst <- t(xts) # Define annotation column annot_columns <- data.frame(meta$group[meta$group %in% degroups]) # Make the row names the sample IDs row.names(annot_columns) <- meta$sample_ID[meta$group %in% degroups] colnames(annot_columns) <- "Treatment groups" # Need to factorise it annot_columns[[1]] <- factor(annot_columns[[1]]) # Generate dendrogram and heatmap for ALL DE genes pheatmap(xtst, color=colorRampPalette(c("#D55E00", "white", "#0072B2"))(100), annotation_col=annot_columns, annotation_names_col = F, fontsize_col = 12, fontsize_row = 7, labels_row = row.names(siggc), show_rownames = T, filename = paste0("./results_outliers_removed/All_DEG_Heatmap_", paste(plotgroups, collapse = "_Vs_"), ".tiff"))

Related content