1. What is functional annotation?
Many types of genetic analysis will output a set of genes that are associated with a specific experimental condition. The classic example of this is RNA-Seq, which outputs a set of genes that are differentially expressed between experimental conditions. But micro RNA, epigenetics (e.g. differential methylation), variant calling and various other analysis types can also generate a set of condition-based genes.
Functional annotation uses a set of genes (such as differentially expressed genes) to examine enrichment of these genes in Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Gene Ontology (GO) terms.
KEGG
.. is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from genomic and molecular-level information. It is a computer representation of the biological system, consisting of molecular building blocks of genes and proteins (genomic information) and chemical substances (chemical information) that are integrated with the knowledge on molecular wiring diagrams of interaction, reaction and relation networks (systems information). It also contains disease and drug information (health information) as perturbations to the biological system.
GO
.. provides a computational representation of our current scientific knowledge about the functions of genes (or, more properly, the protein and non-coding RNA molecules produced by genes) from many different organisms, from humans to bacteria. It is widely used to support scientific research, and has been cited in tens of thousands of publications.
Understanding gene function—how individual genes contribute to the biology of an organism at the molecular, cellular and organism levels—is one of the primary aims of biomedical research. Moreover, experimental knowledge obtained in one organism is often applicable to other organisms, particularly if the organisms share the relevant genes because they inherited them from their common ancestor.
Associations of gene products to GO terms are statements that describe
Molecular Function: the molecular activities of individual gene products
Cellular Component: where the gene products are active
Biological Process: the pathways and larger processes to which that gene product’s activity contributes
2. R Packages
We’ll be using two main R packages:
Functional enrichment for KEGG pathways and GO terms was completed using the package https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html
You can read more about clusterProfiler’s statistical and analysis methods here: https://yulab-smu.top/biomedical-knowledge-mining-book/index.html
Annotated KEGG pathway maps are generated using the package https://www.bioconductor.org/packages/release/bioc/html/pathview.html
3. Connect to an rVDI virtual desktop machine
As with the previous differential expression analyses we did in sessions 3 and 4, we will also be running this analysis in RStudio on an rVDI virtual machine. The reason is the same as before - to save time as the required R packages are pre-installed on these virtual machines. And, as before, you can copy and paste this script to RStudio on your local computer and adapt it to your own dataset.
To access and run an rVDI virtual desktop:
Go to https://rvdi.qut.edu.au/
Click on ‘VMware Horizon HTML Access’
Log on with your QUT username and password
*NOTE: you need to be connected to the QUT network first, either being on campus or connecting remotely via VPN.
4. Preparing your data
We’ll be using the RNA-Seq differential expression results you generated in session 3.
These will be in a file called ‘DE_genes_Basal_cells_Vs_Differentiated_cells.csv’ in the ‘Table' output folder from session 3. This file contains a list of differentially expressed genes that you generated using DESeq2. This list of DE genes will be used as input for functional annotation.
a. In windows explorer, go to: H:\workshop\RNAseq
b. In this folder, create a new folder called ‘functional_annotation’ (case-sensitive)
c. Open RStudio and create a new R script (‘File’ → “New File” → “R script”). Now hit ‘File’ → ‘Save’ and save the script in the H:\workshop\RNAseq\functional_annotation folder you created. Save the script file as ‘functional_annotation.R’
In the following sections you will be copying and running the R code into your functional_annotation.R script.
5. Installing packages
IMPORTANT: don’t run this code in this rVDI session, the packages are already installed. This code is only here if you need to run the analysis on your computer.
#### 5. Installing required packages ####
bioconductor_packages <- c("clusterProfiler", "pathview", "AnnotationHub", "org.Mm.eg.db") cran_packages <- c("tidyverse", "ggplot2", "plyr", "readxl", "scales") # Compares installed packages to above packages and returns a vector of missing packages new_packages <- bioconductor_packages[!(bioconductor_packages %in% installed.packages()[,"Package"])] new_cran_packages <- cran_packages[!(cran_packages %in% installed.packages()[,"Package"])] # Install missing bioconductor packages if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(new_packages) # Install missing cran packages if (length(new_cran_packages)) install.packages(new_cran_packages, repos = http://cran.us.r-project.org) # Update all installed packages to the latest version update.packages(bioconductor_packages, ask = FALSE) update.packages(cran_packages, ask = FALSE, repos = http://cran.us.r-project.org) |
---|
6. Loading packages
Copy and paste (then run) this code into your R script (same with the code in all following sections as well).
#### 6. Loading required packages #### # This section needs to be run every time # Load packages bioconductor_packages <- c("clusterProfiler", "pathview", "AnnotationHub", "org.Mm.eg.db") cran_packages <- c("tidyverse", "ggplot2", "plyr", "readxl", "scales") lapply(cran_packages, require, character.only = TRUE) lapply(bioconductor_packages, require, character.only = TRUE)
7. Gene ID conversion
KEGG pathways are based on Entrez gene IDs, NCBIs numeric gene identifiers. Therefore, to run this workflow you need to convert your input gene IDs (in our case gene symbols) into Entrez IDs
NOTE: You need to have the correct path for your DE_genes_Basal_cells_Vs_Differentiated_cells.csv
file, that you generated in session 3. It should be in H:/workshop/RNAseq/Tables
for most people. If not, change this to the correct path containing your DE genes table.
#### 7. Convert to Entrez gene IDs #### # Set your working directory setwd("H:/workshop/RNAseq/functional_annotation") # Import your DE genes data: # NOTE: THE DIRECTORY THAT CONTAINS YOUR RESULTS DATA MAY VARY. YOU NEED TO LOOK IN 'H:/workshop/RNAseq' TO SEE WHERE YOUR 'Tables' SUBDIRECTORY IS, AND CHANGE THE BLOW PATH TO REFLECT THAT. dat <- read.csv("H:/workshop/RNAseq/*/Tables/DE_genes_Basal_cells_Vs_Differentiated_cells.csv", row.names = 1) # Convert the gene symbol column to Entrez IDs # NOTE: 'OrgDb=' needs to be your organism database. We're using mouse here, so we're using 'OrgDb=org.Mm.eg.db' if your species was human you'd use 'OrgDb=org.Hs.eg.db', etc gene_list <- bitr(gene = dat$SYMBOL, fromType="SYMBOL", toType="ENTREZID", OrgDb=org.Mm.eg.db, drop=TRUE)
8. KEGG pathway enrichment
Now we can run the clusterProfiler function enrichKEGG
to match your set of Entrez IDs to the KEGG pathways for your species.
NOTE: our data is from mouse, so we’re using the KEGG species ID for mouse (organism = "mmu"
). The list of KEGG species IDs is here (e.g. if your data is human, you need to change this to organism = "hsu"
) :
https://www.genome.jp/kegg/catalog/org_list.html
#### 8. KEGG pathway enrichment ### # Use clusterProfiler's enrichKEGG() function to match your genes list to KEGG pathways kk <- enrichKEGG(gene = gene_list$ENTREZID, organism = "mmu", pvalueCutoff = 0.05, qvalueCutoff = 0.2) # Now you can save this as a table of enriched KEGG pathways # Create a 'results' subdirectory where all figures will be output dir.create("results", showWarnings = FALSE) write.csv(as.data.frame(kk), "./results/Enriched_KEGG_pathways.csv")
9. Plotting enriched KEGG pathways
Here we’ll plot the enriched KEGG pathways. We’ll generate 3 plots (bar plot, dot plot, network plot) to visualise the same results in 3 different ways. This gives you options to present the data how you choose.
All of these plotting functions are part of the clusterProfiler package and can be modified to change the number of pathways shown and also change a wide array of other plot characteristics, such as plot colours, labels, legend, etc. More details can be seen here: https://yulab-smu.top/biomedical-knowledge-mining-book/index.html
#### 9. Plotting KEGG results #### # Remove some text (' - Mus musculus (house mouse)') from each pathway description, for brevity kk@result$Description <- gsub(" - Mus musculus (house mouse)", "", kk@result$Description, fixed = T) # Create a barplot p <- barplot(kk, showCategory = 14, font.size = 14) p # Export as a 300dpi tiff tiff_exp <- "./results/KEGG_bar.tiff" ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") # Export as a pdf pdf_exp <- "./results/KEGG_bar.pdf" ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Dot plot p <- dotplot(kk, showCategory = 8, font.size = 14) p # Export as a 300dpi tiff tiff_exp <- "./results/KEGG_dot.tiff" ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") # Export as a pdf pdf_exp <- "./results/KEGG_dot.pdf" ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Network plot p <- cnetplot(kk, showCategory = 8, colorEdge = F, node_label = "category") p # Export as a 300dpi tiff tiff_exp <- "./results/KEGG_network.tiff" ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") # Export as a pdf pdf_exp <- "./results/KEGG_network.pdf" ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")
10. KEGG pathway maps
The package pathview can pull down KEGG pathway maps from the KEGG database and annotate these with your input genes. It shades the genes by colour and intensity (e.g. using log fold change data, upregulated genes can be coloured green and downregulated coloured red).
NOTE: this only plots one of your enriched pathways at a time. You’ll need to enter one of these KEGG pathways IDs into into the keggpath <- ..
line. You can see your set of enriched pathways in the H:\workshop\RNAseq\functional_annotation\results\Enriched_KEGG_pathways.csv or type as.data.frame(kk)$ID
into the RStudio console.
You can change the KEGG pathway to a different pathway and re-run this for as many of your pathways as you like.
#### 10. KEGG pathway maps #### # pathview requires a named vector as input, so we first will pull out your numeric data (e.g. log fold change data) and name this using your gene IDs. lfc <- dat$log2FoldChange names(lfc) <- dat$SYMBOL # Select the pathway you want to annotate from one of the significantly enriched pathways # To view these: as.data.frame(kk)$ID # Enter one of these pathways below keggpath <- "mmu04360" # Run the pathview function to generate the pathway map pview <- pathview(gene.data = lfc, pathway.id = keggpath, species = "mmu", gene.idtype = "SYMBOL", low = list(gene = "red"), mid = list(gene = "gray"), high = list(gene = "green"), out.suffix = "KEGG_pathway_map", limit = list(gene=max(abs(lfc)), cpd=1))
11. GO term enrichment
GO terms are functional groupings of genes. We can look at enrichment of GO terms in a similar way as we did with KEGG pathways.
NOTE: GO terms are hierarchical, necessitating that we initially provide one of three top level terms in ont <- ".."
: these three terms are Biological Processes (ont <- "BP"
) Molecular Functions (ont <- "MF"
) and Cellular Components (ont <- "CC"
). We should run the two following code blocks (enrichment and plotting) three times, providing each ont <- ".."
.
#### 11. GO term enrichment #### # Fist select which of the top level terms you want to examine. You can run the analysis for each top-level term by changing this to another term then re-running this code cell and then the following GO code cells. The three terms are "BP", "MF", and "CC". ont <- "BP" # Then run the enrichGO function gg <- enrichGO(gene = as.character(gene_list$ENTREZID), OrgDb = org.Mm.eg.db, ont = ont, pvalueCutoff = 0.05, qvalueCutoff = 0.2) # Now you can save this as a table of enriched GO terms write.csv(as.data.frame(gg), paste0("./results/Enriched_GO_terms_", ont, ".csv"))
12. Plotting enriched GO terms
As with the KEGG pathways, we can plot a bar plot, dot plot and network plot.
In addition, we’re adding a fourth plot type here: a directed acyclic graph.
As previously mentioned, GO terms are hierarchical, with individual terms being linked to parent or child levels. The previous plots combine all levels under each of three top level terms. A directed acyclic graph allows a visualisation of how the enriched GO terms fit into the GO hierarchy.
#### 12. Plotting GO results #### # Bar plot p <- barplot(gg, showCategory = 14, font.size = 14) p # Output as tiff tiff_exp <- paste0("./results/", ont, "_GO_bar.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") # Output as pdf pdf_exp <- paste0("./results/", ont, "_GO_bar.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Dot plot p <- dotplot(gg, showCategory = 8, font.size = 14) p # Output as tiff tiff_exp <- paste0("./results/", ont, "_GO_dot.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") # Output as pdf pdf_exp <- paste0("./results/", ont, "_GO_dot.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Network plot p <- cnetplot(gg, showCategory = 8, colorEdge = TRUE, node_label = "category") p # Output as tiff tiff_exp <- paste0("./results/", ont, "_GO_network.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") # Output as pdf pdf_exp <- paste0("./results/", ont, "_GO_network.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm") # Directed acyclic graph p <- goplot(gg, showCategory = 30) p tiff_exp <- paste0("./results/", ont, "_GO_DAG.tiff") ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm") # Output as pdf pdf_exp <- paste0("./results/", ont, "_GO_DAG.pdf") ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")