Functional annotation

 

1. What is functional annotation?

Many types of genetic analysis will output a set of genes that are associated with a specific experimental condition. The classic example of this is RNA-Seq, which outputs a set of genes that are differentially expressed between experimental conditions. But micro RNA, epigenetics (e.g. differential methylation), variant calling and various other analysis types can also generate a set of condition-based genes.

Functional annotation uses a set of genes (such as differentially expressed genes) to examine enrichment of these genes in Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Gene Ontology (GO) terms.

KEGG

.. is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from genomic and molecular-level information. It is a computer representation of the biological system, consisting of molecular building blocks of genes and proteins (genomic information) and chemical substances (chemical information) that are integrated with the knowledge on molecular wiring diagrams of interaction, reaction and relation networks (systems information). It also contains disease and drug information (health information) as perturbations to the biological system.

GO

.. provides a computational representation of our current scientific knowledge about the functions of genes (or, more properly, the protein and non-coding RNA molecules produced by genes) from many different organisms, from humans to bacteria. It is widely used to support scientific research, and has been cited in tens of thousands of publications.

Understanding gene function—how individual genes contribute to the biology of an organism at the molecular, cellular and organism levels—is one of the primary aims of biomedical research. Moreover, experimental knowledge obtained in one organism is often applicable to other organisms, particularly if the organisms share the relevant genes because they inherited them from their common ancestor.

Associations of gene products to GO terms are statements that describe

Molecular Function: the molecular activities of individual gene products

Cellular Component: where the gene products are active

Biological Process: the pathways and larger processes to which that gene product’s activity contributes

 

2. R Packages

We’ll be using two main R packages:

Functional enrichment for KEGG pathways and GO terms was completed using the package https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html

You can read more about clusterProfiler’s statistical and analysis methods here: https://yulab-smu.top/biomedical-knowledge-mining-book/index.html

Annotated KEGG pathway maps are generated using the package https://www.bioconductor.org/packages/release/bioc/html/pathview.html

 

3. Connect to an rVDI virtual desktop machine

As with the previous differential expression analyses we did in sessions 3 and 4, we will also be running this analysis in RStudio on an rVDI virtual machine. The reason is the same as before - to save time as the required R packages are pre-installed on these virtual machines. And, as before, you can copy and paste this script to RStudio on your local computer and adapt it to your own dataset.

To access and run an rVDI virtual desktop:

Go to https://rvdi.qut.edu.au/

Click on ‘VMware Horizon HTML Access

Log on with your QUT username and password

*NOTE: you need to be connected to the QUT network first, either being on campus or connecting remotely via VPN.

 

4. Preparing your data

We’ll be using the RNA-Seq differential expression results you generated in session 3.

These will be in a file called ‘DE_genes_Basal_cells_Vs_Differentiated_cells.csv’ in the ‘Table' output folder from session 3. This file contains a list of differentially expressed genes that you generated using DESeq2. This list of DE genes will be used as input for functional annotation.

 

a. In windows explorer, go to: H:\workshop\RNAseq

 

b. In this folder, create a new folder called ‘functional_annotation’ (case-sensitive)

 

c. Open RStudio and create a new R script (‘File’ → “New File” → “R script”). Now hit ‘File’ → ‘Save’ and save the script in the H:\workshop\RNAseq\functional_annotation folder you created. Save the script file as ‘functional_annotation.R

 

In the following sections you will be copying and running the R code into your functional_annotation.R script.

 

5. Installing packages

IMPORTANT: don’t run this code in this rVDI session, the packages are already installed. This code is only here if you need to run the analysis on your computer.

#### 5. Installing required packages ####

 

bioconductor_packages <- c("clusterProfiler", "pathview", "AnnotationHub", "org.Mm.eg.db")

cran_packages <- c("tidyverse", "ggplot2", "plyr", "readxl", "scales")

# Compares installed packages to above packages and returns a vector of missing packages

new_packages <- bioconductor_packages[!(bioconductor_packages %in% installed.packages()[,"Package"])]

new_cran_packages <- cran_packages[!(cran_packages %in% installed.packages()[,"Package"])]

# Install missing bioconductor packages

if (!requireNamespace("BiocManager", quietly = TRUE))

  install.packages("BiocManager")

BiocManager::install(new_packages)

# Install missing cran packages

if (length(new_cran_packages)) install.packages(new_cran_packages, repos = http://cran.us.r-project.org)

# Update all installed packages to the latest version

update.packages(bioconductor_packages, ask = FALSE)

update.packages(cran_packages, ask = FALSE, repos = http://cran.us.r-project.org)

#### 5. Installing required packages ####

 

bioconductor_packages <- c("clusterProfiler", "pathview", "AnnotationHub", "org.Mm.eg.db")

cran_packages <- c("tidyverse", "ggplot2", "plyr", "readxl", "scales")

# Compares installed packages to above packages and returns a vector of missing packages

new_packages <- bioconductor_packages[!(bioconductor_packages %in% installed.packages()[,"Package"])]

new_cran_packages <- cran_packages[!(cran_packages %in% installed.packages()[,"Package"])]

# Install missing bioconductor packages

if (!requireNamespace("BiocManager", quietly = TRUE))

  install.packages("BiocManager")

BiocManager::install(new_packages)

# Install missing cran packages

if (length(new_cran_packages)) install.packages(new_cran_packages, repos = http://cran.us.r-project.org)

# Update all installed packages to the latest version

update.packages(bioconductor_packages, ask = FALSE)

update.packages(cran_packages, ask = FALSE, repos = http://cran.us.r-project.org)

6. Loading packages

Copy and paste (then run) this code into your R script (same with the code in all following sections as well).

#### 6. Loading required packages #### # This section needs to be run every time # Load packages bioconductor_packages <- c("clusterProfiler", "pathview", "AnnotationHub", "org.Mm.eg.db") cran_packages <- c("tidyverse", "ggplot2", "plyr", "readxl", "scales") lapply(cran_packages, require, character.only = TRUE) lapply(bioconductor_packages, require, character.only = TRUE)

 

7. Gene ID conversion

KEGG pathways are based on Entrez gene IDs, NCBIs numeric gene identifiers. Therefore, to run this workflow you need to convert your input gene IDs (in our case gene symbols) into Entrez IDs

NOTE: You need to have the correct path for your DE_genes_Basal_cells_Vs_Differentiated_cells.csv file, that you generated in session 3. It should be in H:/workshop/RNAseq/DE_analysis_workshop/Tables for most people. If not, change this to the correct path containing your DE genes table.

If you’ve deleted your DE_genes_Basal_cells_Vs_Differentiated_cells.csv file or didn’t attend section 3, you can download the file here , create the H:/workshop/RNAseq/DE_analysis_workshop/Tables folders and put the file in the Tables folder.

 

#### 7. Convert to Entrez gene IDs #### # Set your working directory setwd("H:/workshop/RNAseq/functional_annotation") # Import your DE genes data: # NOTE: THE DIRECTORY THAT CONTAINS YOUR RESULTS DATA MAY VARY. YOU NEED TO LOOK IN 'H:/workshop/RNAseq' TO SEE WHERE YOUR 'Tables' SUBDIRECTORY IS, AND CHANGE THE BELOW PATH TO REFLECT THAT. dat <- read.csv("H:/workshop/RNAseq/DE_analysis_workshop/Tables/DE_genes_Basal_cells_Vs_Differentiated_cells.csv", row.names = 1) # Convert the gene symbol column to Entrez IDs # NOTE: 'OrgDb=' needs to be your organism database. We're using mouse here, so we're using 'OrgDb=org.Mm.eg.db' if your species was human you'd use 'OrgDb=org.Hs.eg.db', etc gene_list <- bitr(gene = dat$SYMBOL, fromType="SYMBOL", toType="ENTREZID", OrgDb=org.Mm.eg.db, drop=TRUE)

 

8. KEGG pathway enrichment

Now we can run the clusterProfiler function enrichKEGG to match your set of Entrez IDs to the KEGG pathways for your species.

NOTE: our data is from mouse, so we’re using the KEGG species ID for mouse (organism = "mmu"). The list of KEGG species IDs is here (e.g. if your data is human, you need to change this to organism = "hsu") :

https://www.genome.jp/kegg/catalog/org_list.html

#### 8. KEGG pathway enrichment #### # Use clusterProfiler's enrichKEGG() function to match your genes list to KEGG pathways kk <- enrichKEGG(gene = gene_list$ENTREZID, organism = "mmu", pvalueCutoff = 0.05, qvalueCutoff = 0.2) # Now you can save this as a table of enriched KEGG pathways # Create a 'results' subdirectory where all figures will be output dir.create("results", showWarnings = FALSE) write.csv(as.data.frame(kk), "./results/Enriched_KEGG_pathways.csv")

 

9. Plotting enriched KEGG pathways

Here we’ll plot the enriched KEGG pathways. We’ll generate 3 plots (bar plot, dot plot, network plot) to visualise the same results in 3 different ways. This gives you options to present the data how you choose.

All of these plotting functions are part of the clusterProfiler package and can be modified to change the number of pathways shown and also change a wide array of other plot characteristics, such as plot colours, labels, legend, etc. More details can be seen here: https://yulab-smu.top/biomedical-knowledge-mining-book/index.html

 

 

10. KEGG pathway maps

The package pathview can pull down KEGG pathway maps from the KEGG database and annotate these with your input genes. It shades the genes by colour and intensity (e.g. using log fold change data, upregulated genes can be coloured green and downregulated coloured red).

NOTE: this only plots one of your enriched pathways at a time. You’ll need to enter one of these KEGG pathways IDs into into the keggpath <- .. line. You can see your set of enriched pathways in the H:\workshop\RNAseq\functional_annotation\results\Enriched_KEGG_pathways.csv or type as.data.frame(kk)$ID into the RStudio console.

You can change the KEGG pathway to a different pathway and re-run this for as many of your pathways as you like.

 

 

11. GO term enrichment

GO terms are functional groupings of genes. We can look at enrichment of GO terms in a similar way as we did with KEGG pathways.

NOTE: GO terms are hierarchical, necessitating that we initially provide one of three top level terms in ont <- "..": these three terms are Biological Processes (ont <- "BP") Molecular Functions (ont <- "MF") and Cellular Components (ont <- "CC"). We should run the two following code blocks (enrichment and plotting) three times, providing each ont <- "..".

 

 

12. Plotting enriched GO terms

As with the KEGG pathways, we can plot a bar plot, dot plot and network plot.

In addition, we’re adding a fourth plot type here: a directed acyclic graph.

As previously mentioned, GO terms are hierarchical, with individual terms being linked to parent or child levels. The previous plots combine all levels under each of three top level terms. A directed acyclic graph allows a visualisation of how the enriched GO terms fit into the GO hierarchy.