2024-2: 5b-Introduction - Functional Annotation (FA)
What is functional annotation?
Many types of genetic analysis output a set of genes associated with a specific experimental condition. The classic example is RNA-Seq, which yields the genes that are differentially expressed between experimental conditions. MicroRNA profiling, epigenetic analysis (e.g. differential methylation), variant calling, and various other analysis types can also generate a condition-specific gene set.
Functional annotation takes such a gene set (for example, differentially expressed genes) and tests whether those genes are enriched in particular Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Gene Ontology (GO) terms.
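As a rough illustration of what "enrichment" means here, the sketch below runs a one-sided hypergeometric test in base R for a single pathway. All counts are made-up numbers chosen for illustration, not results from this workshop. Enrichment tools such as clusterProfiler apply a comparable over-representation test across many pathways and then correct for multiple testing, so this is only a simplified picture of the idea.

```r
# Hypothetical counts for one pathway (illustrative only, not workshop data).
universe_size <- 20000  # all genes tested in the experiment
pathway_size  <- 100    # genes in the universe annotated to the pathway
de_genes      <- 500    # differentially expressed (DE) genes
de_in_pathway <- 10     # DE genes that are also in the pathway

# Overlap expected by chance if the DE genes were a random draw.
expected_overlap <- de_genes * pathway_size / universe_size

# P(overlap >= de_in_pathway) under the hypergeometric distribution.
# phyper(q, m, n, k): m = pathway genes, n = non-pathway genes, k = genes drawn.
p_value <- phyper(de_in_pathway - 1,
                  pathway_size,
                  universe_size - pathway_size,
                  de_genes,
                  lower.tail = FALSE)

cat("Expected overlap:", expected_overlap, "\n")
cat("Observed overlap:", de_in_pathway, "\n")
cat("Enrichment p-value:", signif(p_value, 3), "\n")
```

An observed overlap well above the expected one gives a small p-value. Because many pathways and terms are tested at once, p-values from a real analysis need to be adjusted for multiple testing before they are interpreted.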
KEGG
.. is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from genomic and molecular-level information. It is a computer representation of the biological system, consisting of molecular building blocks of genes and proteins (genomic information) and chemical substances (chemical information) that are integrated with the knowledge on molecular wiring diagrams of interaction, reaction and relation networks (systems information). It also contains disease and drug information (health information) as perturbations to the biological system.
GO
.. provides a computational representation of our current scientific knowledge about the functions of genes (or, more properly, the protein and non-coding RNA molecules produced by genes) from many different organisms, from humans to bacteria. It is widely used to support scientific research, and has been cited in tens of thousands of publications.
Understanding gene function—how individual genes contribute to the biology of an organism at the molecular, cellular and organism levels—is one of the primary aims of biomedical research. Moreover, experimental knowledge obtained in one organism is often applicable to other organisms, particularly if the organisms share the relevant genes because they inherited them from their common ancestor.
Associations of gene products to GO terms are statements that describe:
Molecular Function: the molecular activities of individual gene products
Cellular Component: where the gene products are active
Biological Process: the pathways and larger processes to which that gene product’s activity contributes
R Packages
We’ll be using two main R packages:
Functional enrichment for KEGG pathways and GO terms is performed with the clusterProfiler package: https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html
You can read more about clusterProfiler’s statistical and analysis methods here: https://yulab-smu.top/biomedical-knowledge-mining-book/index.html
Annotated KEGG pathway maps are generated with the pathview package: https://www.bioconductor.org/packages/release/bioc/html/pathview.html
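A minimal setup sketch for these two packages, assuming a local R installation with internet access. Both packages are distributed through Bioconductor, so installation goes through `BiocManager`. The variable names `required_pkgs` and `missing_pkgs` are just illustrative. On a machine where the packages are already installed, the installation branch is skipped and only the `library()` calls do any work.

```r
# Packages used in this section (both distributed through Bioconductor).
required_pkgs <- c("clusterProfiler", "pathview")

# Find which of them are not yet installed.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

# Install only what is missing; this step is needed once per R installation.
if (length(missing_pkgs) > 0) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(missing_pkgs)
}

# Load the packages; this step is needed in every new R session.
library(clusterProfiler)
library(pathview)
```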
Connect to an rVDI virtual desktop machine
As with the earlier differential expression analyses, we will run this analysis in RStudio on an rVDI virtual machine. The reason is the same as before: the required R packages are pre-installed on these virtual machines, which saves time. And, as before, you can copy and paste this script into RStudio on your local computer and adapt it to your own dataset.
Overview of the FA section: we will now perform the following tasks using RStudio
Preparing your data for functional annotation analysis. Only one data file is needed: the differentially expressed gene table produced earlier.
R packages
  Installing the required R packages. This only needs to be done once; afterwards the packages only need to be loaded. NOTE: on an rVDI virtual machine the R packages are already installed.
  Loading the required R packages. Unlike installation, this must be done every time you run the analysis.
KEGG pathway enrichment
  Gene ID conversion
  KEGG pathway enrichment
  Plotting enriched KEGG pathways
  KEGG pathway maps
GO term enrichment
  GO term enrichment
  Plotting enriched GO terms