2024-2: 5a.4 Identifying differentially expressed (DE) genes
Identify differentially expressed (DE) genes
Now we come to the main analysis section of this workflow, where we will identify differentially expressed genes. This will generate two important datapoints per gene, the log fold change, which shows the change in expression levels between two treatment groups for a specific gene, and the adjusted p value, which shows which genes are significantly differentially expressed (adjusted to remove false positives).
The R package we use to find DE genes is DESeq2, “Differential gene expression analysis based on the negative binomial distribution”.
DESeq2 estimates variance-mean dependence in count data from high-throughput sequencing assays and tests for differential expression based on a model using the negative binomial distribution.
Copy and run the below code into your R script.
You’ll need to choose two treatment groups you want to compare (degroups <-
), from the list of treatment groups in your samples table.
#### 6. Differential expression analysis ####
# In this section we use the Deseq2 package to identify differentially expressed genes.
## USER INPUT
# Choose the treatment groups you want to compare.
# To see what groups are present, run the following:
unique(meta$group)
# Enter which groups you want to compare (two groups only). BASELINE OR CONTROL GROUP SHOULD BE LISTED FIRST.
#degroups <- c("Basal_cells", "Murine_tracheal_epithelial_cell")
degroups <- c("Basal_cells", "Differentiated_cells")
# From the count table, pull out only the counts from the above groups
expdata <- as.matrix(counts[,meta$group %in% degroups])
# Set up the experimental condition
# 'factor' sets up the reference level, i.e. which is the baseline group (otherwise the default baseline level is in alphabetic order)
condition <- factor(meta$group[meta$group %in% degroups], levels = degroups)
# Type 'condition' in the console to see is the levels are set correctly
# Set up column data (treatment groups and sample ID)
coldata <- data.frame(row.names=colnames(expdata), condition)
# Create the DESeq2 dataset (dds)
dds <- DESeq2::DESeqDataSetFromMatrix(countData=expdata, colData=coldata, design=~condition)
dds$condition <- factor(dds$condition, levels = degroups)
# Run DESeq2 to identify differentially expressed genes
deseq <- DESeq(dds)
# Extract a results table from the DESeq analysis
res <- results(deseq)
# Reorder results by adjusted p vales, so that the most signififcantly DE genes are at the top
res <- res[order(res$padj), ]
# You can do a summary of the results to see how many significantly (alpha=0.05, adjust to 0.01 if needed) upregulated and downregulated DE genes were found
summary(res, alpha=0.05)
# Convert from DESeq object to a data frame.
res <- data.frame(res)
# Look at the top 6 DE genes
head(res)
4a. Annotating your DE genes
The table of DE genes will have gene names according to the reference genome that was used to generate the count table. This is usually an Entrez gene ID if the NCBI genome was used, or an Ensemble gene ID if the Ensemble genome was used.
These Entrez/Ensembl gene IDs are a string of numbers that aren’t particularly informative. They can be individually looked at to see what genes they represent, or we can annotate them all at the same time in R to match the Entrez/Ensembl gene IDs to their more commonly known gene symbol and description.
We'll be using bioconductor genome wide annotation packages to provide the annotation data. https://bioconductor.org/packages/3.17/data/annotation/
Annotation packages for human (org.Hs.eg.db), mouse (org.Mm.eg.db), rat (org.Rn.eg.db), E. coli strain K12 (org.EcK12.eg.db), E. coli strain Sakai (org.EcSakai.eg.db), zebrafish (org.Dr.eg.db) and Drosophila (org.Dm.eg.db) were installed with the required R packages. If your species is not in the above list, contact the eResearch team.
This workshop uses mouse data, so we will be using the org.Mm.eg.db annotation package here. Copy and paste this section of code into R.
#### 6a. Annotating your DE genes ####
# Annotation packages for human (org.Hs.eg.db), mouse (org.Mm.eg.db), rat (org.Rn.eg.db), E. coli strain K12 (org.EcK12.eg.db), E. coli strain Sakai (org.EcSakai.eg.db), zebrafish (org.Dr.eg.db) and Drosophila (org.Dm.eg.db) were installed with the required R packages. If your species is not in the above list, contact the eResearch team.
## USER INPUT
# Input your species genome below (select from the above list)
my_genome <- org.Mm.eg.db
# Pull out just the Entrez/Ensembl gene IDs
gene_ids <- row.names(res)
# You can see the list of annotations you can apply to your data by:
keytypes(my_genome)
# Annotate gene symbol and description to your DE gene IDs (you can add other keytypes from the above 'keytypes(my_genome)' list, if you choose)
cols <- c("SYMBOL", "GENENAME")
## USER INPUT
# Provide the gene ID type for your DE data
# If you have Ensemble IDs, enter "ENSEMBL", if you have Entrez IDs, enter "ENTREZID", if you have gene symbols, enter "SYMBOL"
idtype <- "ENSEMBL"
# Map the Entrez/Ensembl gene IDs to gene symbol and description
map <- AnnotationDbi::select(my_genome, keys=gene_ids, columns=cols, keytype=idtype)
# Combine the annotation data with the DE data
# Since we're matching Ensembl -> Entrez there aren't a 1:1 mapping, so need to merge rather than cbind
annot <- merge(x = res, y = map, by.x = 0, by.y = idtype, all = F)
# There isn't always a 1:1 mapping between gene identifiers, so we also need to remove duplicates
annot_nodups <- annot[!duplicated(annot$Row.names), ]
# Reorder by adjusted p
annot_nodups <- annot_nodups[order(annot_nodups$padj), ]
# Add normalised counts to the output table. This is so you can later plot expression trends for individual genes in R, Excel, etc.
# Need to normalise the counts first, using the size factors calculated by DESeq2 (in the 'deseq' object)
expdata_norm <- as.matrix(expdata) %*% diag(deseq$sizeFactor)
colnames(expdata_norm) <- colnames(expdata)
annot_counts <- merge(x = annot_nodups, y = expdata_norm, by.x = "Row.names", by.y = 0, all = TRUE)
# Export the full annotated dataset as a csv (for journal submission)
# Creates a 'Tables' subdirectory where all tables will be output
dir.create("Tables", showWarnings = FALSE)
write.csv(annot_counts, file=paste0("./Tables/all_genes_", paste(degroups, collapse = "_Vs_"), ".csv"), row.names = FALSE)
# Pull out just significant genes (change from 0.05 to 0.01 if needed)
DE_genes <- subset(annot_counts, padj < 0.05, select=colnames(annot_counts))
# NOTE: add lfc cutoffs if needed. E.g., log2FoldChange > 1 and < log2FoldChange -1 cutoff
# DE_genes <- subset(DE_genes, log2FoldChange > 1 | log2FoldChange < -1, select=colnames(DE_genes))
# Export the sig DE genes as a table (to be used in downstream analysis)
write.csv(DE_genes, file=paste0("./Tables/DE_genes_", paste(degroups, collapse = "_Vs_"), ".csv"), row.names = FALSE)
4b. Volcano plot
Now that we have our list of DE genes, we can visualise them with a volcano plot and heat map.
First, we’ll generate a volcano plot by copying and pasting this section of code into R.
We’re using the EnhancedVolcano package ‘Publication-ready volcano plots with enhanced colouring and labeling’
The below plot only labels the top 20 DE genes, using mostly plot defaults. You can modify this code to label whatever genes you like, change plot colours, change axis labels, change log fold change cutoffs, etc, by following the EnhancedVolcano guide here:https://bioconductor.org/packages/release/bioc/html/EnhancedVolcano.html
#### 6b. Volcano plot ####
p <- EnhancedVolcano(annot_nodups, lab = annot_nodups$SYMBOL, selectLab = annot_nodups$SYMBOL[1:20], drawConnectors = TRUE, title = NULL, subtitle = NULL, x = 'log2FoldChange', y = 'pvalue')
p
# NOTE: the above plot shows labels for the top significantly DE (i.e. by lowest adjusted p value) genes.
# Output as publication quality (300dpi) tiff and pdf.
# Create a (300dpi) tiff
ggsave(file = paste0("./figures/volcano_", paste(degroups, collapse = "_Vs_"), ".tiff"), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p)
# Create a pdf
ggsave(file = paste0("./figures/volcano_", paste(degroups, collapse = "_Vs_"), ".pdf"), device = "pdf", width = 10, height = 8, plot = p)
# Create a png
ggsave(file = paste0("./figures/volcano_", paste(degroups, collapse = "_Vs_"), ".png"), device = "png", width = 10, height = 8, plot = p)
4c. DE genes heatmap
This section plots a heatmap and dendrogram of DE gene expression per sample. Expression counts are scaled and centered so that groupwise relationships can be examined. Copy and paste this section of code into R.