/
2024-2: 5a.4 Identifying differentially expressed (DE) genes

2024-2: 5a.4 Identifying differentially expressed (DE) genes

Identify differentially expressed (DE) genes

Now we come to the main analysis section of this workflow, where we will identify differentially expressed genes. This will generate two important datapoints per gene, the log fold change, which shows the change in expression levels between two treatment groups for a specific gene, and the adjusted p value, which shows which genes are significantly differentially expressed (adjusted to remove false positives).

The R package we use to find DE genes is DESeq2, “Differential gene expression analysis based on the negative binomial distribution”.

DESeq2 estimates variance-mean dependence in count data from high-throughput sequencing assays and tests for differential expression based on a model using the negative binomial distribution.

Copy and run the below code into your R script.

You’ll need to choose two treatment groups you want to compare (degroups <-), from the list of treatment groups in your samples table.

#### 6. Differential expression analysis #### # In this section we use the Deseq2 package to identify differentially expressed genes. ## USER INPUT # Choose the treatment groups you want to compare. # To see what groups are present, run the following: unique(meta$group) # Enter which groups you want to compare (two groups only). BASELINE OR CONTROL GROUP SHOULD BE LISTED FIRST. #degroups <- c("Basal_cells", "Murine_tracheal_epithelial_cell") degroups <- c("Basal_cells", "Differentiated_cells") # From the count table, pull out only the counts from the above groups expdata <- as.matrix(counts[,meta$group %in% degroups]) # Set up the experimental condition # 'factor' sets up the reference level, i.e. which is the baseline group (otherwise the default baseline level is in alphabetic order) condition <- factor(meta$group[meta$group %in% degroups], levels = degroups) # Type 'condition' in the console to see is the levels are set correctly # Set up column data (treatment groups and sample ID) coldata <- data.frame(row.names=colnames(expdata), condition) # Create the DESeq2 dataset (dds) dds <- DESeq2::DESeqDataSetFromMatrix(countData=expdata, colData=coldata, design=~condition) dds$condition <- factor(dds$condition, levels = degroups) # Run DESeq2 to identify differentially expressed genes deseq <- DESeq(dds) # Extract a results table from the DESeq analysis res <- results(deseq) # Reorder results by adjusted p vales, so that the most signififcantly DE genes are at the top res <- res[order(res$padj), ] # You can do a summary of the results to see how many significantly (alpha=0.05, adjust to 0.01 if needed) upregulated and downregulated DE genes were found summary(res, alpha=0.05) # Convert from DESeq object to a data frame. res <- data.frame(res) # Look at the top 6 DE genes head(res)

4a. Annotating your DE genes

The table of DE genes will have gene names according to the reference genome that was used to generate the count table. This is usually an Entrez gene ID if the NCBI genome was used, or an Ensemble gene ID if the Ensemble genome was used.

These Entrez/Ensembl gene IDs are a string of numbers that aren’t particularly informative. They can be individually looked at to see what genes they represent, or we can annotate them all at the same time in R to match the Entrez/Ensembl gene IDs to their more commonly known gene symbol and description.

We'll be using bioconductor genome wide annotation packages to provide the annotation data. https://bioconductor.org/packages/3.17/data/annotation/

Annotation packages for human (org.Hs.eg.db), mouse (org.Mm.eg.db), rat (org.Rn.eg.db), E. coli strain K12 (org.EcK12.eg.db), E. coli strain Sakai (org.EcSakai.eg.db), zebrafish (org.Dr.eg.db) and Drosophila (org.Dm.eg.db) were installed with the required R packages. If your species is not in the above list, contact the eResearch team.

This workshop uses mouse data, so we will be using the org.Mm.eg.db annotation package here. Copy and paste this section of code into R.

#### 6a. Annotating your DE genes #### # Annotation packages for human (org.Hs.eg.db), mouse (org.Mm.eg.db), rat (org.Rn.eg.db), E. coli strain K12 (org.EcK12.eg.db), E. coli strain Sakai (org.EcSakai.eg.db), zebrafish (org.Dr.eg.db) and Drosophila (org.Dm.eg.db) were installed with the required R packages. If your species is not in the above list, contact the eResearch team. ## USER INPUT # Input your species genome below (select from the above list) my_genome <- org.Mm.eg.db # Pull out just the Entrez/Ensembl gene IDs gene_ids <- row.names(res) # You can see the list of annotations you can apply to your data by: keytypes(my_genome) # Annotate gene symbol and description to your DE gene IDs (you can add other keytypes from the above 'keytypes(my_genome)' list, if you choose) cols <- c("SYMBOL", "GENENAME") ## USER INPUT # Provide the gene ID type for your DE data # If you have Ensemble IDs, enter "ENSEMBL", if you have Entrez IDs, enter "ENTREZID", if you have gene symbols, enter "SYMBOL" idtype <- "ENSEMBL" # Map the Entrez/Ensembl gene IDs to gene symbol and description map <- AnnotationDbi::select(my_genome, keys=gene_ids, columns=cols, keytype=idtype) # Combine the annotation data with the DE data # Since we're matching Ensembl -> Entrez there aren't a 1:1 mapping, so need to merge rather than cbind annot <- merge(x = res, y = map, by.x = 0, by.y = idtype, all = F) # There isn't always a 1:1 mapping between gene identifiers, so we also need to remove duplicates annot_nodups <- annot[!duplicated(annot$Row.names), ] # Reorder by adjusted p annot_nodups <- annot_nodups[order(annot_nodups$padj), ] # Add normalised counts to the output table. This is so you can later plot expression trends for individual genes in R, Excel, etc. # Need to normalise the counts first, using the size factors calculated by DESeq2 (in the 'deseq' object) expdata_norm <- as.matrix(expdata) %*% diag(deseq$sizeFactor) colnames(expdata_norm) <- colnames(expdata) annot_counts <- merge(x = annot_nodups, y = expdata_norm, by.x = "Row.names", by.y = 0, all = TRUE) # Export the full annotated dataset as a csv (for journal submission) # Creates a 'Tables' subdirectory where all tables will be output dir.create("Tables", showWarnings = FALSE) write.csv(annot_counts, file=paste0("./Tables/all_genes_", paste(degroups, collapse = "_Vs_"), ".csv"), row.names = FALSE) # Pull out just significant genes (change from 0.05 to 0.01 if needed) DE_genes <- subset(annot_counts, padj < 0.05, select=colnames(annot_counts)) # NOTE: add lfc cutoffs if needed. E.g., log2FoldChange > 1 and < log2FoldChange -1 cutoff # DE_genes <- subset(DE_genes, log2FoldChange > 1 | log2FoldChange < -1, select=colnames(DE_genes)) # Export the sig DE genes as a table (to be used in downstream analysis) write.csv(DE_genes, file=paste0("./Tables/DE_genes_", paste(degroups, collapse = "_Vs_"), ".csv"), row.names = FALSE)

4b. Volcano plot

Now that we have our list of DE genes, we can visualise them with a volcano plot and heat map.

First, we’ll generate a volcano plot by copying and pasting this section of code into R.

We’re using the EnhancedVolcano package ‘Publication-ready volcano plots with enhanced colouring and labeling’

The below plot only labels the top 20 DE genes, using mostly plot defaults. You can modify this code to label whatever genes you like, change plot colours, change axis labels, change log fold change cutoffs, etc, by following the EnhancedVolcano guide here:https://bioconductor.org/packages/release/bioc/html/EnhancedVolcano.html

#### 6b. Volcano plot #### p <- EnhancedVolcano(annot_nodups, lab = annot_nodups$SYMBOL, selectLab = annot_nodups$SYMBOL[1:20], drawConnectors = TRUE, title = NULL, subtitle = NULL, x = 'log2FoldChange', y = 'pvalue') p # NOTE: the above plot shows labels for the top significantly DE (i.e. by lowest adjusted p value) genes. # Output as publication quality (300dpi) tiff and pdf. # Create a (300dpi) tiff ggsave(file = paste0("./figures/volcano_", paste(degroups, collapse = "_Vs_"), ".tiff"), dpi = 300, compression = "lzw", device = "tiff", width = 10, height = 8, plot = p) # Create a pdf ggsave(file = paste0("./figures/volcano_", paste(degroups, collapse = "_Vs_"), ".pdf"), device = "pdf", width = 10, height = 8, plot = p) # Create a png ggsave(file = paste0("./figures/volcano_", paste(degroups, collapse = "_Vs_"), ".png"), device = "png", width = 10, height = 8, plot = p)
volcano_Basal_cells_Vs_Differentiated_cells.png
Figure 3: Volcano plot showing how differentially expressed genes between treatment groups.

4c. DE genes heatmap

This section plots a heatmap and dendrogram of DE gene expression per sample. Expression counts are scaled and centered so that groupwise relationships can be examined. Copy and paste this section of code into R.

 

All_DEG_Heatmap_Differentiated_cells_Vs_Basal_cells.png
Figure 4. Heat map showing Basal cells vs Differentiated cells for all genes

 


Related content

2024-2: 5a-Introduction - Differential Expression (DE) using DESeq2
2024-2: 5a-Introduction - Differential Expression (DE) using DESeq2
More like this
2024-2: 5a.3 Checking for outliers and batch effects
2024-2: 5a.3 Checking for outliers and batch effects
More like this
2024-2: 5a.1 Preparing your data for DE
2024-2: 5a.1 Preparing your data for DE
Read with this
eResearch - Session 3 : Differential expression analysis using R
eResearch - Session 3 : Differential expression analysis using R
More like this
2024-2: 5b.4 GO (Gene ontology) term enrichment
2024-2: 5b.4 GO (Gene ontology) term enrichment
Read with this
Task 4 : Differential expression analysis using R
Task 4 : Differential expression analysis using R
More like this