{"type":"doc","content":[{"type":"extension","attrs":{"layout":"default","extensionType":"com.atlassian.confluence.macro.core","extensionKey":"toc","parameters":{"macroParams":{"style":{"value":"none"}},"macroMetadata":{"macroId":{"value":"6ece9c35-6af0-45e8-96f8-85e00422eb1c"},"schemaVersion":{"value":"1"},"title":"Table of Contents"}},"localId":"8f85fc19-b40b-42de-8554-d4bc70780b40"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Identify differentially expressed (DE) genes","type":"text"}]},{"type":"paragraph","content":[{"text":"Now we come to the main analysis section of this workflow, where we will identify differentially expressed genes. This generates two important values per gene: the log fold change, which measures the change in expression level between the two treatment groups, and the adjusted p value, which indicates whether that change is statistically significant after correcting for multiple testing (this limits the number of false positives).","type":"text"}]},{"type":"paragraph","content":[{"text":"The R package we use to find DE genes is ","type":"text"},{"text":"DESeq2, “Differential gene expression analysis based on the negative binomial distribution”","type":"text","marks":[{"type":"link","attrs":{"href":"http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html"}}]},{"text":".","type":"text"}]},{"type":"paragraph","content":[{"text":"DESeq2 estimates variance-mean dependence in count data from high-throughput sequencing assays and tests for differential expression based on a model using the negative binomial distribution.","type":"text"}]},{"type":"paragraph","content":[{"text":"Copy the code below into your R script and run it.","type":"text","marks":[{"type":"backgroundColor","attrs":{"color":"#d3f1a7"}}]}]},{"type":"paragraph","content":[{"text":"You’ll need to choose two treatment groups you want to compare (","type":"text"},{"text":"degroups <-","type":"text","marks":[{"type":"code"}]},{"text":"), from the list of treatment 
groups in your samples table.","type":"text"}]},{"type":"codeBlock","content":[{"text":"#### 6. Differential expression analysis ####\n# In this section we use the DESeq2 package to identify differentially expressed genes.\n## USER INPUT\n# Choose the treatment groups you want to compare.\n# To see what groups are present, run the following:\nunique(meta$group)\n# Enter which groups you want to compare (two groups only). BASELINE OR CONTROL GROUP SHOULD BE LISTED FIRST.\n#degroups <- c(\"Basal_cells\", \"Murine_tracheal_epithelial_cell\")\ndegroups <- c(\"Basal_cells\", \"Differentiated_cells\")\n\n# From the count table, pull out only the counts from the above groups\nexpdata <- as.matrix(counts[,meta$group %in% degroups])\n# Set up the experimental condition\n# 'factor' sets up the reference level, i.e. which is the baseline group (otherwise the default baseline level is in alphabetic order)\ncondition <- factor(meta$group[meta$group %in% degroups], levels = degroups)\n# Type 'condition' in the console to see if the levels are set correctly\n# Set up column data (treatment groups and sample ID)\ncoldata <- data.frame(row.names=colnames(expdata), condition)\n\n# Create the DESeq2 dataset (dds)\ndds <- DESeq2::DESeqDataSetFromMatrix(countData=expdata, colData=coldata, design=~condition)\ndds$condition <- factor(dds$condition, levels = degroups)\n# Run DESeq2 to identify differentially expressed genes\ndeseq <- DESeq(dds)\n# Extract a results table from the DESeq analysis\nres <- results(deseq)\n# Reorder results by adjusted p values, so that the most significantly DE genes are at the top\nres <- res[order(res$padj), ]\n# You can do a summary of the results to see how many significantly (alpha=0.05, adjust to 0.01 if needed) upregulated and downregulated DE genes were found\nsummary(res, alpha=0.05)\n# Convert from DESeq object to a data frame.\nres <- data.frame(res)\n# Look at the top 6 DE 
genes\nhead(res)\n","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"4a. Annotating your DE genes","type":"text"}]},{"type":"paragraph","content":[{"text":"The table of DE genes will have gene identifiers according to the reference genome that was used to generate the count table. This is usually an Entrez gene ID if the NCBI genome was used, or an Ensembl gene ID if the Ensembl genome was used.","type":"text"}]},{"type":"paragraph","content":[{"text":"These Entrez/Ensembl gene IDs are numeric or alphanumeric codes that aren’t particularly informative. They can be looked up individually to see which genes they represent, or we can annotate them all at once in R, matching each Entrez/Ensembl gene ID to its more commonly known gene symbol and description.","type":"text"}]},{"type":"paragraph","content":[{"text":"We'll be using ","type":"text"},{"text":"Bioconductor ","type":"text","marks":[{"type":"link","attrs":{"href":"https://www.bioconductor.org/"}}]},{"text":"genome-wide annotation packages to provide the annotation data. ","type":"text"},{"type":"inlineCard","attrs":{"url":"https://bioconductor.org/packages/3.17/data/annotation/"}}]},{"type":"paragraph","content":[{"text":"Annotation packages for human (org.Hs.eg.db), mouse (org.Mm.eg.db), rat (org.Rn.eg.db), E. coli strain K12 (org.EcK12.eg.db), E. coli strain Sakai (org.EcSakai.eg.db), zebrafish (org.Dr.eg.db) and Drosophila (org.Dm.eg.db) were installed with the required R packages. If your species is not in the above list, contact the eResearch team.","type":"text"}]},{"type":"paragraph","content":[{"text":"This workshop uses mouse data, so we will be using the ","type":"text"},{"text":"org.Mm.eg.db","type":"text","marks":[{"type":"strong"}]},{"text":" annotation package here. ","type":"text"},{"text":"Copy and paste this section of code into R.","type":"text","marks":[{"type":"backgroundColor","attrs":{"color":"#d3f1a7"}}]}]},{"type":"codeBlock","content":[{"text":"#### 6a. 
Annotating your DE genes ####\n# Annotation packages for human (org.Hs.eg.db), mouse (org.Mm.eg.db), rat (org.Rn.eg.db), E. coli strain K12 (org.EcK12.eg.db), E. coli strain Sakai (org.EcSakai.eg.db), zebrafish (org.Dr.eg.db) and Drosophila (org.Dm.eg.db) were installed with the required R packages. If your species is not in the above list, contact the eResearch team.\n## USER INPUT\n# Input your species genome below (select from the above list)\nmy_genome <- org.Mm.eg.db\n# Pull out just the Entrez/Ensembl gene IDs\ngene_ids <- row.names(res)\n# To see the list of annotations you can apply to your data, run:\nkeytypes(my_genome)\n# Annotate gene symbol and description to your DE gene IDs (you can add other keytypes from the above 'keytypes(my_genome)' list, if you choose)\ncols <- c(\"SYMBOL\", \"GENENAME\")\n\n## USER INPUT\n# Provide the gene ID type for your DE data\n# If you have Ensembl IDs, enter \"ENSEMBL\", if you have Entrez IDs, enter \"ENTREZID\", if you have gene symbols, enter \"SYMBOL\"\nidtype <- \"ENSEMBL\"\n# Map the Entrez/Ensembl gene IDs to gene symbol and description\nmap <- AnnotationDbi::select(my_genome, keys=gene_ids, columns=cols, keytype=idtype)\n# Combine the annotation data with the DE data\n# Since we're matching Ensembl -> Entrez there isn't always a 1:1 mapping, so we need to merge rather than cbind\nannot <- merge(x = res, y = map, by.x = 0, by.y = idtype, all = FALSE)\n# There isn't always a 1:1 mapping between gene identifiers, so we also need to remove duplicates\nannot_nodups <- annot[!duplicated(annot$Row.names), ]\n# Reorder by adjusted p value\nannot_nodups <- annot_nodups[order(annot_nodups$padj), ]\n\n# Add normalised counts to the output table. 
This is so you can later plot expression trends for individual genes in R, Excel, etc.\n# Need to normalise the counts first, by dividing each sample's counts by its size factor calculated by DESeq2 (in the 'deseq' object)\nexpdata_norm <- as.matrix(expdata) %*% diag(1 / deseq$sizeFactor)\ncolnames(expdata_norm) <- colnames(expdata)\nannot_counts <- merge(x = annot_nodups, y = expdata_norm, by.x = \"Row.names\", by.y = 0, all = TRUE)\n# Export the full annotated dataset as a csv (for journal submission)\n# Creates a 'Tables' subdirectory where all tables will be output\ndir.create(\"Tables\", showWarnings = FALSE)\nwrite.csv(annot_counts, file=paste0(\"./Tables/all_genes_\", paste(degroups, collapse = \"_Vs_\"), \".csv\"), row.names = FALSE)\n# Pull out just significant genes (change from 0.05 to 0.01 if needed)\nDE_genes <- subset(annot_counts, padj < 0.05, select=colnames(annot_counts))\n# NOTE: add lfc cutoffs if needed. E.g., keep genes with log2FoldChange > 1 or log2FoldChange < -1\n# DE_genes <- subset(DE_genes, log2FoldChange > 1 | log2FoldChange < -1, select=colnames(DE_genes))\n# Export the sig DE genes as a table (to be used in downstream analysis)\nwrite.csv(DE_genes, file=paste0(\"./Tables/DE_genes_\", paste(degroups, collapse = \"_Vs_\"), \".csv\"), row.names = FALSE)","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"4b. 
Volcano plot","type":"text"}]},{"type":"paragraph","content":[{"text":"Now that we have our list of DE genes, we can visualise them with a volcano plot and heat map.","type":"text"}]},{"type":"paragraph","content":[{"text":"First, we’ll generate a volcano plot by copying and pasting this section of code into R.","type":"text","marks":[{"type":"backgroundColor","attrs":{"color":"#d3f1a7"}}]}]},{"type":"paragraph","content":[{"text":"We’re using the ","type":"text"},{"text":"EnhancedVolcano package ‘Publication-ready volcano plots with enhanced colouring and labeling’","type":"text","marks":[{"type":"link","attrs":{"href":"https://bioconductor.org/packages/release/bioc/html/EnhancedVolcano.html"}}]}]},{"type":"paragraph","content":[{"text":"The below plot only labels the top 20 DE genes, using mostly plot defaults. You can modify this code to label whatever genes you like, change plot colours, change axis labels, change log fold change cutoffs, etc, by following the EnhancedVolcano guide here:","type":"text"},{"type":"inlineCard","attrs":{"url":"https://bioconductor.org/packages/release/bioc/html/EnhancedVolcano.html"}}]},{"type":"codeBlock","content":[{"text":"#### 6b. Volcano plot ####\np <- EnhancedVolcano(annot_nodups, lab = annot_nodups$SYMBOL, selectLab = annot_nodups$SYMBOL[1:20], drawConnectors = TRUE, title = NULL, subtitle = NULL, x = 'log2FoldChange', y = 'pvalue')\np\n# NOTE: the above plot shows labels for the top significantly DE (i.e. 
by lowest adjusted p value) genes.\n# Output as publication quality (300dpi) tiff, plus pdf and png.\n# Create a (300dpi) tiff\nggsave(file = paste0(\"./figures/volcano_\", paste(degroups, collapse = \"_Vs_\"), \".tiff\"), dpi = 300, compression = \"lzw\", device = \"tiff\", width = 10, height = 8, plot = p)\n# Create a pdf\nggsave(file = paste0(\"./figures/volcano_\", paste(degroups, collapse = \"_Vs_\"), \".pdf\"), device = \"pdf\", width = 10, height = 8, plot = p)\n# Create a png\nggsave(file = paste0(\"./figures/volcano_\", paste(degroups, collapse = \"_Vs_\"), \".png\"), device = \"png\", width = 10, height = 8, plot = p)","type":"text"}]},{"type":"mediaSingle","attrs":{"layout":"center","width":720,"widthType":"pixel"},"content":[{"type":"media","attrs":{"width":720,"alt":"volcano_Basal_cells_Vs_Differentiated_cells.png","id":"a9974acd-1e1d-4f71-bf8d-52c893b39d8a","collection":"contentId-2380300338","type":"file","height":576}},{"type":"caption","content":[{"text":"Figure 3. Volcano plot showing differentially expressed genes between treatment groups.","type":"text"}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"4c. DE genes heatmap","type":"text"}]},{"type":"paragraph","content":[{"text":"This section plots a heatmap and dendrogram of DE gene expression per sample. Expression counts are scaled and centered so that groupwise relationships can be examined. ","type":"text"},{"text":"Copy and paste this section of code into R.","type":"text","marks":[{"type":"backgroundColor","attrs":{"color":"#d3f1a7"}}]}]},{"type":"codeBlock","content":[{"text":"#### 6c. DE genes heatmaps and dendrograms ####\n# Make the row names gene symbols.\nDE_genes <- na.omit(DE_genes)\nrow.names(DE_genes) <- make.unique(DE_genes$SYMBOL)\n# Sort by adjusted p value\nDE_genes <- DE_genes[order(DE_genes$padj), ]\n# Pull out normalised counts only\nsiggc <- DE_genes[colnames(DE_genes) %in% colnames(expdata)]\n# Scale and center each row. 
This is important to visualise relative differences between groups and not have row-wise colouration dominated by high or low gene expression.\nxts <- scale(t(siggc))\nxtst <- t(xts)\n# Define annotation column\nannot_columns <- data.frame(meta$group[meta$group %in% degroups])\n# Make the row names the sample IDs\nrow.names(annot_columns) <- meta$sample_ID[meta$group %in% degroups]\ncolnames(annot_columns) <- \"Treatment groups\"\n# Need to factorise it\nannot_columns[[1]] <- factor(annot_columns[[1]])\n# Generate dendrogram and heatmap for ALL DE genes\npheatmap(xtst, color=colorRampPalette(c(\"#D55E00\", \"white\", \"#0072B2\"))(100), annotation_col=annot_columns, annotation_names_col = F, fontsize_col = 12, fontsize_row = 7, labels_row = row.names(siggc), show_rownames = F, filename = paste0(\"./figures/All_DEG_Heatmap_\", paste(plotgroups, collapse = \"_Vs_\"), \".tiff\"))\npheatmap(xtst, color=colorRampPalette(c(\"#D55E00\", \"white\", \"#0072B2\"))(100), annotation_col=annot_columns, annotation_names_col = F, fontsize_col = 12, fontsize_row = 7, labels_row = row.names(siggc), show_rownames = F, filename = paste0(\"./figures/All_DEG_Heatmap_\", paste(plotgroups, collapse = \"_Vs_\"), \".pdf\"))\npheatmap(xtst, color=colorRampPalette(c(\"#D55E00\", \"white\", \"#0072B2\"))(100), annotation_col=annot_columns, annotation_names_col = F, fontsize_col = 12, fontsize_row = 7, labels_row = row.names(siggc), show_rownames = F, filename = paste0(\"./figures/All_DEG_Heatmap_\", paste(plotgroups, collapse = \"_Vs_\"), \".png\"))\n# Generate dendrogram and heatmap for top 20 DE genes\npheatmap(xtst[1:20,], color=colorRampPalette(c(\"#D55E00\", \"white\", \"#0072B2\"))(100), annotation_col=annot_columns, annotation_names_col = F, fontsize_col = 14, fontsize_row = 12, labels_row = row.names(siggc), show_rownames = T, filename = paste0(\"./figures/Top_DEG_Heatmap_\", paste(plotgroups, collapse = \"_Vs_\"), \".tiff\"))\npheatmap(xtst[1:20,], 
color=colorRampPalette(c(\"#D55E00\", \"white\", \"#0072B2\"))(100), annotation_col=annot_columns, annotation_names_col = F, fontsize_col = 14, fontsize_row = 12, labels_row = row.names(siggc), show_rownames = T, filename = paste0(\"./figures/Top_DEG_Heatmap_\", paste(plotgroups, collapse = \"_Vs_\"), \".pdf\"))\npheatmap(xtst[1:20,], color=colorRampPalette(c(\"#D55E00\", \"white\", \"#0072B2\"))(100), annotation_col=annot_columns, annotation_names_col = F, fontsize_col = 14, fontsize_row = 12, labels_row = row.names(siggc), show_rownames = T, filename = paste0(\"./figures/Top_DEG_Heatmap_\", paste(plotgroups, collapse = \"_Vs_\"), \".png\"))\n# NOTE: you can plot more (or fewer) than the top 20 genes by adjusting 'xtst[1:20,]'. If you wanted to plot the top 50 genes you'd change this to 'xtst[1:50,]'\n","type":"text"}]},{"type":"paragraph","content":[{"text":" ","type":"text"}]},{"type":"mediaSingle","attrs":{"layout":"center","width":504,"widthType":"pixel"},"content":[{"type":"media","attrs":{"width":504,"alt":"All_DEG_Heatmap_Differentiated_cells_Vs_Basal_cells.png","id":"b1071678-5206-463c-a519-e0c75b1dea9d","collection":"contentId-2380300338","type":"file","height":504}},{"type":"caption","content":[{"text":"Figure 4. Heat map showing Basal cells vs Differentiated cells for all DE genes","type":"text"}]}]},{"type":"paragraph","content":[{"text":" ","type":"text"}]},{"type":"mediaSingle","attrs":{"layout":"center","width":504,"widthType":"pixel"},"content":[{"type":"media","attrs":{"width":504,"alt":"Top_DEG_Heatmap_Differentiated_cells_Vs_Basal_cells.png","id":"9543da94-7419-4437-b2d5-ec85d911a638","collection":"contentId-2380300338","type":"file","height":504}},{"type":"caption","content":[{"text":"Figure 5. 
Heat map showing Basal cells vs Differentiated cells for the top 20 DE genes","type":"text"}]}]},{"type":"rule"},{"type":"layoutSection","content":[{"type":"layoutColumn","attrs":{"width":50.0},"content":[{"type":"orderedList","attrs":{"order":3},"content":[{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"Checking for outliers and batch effects - previous","type":"text","marks":[{"type":"link","attrs":{"href":"https://eresearchqut.atlassian.net/wiki/spaces/EG/pages/2380300316"}}]}]}]}]}]},{"type":"layoutColumn","attrs":{"width":50.0},"content":[{"type":"paragraph","content":[{"text":"Homework (DE) - next","type":"text","marks":[{"type":"link","attrs":{"href":"https://eresearchqut.atlassian.net/wiki/spaces/EG/pages/2380201999"}}]}]}]}]}],"version":1}