2024-2: 7b-Exercises - MirGeneDB
Save a copy of the DESeq2.R script into the run3_MirGeneDB folder and edit it as below…
Exercises for you to try:
We have also analysed this dataset against a different microRNA database, MirGeneDB. MirGeneDB is a database of manually curated microRNA genes that have been validated and annotated as initially described in Fromm et al. 2015 and Fromm et al. 2020. MirGeneDB 2.1, published in the 2022 NAR database issue, includes more than 16,000 microRNA gene entries representing more than 1,500 miRNA families from 75 metazoan species.
The output of the MirGeneDB analysis can be found at /work/training/2024/smallRNAseq/runs/run3_MirGeneDB. Use it to practise editing the R scripts we’ve given you so that you get the same plots as above for this analysis, in preparation for doing the same with your own data.
Precomputed results from session 6:
We ran the small RNA-seq samples against the MirGeneDB database. The resulting count table and the sample metadata can be found at:
/work/training/2024/smallRNAseq/runs/run3_MirGeneDB/results/mirna_quant/edger_qc/mature_counts.csv
/work/training/2024/smallRNAseq/data/human_disease/metadata_microRNA.txt
Let’s create a “DESeq2” folder and copy the files needed for the statistical analysis:
mkdir -p $HOME/workshop/2024-2/session6_smallRNAseq/runs/run2_human_MirGeneDB/DESeq2
cp $HOME/workshop/2024-2/session6_smallRNAseq/scripts/transpose_csv.py $HOME/workshop/2024-2/session6_smallRNAseq/runs/run2_human_MirGeneDB/DESeq2
cp $HOME/workshop/2024-2/session6_smallRNAseq/data/metadata_microRNA.txt $HOME/workshop/2024-2/session6_smallRNAseq/runs/run2_human_MirGeneDB/DESeq2
cp /work/training/2024/smallRNAseq/runs/run3_MirGeneDB/results/mirna_quant/edger_qc/mature_counts.csv $HOME/workshop/2024-2/session6_smallRNAseq/runs/run2_human_MirGeneDB/DESeq2
cd $HOME/workshop/2024-2/session6_smallRNAseq/runs/run2_human_MirGeneDB/DESeq2
To transpose the initial “mature_counts.csv” file do the following:
python transpose_csv.py --input mature_counts.csv --out mature_counts.txt
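For readers curious about what a transpose step like this does, here is a minimal sketch in Python. It is not the workshop's transpose_csv.py, whose contents are not shown here. The function name `transpose_table` and the choice of tab-separated output are assumptions made for illustration only.

```python
import csv


def transpose_table(in_path, out_path, out_delimiter="\t"):
    """Read a comma-separated table and write it with rows and columns swapped.

    Hypothetical helper for illustration; the tab-separated output format
    is an assumption, not something confirmed by the workshop script.
    """
    with open(in_path, newline="") as handle:
        rows = list(csv.reader(handle))

    # zip(*rows) turns the i-th input column into the i-th output row.
    # It stops at the shortest row, so ragged input loses trailing cells.
    transposed = list(zip(*rows))

    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter=out_delimiter, lineterminator="\n")
        writer.writerows(transposed)

    # Number of rows written, i.e. the number of columns in the input.
    return len(transposed)
```

Calling `transpose_table("mature_counts.csv", "mature_counts.txt")` would write one output row per input column.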
Differential expression analysis using RStudio
Run analysis script in RStudio
Pre-steps: Open RStudio, create a new R script ('File'->'New File'->‘R script’), hit the save button and save this file in the working directory you created above (H:\workshop\2024-2\session6_smallRNAseq\runs\run2_human_MirGeneDB\DESeq2). Name the R script ‘DESeq2.R’.
Step 1: LOAD PACKAGES
Step 2: IMPORT DATA: change setwd() line, read.csv line and read.table line
Step 3: LOOKING FOR OUTLIERS AND BATCH EFFECTS - TRANSFORM DATA: change the low-coverage cutoff so that transcripts below 20 are removed
Step 4: LOOKING FOR OUTLIERS AND BATCH EFFECTS - VISUALISE DATA (PCA): change the confidence interval ellipse on the PCA to 99%
Step 5: LOOKING FOR OUTLIERS AND BATCH EFFECTS - VISUALISE DATA (HEATMAP): change the colours in the heatmap to something you like better - https://www.colorhexa.com/11dd66
Step 6: LOOKING FOR DIFFERENTIALLY EXPRESSED GENES: change the p-value cutoff for significantly differentially expressed genes to 0.01
Step 7: LOOKING FOR DIFFERENTIALLY EXPRESSED GENES - VISUALISATION (VOLCANO PLOT): change the dot size on the volcano plot
Step 8: REMOVE OUTLIERS and LOOKING FOR DIFFERENTIALLY EXPRESSED GENES - VISUALISATION (PCA, VOLCANO PLOT): remember to change the low-coverage transcripts level to 20, change the confidence interval ellipses on the PCA to 99%, and change the p-value to 0.01
Step 9: DIFFERENTIAL EXPRESSION ANALYSIS - HEATMAP and DENDROGRAM: answer the following questions. How many outliers were removed? How many differentially expressed genes were identified using MirGeneDB with a p-value of 0.01 (after outliers were removed), compared with the miRBase database with a p-value of 0.05 (after outliers were removed)?
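If you export a results table from R to CSV, one way to double-check the counts asked for in the final question is a short Python sketch like the one below. It assumes the table has a column named `padj` (the adjusted p-value column that DESeq2 results normally carry) and that missing values are written as `NA`. The function name `count_significant` and any file names are hypothetical. If your script filters on the raw `pvalue` column instead, pass that name as `column`.

```python
import csv


def count_significant(results_csv, alpha, column="padj"):
    """Count rows whose value in `column` is strictly below `alpha`.

    Hypothetical helper; assumes a CSV with a header row and with
    missing values written as "NA" or left empty.
    """
    count = 0
    with open(results_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            value = row.get(column)
            # Skip genes with no p-value (for example, filtered by DESeq2).
            if value is None or value in ("", "NA"):
                continue
            if float(value) < alpha:
                count += 1
    return count
```

For example, `count_significant("my_results.csv", 0.01)` and `count_significant("my_results.csv", 0.05)` give the two numbers to compare, where `my_results.csv` stands in for whatever file you exported.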