1a. Count Table
a. First, create a new folder in H:\workshop\2024\rnaseq and give it a suitable name, such as ‘DE_analysis_workshop’.
b. Inside it, create a subfolder called ‘data’. This is where your two data files will be stored.
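If you prefer the command line, steps a and b can be done in one go. This is a minimal sketch that assumes the H:\workshop folder corresponds to `$HOME/workshop` on the Linux side, as the copy commands later in this guide suggest; `WORKDIR` is just an illustrative variable name.

```shell
# Assumed location: the same path the cp commands later in this guide write to.
# Set WORKDIR beforehand if your workshop folder lives somewhere else.
WORKDIR="${WORKDIR:-$HOME/workshop/2024/rnaseq/DE_analysis_workshop}"

# -p creates any missing parent folders and does not fail if they already exist.
mkdir -p "$WORKDIR/data"

# Confirm that both folders are present.
ls -d "$WORKDIR" "$WORKDIR/data"
```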
Examine your count table
The basis for the differential expression analysis is a count table: the number of sequence reads mapped to each defined gene region, per sample. There are a variety of methods to generate this count table, but for this exercise we will use the output from the Nextflow nf-core/rnaseq analysis you completed in the previous workshop sessions.
To access this count table:
Go to the W:\training\2024\rnaseq\runs\run3_RNAseq\results folder, which contains the results from running the nf-core/rnaseq pipeline. The output folders from task 3 look like this:
    ├── results
    │   ├── fastqc
    │   ├── multiqc
    │   ├── pipeline_info
    │   ├── star_salmon
    │   └── trimgalore
The count table can be found in the star_salmon folder. A list of files and folders in the star_salmon folder will look like this:
    .
    ├── bigwig
    ├── deseq2_qc
    ├── dupradar
    ├── featurecounts
    ├── log
    ├── metadata.xlsx
    ├── picard_metrics
    ├── qualimap
    ├── rseqc
    ├── salmon.merged.gene_counts_length_scaled.rds
    ├── salmon.merged.gene_counts_length_scaled.tsv
    ├── salmon.merged.gene_counts.rds
    ├── salmon.merged.gene_counts_scaled.rds
    ├── salmon.merged.gene_counts_scaled.tsv
    ├── salmon.merged.gene_counts.tsv    <---- We will use this feature counts file for DESeq2 expression analysis.
    ├── salmon.merged.gene_tpm.tsv
    ├── salmon.merged.transcript_counts.rds
    ├── salmon.merged.transcript_counts.tsv
    ├── salmon.merged.transcript_tpm.tsv
    ├── samtools_stats
    ├── SRR20622172
    ├── SRR20622172.markdup.sorted.bam
    ├── SRR20622172.markdup.sorted.bam.bai
    ├── SRR20622173
    ├── SRR20622173.markdup.sorted.bam
    ├── SRR20622173.markdup.sorted.bam.bai
    ├── SRR20622174
    ├── SRR20622174.markdup.sorted.bam
    ├── SRR20622174.markdup.sorted.bam.bai
    ├── SRR20622175
    ├── SRR20622175.markdup.sorted.bam
    ├── SRR20622175.markdup.sorted.bam.bai
    ├── SRR20622176
    ├── SRR20622176.markdup.sorted.bam
    ├── SRR20622176.markdup.sorted.bam.bai
    ├── SRR20622177
    ├── SRR20622177.markdup.sorted.bam
    ├── SRR20622177.markdup.sorted.bam.bai
    ├── SRR20622178
    ├── SRR20622178.markdup.sorted.bam
    ├── SRR20622178.markdup.sorted.bam.bai
    ├── SRR20622179
    ├── SRR20622179.markdup.sorted.bam
    ├── SRR20622179.markdup.sorted.bam.bai
    ├── SRR20622180
    ├── SRR20622180.markdup.sorted.bam
    ├── SRR20622180.markdup.sorted.bam.bai
    ├── stringtie
    └── tx2gene.tsv
The expression count file that we are interested in is salmon.merged.gene_counts.tsv. To preview its first lines, run:
Code Block |
---|
head salmon.merged.gene_counts.tsv |
The count table looks like this:
| gene_id | gene_name | SRR20622172 | SRR20622173 | SRR20622174 | SRR20622175 | SRR20622176 | SRR20622177 | SRR20622178 | SRR20622179 | SRR20622180 |
|---|---|---|---|---|---|---|---|---|---|---|
| ENSMUSG00000000001 | Gnai3 | 7086 | 4470 | 2457.002 | 2389 | 6398 | 2744 | 2681 | 3961 | 4399 |
| ENSMUSG00000000003 | Pbsn | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSMUSG00000000028 | Cdc45 | 1232.999 | 827 | 42 | 57 | 1036 | 55 | 78 | 88 | 89 |
| ENSMUSG00000000031 | H19 | 200 | 139 | 2 | 0 | 143.622 | 1 | 17.082 | 24 | 16.077 |
| ENSMUSG00000000037 | Scml2 | 70 | 57.001 | 8 | 8 | 66.999 | 16 | 23 | 27.999 | 29 |
| ENSMUSG00000000049 | Apoh | 0 | 0 | 1 | 0 | 2 | 2 | 1 | 3 | 0 |
| ENSMUSG00000000056 | Narf | 1933 | 1480 | 519 | 497 | 1730 | 539 | 365 | 458 | 536 |
| ENSMUSG00000000058 | Cav2 | 6008 | 3417 | 1347.001 | 1344 | 5482 | 1367 | 2669.001 | 4358 | 4365.832 |
| ENSMUSG00000000078 | Klf6 | 3809 | 2732 | 4413.001 | 3483.978 | 3559 | 4491 | 3209 | 3980 | 4626 |
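Beyond looking at the first few lines, it can help to confirm the overall shape of the table before loading it into R. The sketch below counts columns, samples and genes, and checks that every row has as many fields as the header. So that it runs anywhere, it writes a three-line excerpt of the table above to a temporary file; `COUNTS` is an illustrative variable that you would point at the real salmon.merged.gene_counts.tsv instead.

```shell
# Demo input: header plus the first two rows of the count table shown above.
# printf reuses the 11-field format for each group of 11 arguments.
COUNTS="$(mktemp)"
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  gene_id gene_name SRR20622172 SRR20622173 SRR20622174 SRR20622175 \
  SRR20622176 SRR20622177 SRR20622178 SRR20622179 SRR20622180 \
  ENSMUSG00000000001 Gnai3 7086 4470 2457.002 2389 6398 2744 2681 3961 4399 \
  ENSMUSG00000000003 Pbsn 0 0 0 0 0 0 0 0 0 \
  > "$COUNTS"

# Number of tab-separated columns in the header line.
n_cols=$(head -n 1 "$COUNTS" | awk -F'\t' '{print NF}')

# Number of genes = number of lines after the header.
n_genes=$(tail -n +2 "$COUNTS" | wc -l | tr -d ' ')

# Rows whose field count differs from the header would indicate a damaged file.
n_bad=$(awk -F'\t' -v n="$n_cols" 'NF != n' "$COUNTS" | wc -l | tr -d ' ')

# The first two columns are gene_id and gene_name; the rest are samples.
echo "columns: $n_cols, samples: $((n_cols - 2)), genes: $n_genes, malformed rows: $n_bad"
# For the demo excerpt this prints:
# columns: 11, samples: 9, genes: 2, malformed rows: 0
```

Run against the full count table, the sample count should still be 9, matching the nine SRR columns shown above.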
c. Copy the count table (the ‘salmon.merged.gene_counts.tsv’ file) to the ‘data’ folder you just created.
Code Block |
---|
cp /work/training/2024/rnaseq/runs/run3_RNAseq/results/star_salmon/salmon.merged.gene_counts.tsv $HOME/workshop/2024/rnaseq/DE_analysis_workshop/data/ |
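If the copy fails quietly (for example because the ‘data’ folder does not exist yet), the problem tends to surface only later in R. As an optional safeguard, the sketch below wraps the copy in a small helper that checks both paths first and then confirms that the copy is byte-identical to the original. `copy_checked` is a hypothetical helper name, not part of the workshop material.

```shell
# copy_checked SRC DEST_DIR
# Hypothetical helper: copies SRC into DEST_DIR and verifies the result.
copy_checked() {
  src="$1"
  dest_dir="$2"

  if [ ! -f "$src" ]; then
    echo "source file not found: $src" >&2
    return 1
  fi
  if [ ! -d "$dest_dir" ]; then
    echo "destination folder not found: $dest_dir" >&2
    return 1
  fi

  cp "$src" "$dest_dir/" || return 1

  # cmp -s prints nothing and returns 0 only when the two files are identical.
  cmp -s "$src" "$dest_dir/$(basename "$src")"
}

# Example call, using the paths from the cp command above:
# copy_checked /work/training/2024/rnaseq/runs/run3_RNAseq/results/star_salmon/salmon.merged.gene_counts.tsv \
#   "$HOME/workshop/2024/rnaseq/DE_analysis_workshop/data" && echo "copy verified"
```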
1d. Sample Table - metadata
In the same W:\training\2024\rnaseq\runs\run3_RNAseq\results\star_salmon directory there is a file called metadata.xlsx. Normally you would need to create this file yourself to match your sample IDs and treatment groups, but we have already created it for you. The samples table needs three columns: ‘sample_name’, containing the sample names exactly as they appear in the column names of the count table; ‘sample_ID’, the (less messy) names you want to use for the samples in this analysis workflow; and ‘group’, the treatment group each sample belongs to. The contents of this file look like this:
| sample_name | sample_ID | group |
|---|---|---|
| SRR20622174 | DC1 | Differentiated_cells |
| SRR20622175 | DC2 | Differentiated_cells |
| SRR20622177 | DC3 | Differentiated_cells |
| SRR20622178 | BC1 | Basal_cells |
| SRR20622179 | BC2 | Basal_cells |
| SRR20622180 | BC3 | Basal_cells |
| SRR20622172 | mTEC1 | Murine_tracheal_epithelial_cell |
| SRR20622173 | mTEC2 | Murine_tracheal_epithelial_cell |
| SRR20622176 | mTEC3 | Murine_tracheal_epithelial_cell |
Copy this file to your ‘data’ folder as well.
Code Block |
---|
cp /work/training/2024/rnaseq/runs/run3_RNAseq/results/star_salmon/metadata.xlsx $HOME/workshop/2024/rnaseq/DE_analysis_workshop/data/ |
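Because the ‘sample_name’ column has to match the sample column names of the count table, it is worth comparing the two before starting the analysis. The sketch below does this with standard shell tools. Since an .xlsx file cannot be read directly from the shell, it assumes you have saved the ‘sample_name’ column as a plain text file with one name per line; `sample_names.txt` is a hypothetical file name. To keep the example self-contained, both inputs are recreated in a temporary folder from the values shown in this guide.

```shell
# Demo inputs in a temporary folder so the check can be tried anywhere.
TMP="$(mktemp -d)"
COUNTS="$TMP/salmon.merged.gene_counts.tsv"   # only the header line is needed
SAMPLES="$TMP/sample_names.txt"               # hypothetical export of the sample_name column

printf 'gene_id\tgene_name\tSRR20622172\tSRR20622173\tSRR20622174\tSRR20622175\tSRR20622176\tSRR20622177\tSRR20622178\tSRR20622179\tSRR20622180\n' > "$COUNTS"
printf '%s\n' SRR20622174 SRR20622175 SRR20622177 SRR20622178 SRR20622179 \
  SRR20622180 SRR20622172 SRR20622173 SRR20622176 > "$SAMPLES"

# Sample columns start at field 3 of the header; write one name per line, sorted.
head -n 1 "$COUNTS" | cut -f 3- | tr '\t' '\n' | sort > "$TMP/in_counts.txt"
sort "$SAMPLES" > "$TMP/in_metadata.txt"

# comm -3 hides names common to both lists, leaving only the mismatches.
mismatches=$(comm -3 "$TMP/in_counts.txt" "$TMP/in_metadata.txt")

if [ -z "$mismatches" ]; then
  echo "sample names match"
else
  echo "names found in only one of the two files:"
  echo "$mismatches"
fi
```

Both lists are sorted before the comparison, so the check only looks at which names are present and ignores the fact that the metadata rows are in a different order from the count table columns.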
e. Open RStudio and create a new R script (‘File’ → ‘New File’ → ‘R Script’). Then choose ‘File’ → ‘Save’ and save the script in the analysis workshop folder you created in step a (NOT in the ‘data’ folder). Give the script file a name (e.g. DESeq2.R).