2024-2: 5a.1 Preparing your data for DE
1a. Count Table
a. First create a new folder in H:\workshop\2024\rnaseq . Call it something suitable, such as ‘DE_analysis_workshop’.
b. Create a subfolder here called ‘data’. This is where your two data files will be stored:
H:\workshop\2024\rnaseq\DE_analysis_workshop\data
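If you prefer to create these folders from R rather than from the file explorer, steps a and b could be sketched as follows. This is a minimal sketch, not part of the original instructions: it builds the folders under a temporary directory (a stand-in for H:/workshop/2024/rnaseq) so that it is safe to run anywhere. On the workshop machine you would point `base_dir` at the real location instead.

```r
# Hypothetical base directory: a temporary folder stands in for
# H:/workshop/2024/rnaseq so the sketch can be run on any machine.
base_dir <- tempfile("rnaseq_")

# file.path() joins path components with "/" (R accepts "/" on Windows too).
workshop_dir <- file.path(base_dir, "DE_analysis_workshop")
data_dir <- file.path(workshop_dir, "data")

# recursive = TRUE creates the missing parent folders in the same call.
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# Confirm that the 'data' subfolder now exists.
dir.exists(data_dir)
```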
Examine your count table
The basis for the differential expression (DE) analysis is a count table: the number of sequence reads mapped to each defined gene region, per sample. There are a variety of methods to generate this count table, but for this exercise we will use the output from the Nextflow nf-core/rnaseq analysis you completed in the previous workshop sessions.
To access this count table:
Go to the W:\training\2024\rnaseq\runs\run3_RNAseq\results folder, which contains the results from running the nf-core/rnaseq pipeline. The output folders from task 3 look like this:
├── results
│ ├── fastqc
│ ├── multiqc
│ ├── pipeline_info
│ ├── star_salmon
│ └── trimgalore
The count table can be found in the star_salmon folder. A list of files and folders in the star_salmon folder will look like this:
.
├── bigwig
├── deseq2_qc
├── dupradar
├── featurecounts
├── log
├── metadata.xlsx
├── picard_metrics
├── qualimap
├── rseqc
├── salmon.merged.gene_counts_length_scaled.rds
├── salmon.merged.gene_counts_length_scaled.tsv
├── salmon.merged.gene_counts.rds
├── salmon.merged.gene_counts_scaled.rds
├── salmon.merged.gene_counts_scaled.tsv
├── salmon.merged.gene_counts.tsv <---- We will use this gene counts file for the DESeq2 differential expression analysis.
├── salmon.merged.gene_tpm.tsv
├── salmon.merged.transcript_counts.rds
├── salmon.merged.transcript_counts.tsv
├── salmon.merged.transcript_tpm.tsv
├── samtools_stats
├── SRR20622172
├── SRR20622172.markdup.sorted.bam
├── SRR20622172.markdup.sorted.bam.bai
├── SRR20622173
├── SRR20622173.markdup.sorted.bam
├── SRR20622173.markdup.sorted.bam.bai
├── SRR20622174
├── SRR20622174.markdup.sorted.bam
├── SRR20622174.markdup.sorted.bam.bai
├── SRR20622175
├── SRR20622175.markdup.sorted.bam
├── SRR20622175.markdup.sorted.bam.bai
├── SRR20622176
├── SRR20622176.markdup.sorted.bam
├── SRR20622176.markdup.sorted.bam.bai
├── SRR20622177
├── SRR20622177.markdup.sorted.bam
├── SRR20622177.markdup.sorted.bam.bai
├── SRR20622178
├── SRR20622178.markdup.sorted.bam
├── SRR20622178.markdup.sorted.bam.bai
├── SRR20622179
├── SRR20622179.markdup.sorted.bam
├── SRR20622179.markdup.sorted.bam.bai
├── SRR20622180
├── SRR20622180.markdup.sorted.bam
├── SRR20622180.markdup.sorted.bam.bai
├── stringtie
└── tx2gene.tsv
The expression count file that we are interested in is salmon.merged.gene_counts.tsv. To preview its first few lines, run:
head salmon.merged.gene_counts.tsv
The count table looks like this:
c. Copy the count table (the ‘salmon.merged.gene_counts.tsv’ file) to the ‘data’ folder you just created.
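The copy and a quick inspection can also be done from R. The sketch below is illustrative only. It writes a tiny made-up count table to a temporary folder, copies it into a stand-in ‘data’ folder with `file.copy()`, and reads it back with `read.delim()`. The gene IDs and counts are invented, both folder paths are hypothetical, and the leading `gene_id` and `gene_name` columns are an assumption about the layout of the pipeline's output.

```r
# Stand-ins for the star_salmon results folder and your 'data' folder
# (hypothetical temporary paths, so nothing real is touched).
results_dir <- tempfile("star_salmon_")
data_dir <- tempfile("data_")
dir.create(results_dir)
dir.create(data_dir)

# Toy count table with invented values. Assumed layout: gene_id, gene_name,
# then one column of counts per sample.
toy <- data.frame(
  gene_id = c("geneA", "geneB", "geneC"),
  gene_name = c("A", "B", "C"),
  SRR20622172 = c(10, 0, 250.5),
  SRR20622173 = c(12, 3, 198),
  stringsAsFactors = FALSE
)
src <- file.path(results_dir, "salmon.merged.gene_counts.tsv")
write.table(toy, src, sep = "\t", quote = FALSE, row.names = FALSE)

# Copy the count table into the data folder (returns TRUE on success).
copied <- file.copy(src, data_dir)

# Read it back; check.names = FALSE keeps the sample names exactly as written.
counts <- read.delim(file.path(data_dir, "salmon.merged.gene_counts.tsv"),
                     check.names = FALSE)

# The R counterpart of 'head': first rows, plus the table dimensions.
head(counts)
dim(counts)
```

Salmon-derived counts are estimates and may contain non-integer values, which is why the toy table includes one.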
1d. Sample Table - metadata
In the same W:\training\2024\rnaseq\runs\run3_RNAseq\results\star_salmon directory there is a file called metadata.xlsx . You will normally need to create this file yourself so that it matches your sample IDs and treatment groups, but we have already created it for you. This sample table needs three columns: ‘sample_name’, containing the sample names as they appear in the count table (its column names); ‘sample_ID’, containing the (less messy) names you want to use for the samples in this analysis workflow; and ‘group’, containing the treatment group each sample belongs to. The contents of this file look like this:
Copy this file to your ‘data’ folder as well.
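Because the ‘sample_name’ column has to match the column names of the count table, it can be worth checking this before starting the analysis. The following is a minimal sketch using invented toy data. The ‘sample_ID’ and ‘group’ values are hypothetical, as is the list of count table columns. In practice the metadata could be read from the Excel file with `readxl::read_excel()`, assuming the readxl package is installed.

```r
# Toy stand-ins (all values invented). In practice the metadata could be read
# with readxl::read_excel("data/metadata.xlsx"), assuming readxl is installed,
# and count_columns would be colnames() of the count table.
count_columns <- c("gene_id", "gene_name", "SRR20622172", "SRR20622173")

meta <- data.frame(
  sample_name = c("SRR20622172", "SRR20622173"),
  sample_ID = c("ctrl_1", "treat_1"),
  group = c("control", "treated"),
  stringsAsFactors = FALSE
)

# 1. The three required columns must be present.
required <- c("sample_name", "sample_ID", "group")
missing_cols <- setdiff(required, colnames(meta))

# 2. Every sample_name must appear as a column of the count table.
unmatched <- setdiff(meta$sample_name, count_columns)

# 3. sample_ID values should be unique, since they become the sample labels.
ids_unique <- !anyDuplicated(meta$sample_ID)

# Stop with an informative message if any check fails.
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
if (length(unmatched) > 0) {
  stop("Samples not in count table: ", paste(unmatched, collapse = ", "))
}
if (!ids_unique) {
  stop("Duplicated sample_ID values found")
}
```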
e. Open RStudio and create a new R script (‘File’ → ‘New File’ → ‘R Script’). Then choose ‘File’ → ‘Save’ and save the script in the analysis workshop folder you created in step a (NOT in the ‘data’ folder). Give the script file a name (e.g. DESeq2.R).
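As one possible way to begin the new script, the first lines could record where the two data files live and confirm that both were copied. This is only a sketch: the variable names are hypothetical, and the paths assume you used the folder names suggested in steps a and b.

```r
# Hypothetical opening lines of the script. Paths assume the folder names
# suggested in steps a and b; adjust them if you chose different names.
workshop_dir <- "H:/workshop/2024/rnaseq/DE_analysis_workshop"

counts_file <- file.path(workshop_dir, "data", "salmon.merged.gene_counts.tsv")
metadata_file <- file.path(workshop_dir, "data", "metadata.xlsx")

# Both values should be TRUE once the two files have been copied.
file.exists(c(counts_file, metadata_file))
```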