2024-2: 5a.1 Preparing your data for DE

1a. Count Table

a. First, create a new folder in H:\workshop\2024\rnaseq and give it a suitable name, such as ‘DE_analysis_workshop’.

 

b. Inside the new folder, create a subfolder called ‘data’. This is where your two data files (the count table and the sample table) will be stored. The full path should be:

H:\workshop\2024\rnaseq\DE_analysis_workshop\data
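If you prefer to create the folders from a Unix-style shell rather than the file browser, the layout from steps a and b can be sketched as follows. This is only an illustration: `base` is a temporary stand-in directory (an assumption made so the example runs anywhere), and you would point it at your own workshop location instead.

```shell
# Temporary stand-in for the workshop location (hypothetical; substitute your own path).
base=$(mktemp -d)
project="$base/DE_analysis_workshop"

# -p creates the project folder and its 'data' subfolder in one step,
# and does not fail if they already exist.
mkdir -p "$project/data"

# List the folders that were created to confirm the layout.
find "$project" -type d
```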

Examine your count table

The basis for the differential expression analysis is a count table of sequence reads mapped to defined gene regions, with one column of counts per sample. There are a variety of methods to generate this count table, but for this exercise we will use the output from the Nextflow nf-core/rnaseq analysis you completed in the previous workshop sessions.

To access this count table:

Go to the W:\training\2024\rnaseq\runs\run3_RNAseq\results folder, which contains the output of the nf-core/rnaseq pipeline. The output folders from task 3 look like this:

├── results
│   ├── fastqc
│   ├── multiqc
│   ├── pipeline_info
│   ├── star_salmon
│   └── trimgalore

The count table can be found in the star_salmon folder. A list of files and folders in the star_salmon folder will look like this:

.
├── bigwig
├── deseq2_qc
├── dupradar
├── featurecounts
├── log
├── metadata.xlsx
├── picard_metrics
├── qualimap
├── rseqc
├── salmon.merged.gene_counts_length_scaled.rds
├── salmon.merged.gene_counts_length_scaled.tsv
├── salmon.merged.gene_counts.rds
├── salmon.merged.gene_counts_scaled.rds
├── salmon.merged.gene_counts_scaled.tsv
├── salmon.merged.gene_counts.tsv <---- We will use this feature counts file for DESeq2 expression analysis.
├── salmon.merged.gene_tpm.tsv
├── salmon.merged.transcript_counts.rds
├── salmon.merged.transcript_counts.tsv
├── salmon.merged.transcript_tpm.tsv
├── samtools_stats
├── SRR20622172
├── SRR20622172.markdup.sorted.bam
├── SRR20622172.markdup.sorted.bam.bai
├── SRR20622173
├── SRR20622173.markdup.sorted.bam
├── SRR20622173.markdup.sorted.bam.bai
├── SRR20622174
├── SRR20622174.markdup.sorted.bam
├── SRR20622174.markdup.sorted.bam.bai
├── SRR20622175
├── SRR20622175.markdup.sorted.bam
├── SRR20622175.markdup.sorted.bam.bai
├── SRR20622176
├── SRR20622176.markdup.sorted.bam
├── SRR20622176.markdup.sorted.bam.bai
├── SRR20622177
├── SRR20622177.markdup.sorted.bam
├── SRR20622177.markdup.sorted.bam.bai
├── SRR20622178
├── SRR20622178.markdup.sorted.bam
├── SRR20622178.markdup.sorted.bam.bai
├── SRR20622179
├── SRR20622179.markdup.sorted.bam
├── SRR20622179.markdup.sorted.bam.bai
├── SRR20622180
├── SRR20622180.markdup.sorted.bam
├── SRR20622180.markdup.sorted.bam.bai
├── stringtie
└── tx2gene.tsv

The expression count file that we are interested in is salmon.merged.gene_counts.tsv. To preview its first few lines from a terminal, run:

head salmon.merged.gene_counts.tsv

The count table looks like this:

 

c. Copy the count table (the ‘salmon.merged.gene_counts.tsv’ file) to the ‘data’ folder you just created.
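Once the file is copied, a quick look at its header and size can confirm that it is the table you expect. The sketch below builds a tiny stand-in file with made-up values so that it runs on its own; with the real file you would set `counts` to your copy of salmon.merged.gene_counts.tsv. It assumes the first two columns hold gene annotations (here named gene_id and gene_name) and that every remaining column is a sample, which you should confirm against your own header.

```shell
# Toy stand-in for salmon.merged.gene_counts.tsv (made-up values, illustration only).
workdir=$(mktemp -d)
counts="$workdir/salmon.merged.gene_counts.tsv"
printf 'gene_id\tgene_name\tSRR20622172\tSRR20622173\n' >  "$counts"
printf 'geneA\tnameA\t10\t0\n'                          >> "$counts"
printf 'geneB\tnameB\t3\t7\n'                           >> "$counts"

# List the column names, one per line (first line of the file, split on tabs).
head -n 1 "$counts" | tr '\t' '\n'

# Count the sample columns, assuming the first two columns are gene annotations.
n_samples=$(head -n 1 "$counts" | awk -F'\t' '{ print NF - 2 }')

# Count the genes: every line except the header.
n_genes=$(( $(wc -l < "$counts") - 1 ))

echo "samples: $n_samples, genes: $n_genes"
```

The number of sample columns should equal the number of samples in your experiment (nine SRR accessions appear in the star_salmon listing above).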

1d. Sample Table - metadata

In the same W:\training\2024\rnaseq\runs\run3_RNAseq\results\star_salmon directory there is a file called metadata.xlsx. You would normally create this file yourself to match your sample IDs and treatment groups, but we have already created it for you. This sample table needs three columns: ‘sample_name’, containing the sample names exactly as they appear in the count table (its column names); ‘sample_ID’, the (less messy) names you want to use for the samples in this analysis workflow; and ‘group’, the treatment group each sample belongs to. The contents of this file look like this:

Copy this file to your ‘data’ folder as well.
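Because the ‘sample_name’ column has to match the column names of the count table, it can be worth comparing the two before moving on. The sketch below is one way to do that from a shell, under two loudly stated assumptions: the sample table has been exported from Excel to a tab-separated file (the name metadata.tsv is hypothetical), and the first two columns of the count table are gene annotations. All values in the toy files, including the sample_ID and group entries, are made up for illustration.

```shell
# Toy files (made-up values). metadata.tsv is a hypothetical tab-separated export of metadata.xlsx.
workdir=$(mktemp -d)
printf 'gene_id\tgene_name\tSRR20622172\tSRR20622173\n' >  "$workdir/counts.tsv"
printf 'sample_name\tsample_ID\tgroup\n'                >  "$workdir/metadata.tsv"
printf 'SRR20622172\tctrl_1\tcontrol\n'                 >> "$workdir/metadata.tsv"
printf 'SRR20622173\ttreat_1\ttreated\n'                >> "$workdir/metadata.tsv"

# Sample names from the count table header (skip the two assumed annotation columns).
head -n 1 "$workdir/counts.tsv" | tr '\t' '\n' | tail -n +3 | sort > "$workdir/count_samples.txt"

# Sample names from the first column of the sample table (skip its header row).
tail -n +2 "$workdir/metadata.tsv" | cut -f 1 | sort > "$workdir/meta_samples.txt"

# comm -3 prints names found in only one of the two lists; empty output means they agree.
mismatches=$(comm -3 "$workdir/count_samples.txt" "$workdir/meta_samples.txt")
if [ -z "$mismatches" ]; then
  echo "sample names match"
else
  echo "mismatched sample names:"
  echo "$mismatches"
fi
```

Any name printed as a mismatch appears in only one of the two files, usually because of a typo or stray whitespace in the sample table.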

 

e. Open RStudio and create a new R script (‘File’ → ‘New File’ → ‘R Script’). Then select ‘File’ → ‘Save’ and save the script in the analysis workshop folder you created in step a (NOT in the ‘data’ folder). Give the script file a name (e.g. DESeq2.R).

 

