- 1a. Count Table
- - Examine your count table
- 1d. Sample Table - metadata

1a. Count Table

a. First, create a new folder in H:\workshop\2024\rnaseq . Call it something suitable, such as ‘DE_analysis_workshop’.

b. Create a subfolder here called ‘data’. This is where your two data files will be stored. The full path to this folder will then be:

H:\workshop\2024\rnaseq\DE_analysis_workshop\data
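If you prefer to create these folders from a terminal rather than a file browser, the two steps above can be sketched as follows. This assumes the H:\ drive corresponds to your $HOME directory on the cluster, which is inferred from the cp commands used later in this section. WORKSHOP_DIR is a hypothetical variable name introduced only for this example.

```shell
# Assumption: H:\ maps to $HOME on the cluster (inferred from the cp commands in this section).
# WORKSHOP_DIR is a hypothetical variable; override it if your folder lives elsewhere.
WORKSHOP_DIR="${WORKSHOP_DIR:-$HOME/workshop/2024/rnaseq/DE_analysis_workshop}"

# -p creates the workshop folder and its 'data' subfolder in one step,
# and does not complain if they already exist
mkdir -p "$WORKSHOP_DIR/data"

# Confirm that the folder is there
ls -ld "$WORKSHOP_DIR/data"
```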

Examine your count table

The basis for the differential expression analysis is a count table of sequence reads mapped to defined gene regions per sample. There are a variety of methods to generate this count table, but for this exercise we will use the output from the Nextflow nf-core/rnaseq pipeline you ran in the previous workshop sessions.

To access this count table:

Go to the W:\training\2024\rnaseq\runs\run3_RNAseq\results folder, which contains the results from running the nf-core/rnaseq pipeline. The output folders from task 3 look like this:

├── results
│   ├── fastqc
│   ├── multiqc
│   ├── pipeline_info
│   ├── star_salmon
│   └── trimgalore

The count table can be found in the star_salmon folder. A list of files and folders in the star_salmon folder will look like this:

.
├── bigwig
├── deseq2_qc
├── dupradar
├── featurecounts
├── log
├── metadata.xlsx
├── picard_metrics
├── qualimap
├── rseqc
├── salmon.merged.gene_counts_length_scaled.rds
├── salmon.merged.gene_counts_length_scaled.tsv
├── salmon.merged.gene_counts.rds
├── salmon.merged.gene_counts_scaled.rds
├── salmon.merged.gene_counts_scaled.tsv
├── salmon.merged.gene_counts.tsv  <---- We will use this gene count file for the DESeq2 differential expression analysis.
├── salmon.merged.gene_tpm.tsv
├── salmon.merged.transcript_counts.rds
├── salmon.merged.transcript_counts.tsv
├── salmon.merged.transcript_tpm.tsv
├── samtools_stats
├── SRR20622172
├── SRR20622172.markdup.sorted.bam
├── SRR20622172.markdup.sorted.bam.bai
├── SRR20622173
├── SRR20622173.markdup.sorted.bam
├── SRR20622173.markdup.sorted.bam.bai
├── SRR20622174
├── SRR20622174.markdup.sorted.bam
├── SRR20622174.markdup.sorted.bam.bai
├── SRR20622175
├── SRR20622175.markdup.sorted.bam
├── SRR20622175.markdup.sorted.bam.bai
├── SRR20622176
├── SRR20622176.markdup.sorted.bam
├── SRR20622176.markdup.sorted.bam.bai
├── SRR20622177
├── SRR20622177.markdup.sorted.bam
├── SRR20622177.markdup.sorted.bam.bai
├── SRR20622178
├── SRR20622178.markdup.sorted.bam
├── SRR20622178.markdup.sorted.bam.bai
├── SRR20622179
├── SRR20622179.markdup.sorted.bam
├── SRR20622179.markdup.sorted.bam.bai
├── SRR20622180
├── SRR20622180.markdup.sorted.bam
├── SRR20622180.markdup.sorted.bam.bai
├── stringtie
└── tx2gene.tsv

The expression count file that we are interested in is salmon.merged.gene_counts.tsv. To preview its first 10 lines, run the following from inside the star_salmon folder:

head salmon.merged.gene_counts.tsv

The count table looks like this:

gene_id	gene_name	SRR20622172	SRR20622173	SRR20622174	SRR20622175	SRR20622176	SRR20622177	SRR20622178	SRR20622179	SRR20622180
ENSMUSG00000000001	Gnai3	7086	4470	2457.002	2389	6398	2744	2681	3961	4399
ENSMUSG00000000003	Pbsn	0	0	0	0	0	0	0	0	0
ENSMUSG00000000028	Cdc45	1232.999	827	42	57	1036	55	78	88	89
ENSMUSG00000000031	H19	200	139	2	0	143.622	1	17.082	24	16.077
ENSMUSG00000000037	Scml2	70	57.001	8	8	66.999	16	23	27.999	29
ENSMUSG00000000049	Apoh	0	0	1	0	2	2	1	3	0
ENSMUSG00000000056	Narf	1933	1480	519	497	1730	539	365	458	536
ENSMUSG00000000058	Cav2	6008	3417	1347.001	1344	5482	1367	2669.001	4358	4365.832
ENSMUSG00000000078	Klf6	3809	2732	4413.001	3483.978	3559	4491	3209	3980	4626
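Beyond looking at the first few lines, it can help to summarise the table before analysis. The sketch below counts the sample columns, the genes, and the genes with zero counts in every sample. It runs on a small demo file built from three of the rows shown above so that it is self-contained. COUNTS is a hypothetical variable name; point it at your own salmon.merged.gene_counts.tsv to summarise the real table.

```shell
# Demo table built from rows shown above; for real data, set COUNTS to the path of
# salmon.merged.gene_counts.tsv instead of creating this temporary file.
COUNTS="$(mktemp)"
printf 'gene_id\tgene_name\tSRR20622172\tSRR20622173\tSRR20622174\n' > "$COUNTS"
printf 'ENSMUSG00000000001\tGnai3\t7086\t4470\t2457.002\n' >> "$COUNTS"
printf 'ENSMUSG00000000003\tPbsn\t0\t0\t0\n' >> "$COUNTS"
printf 'ENSMUSG00000000028\tCdc45\t1232.999\t827\t42\n' >> "$COUNTS"

# Sample columns = all columns minus gene_id and gene_name
N_SAMPLES=$(awk -F'\t' 'NR == 1 { print NF - 2 }' "$COUNTS")

# Genes = every row after the header
N_GENES=$(awk 'NR > 1' "$COUNTS" | wc -l | tr -d ' ')

# Genes whose counts sum to zero across all samples
N_ALL_ZERO=$(awk -F'\t' 'NR > 1 { s = 0; for (i = 3; i <= NF; i++) s += $i; if (s == 0) n++ } END { print n + 0 }' "$COUNTS")

echo "samples=$N_SAMPLES genes=$N_GENES all_zero=$N_ALL_ZERO"
```

On the demo file this prints samples=3 genes=3 all_zero=1. Note that some counts are not whole numbers (for example 2457.002), because Salmon reports estimated counts.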

c. Copy the count table (the ‘salmon.merged.gene_counts.tsv’ file) to the ‘data’ folder you just created.

cp /work/training/2024/rnaseq/runs/run3_RNAseq/results/star_salmon/salmon.merged.gene_counts.tsv $HOME/workshop/2024/rnaseq/DE_analysis_workshop/data/

1d. Sample Table - metadata

In the same W:\training\2024\rnaseq\runs\run3_RNAseq\results\star_salmon directory there is a file called metadata.xlsx. You would normally need to create this file yourself to match your sample IDs and treatment groups, but we have already created it for you. This sample table needs three columns: ‘sample_name’, containing the sample names as they appear in the count table (its column names); ‘sample_ID’, containing the (less messy) names you want to use for the samples in this analysis workflow; and ‘group’, containing the treatment group each sample belongs to. The contents of this file look like this:

sample_name	sample_ID	group
SRR20622174	DC1	Differentiated_cells
SRR20622175	DC2	Differentiated_cells
SRR20622177	DC3	Differentiated_cells
SRR20622178	BC1	Basal_cells
SRR20622179	BC2	Basal_cells
SRR20622180	BC3	Basal_cells
SRR20622172	mTEC1	Murine_tracheal_epithelial_cell
SRR20622173	mTEC2	Murine_tracheal_epithelial_cell
SRR20622176	mTEC3	Murine_tracheal_epithelial_cell
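Because the ‘sample_name’ column has to match the column names of the count table, a quick consistency check can catch typos early. The sketch below is one way to do that from the terminal. Since shell tools cannot read .xlsx files directly, it assumes you have exported the sample table to a tab-separated file; metadata.tsv and counts.tsv are hypothetical file names, and the demo files are built from a few of the values shown above.

```shell
WORK="$(mktemp -d)"

# Demo inputs; metadata.tsv stands in for a hypothetical tab-separated export of metadata.xlsx
printf 'gene_id\tgene_name\tSRR20622172\tSRR20622173\tSRR20622174\n' > "$WORK/counts.tsv"
printf 'sample_name\tsample_ID\tgroup\n' > "$WORK/metadata.tsv"
printf 'SRR20622174\tDC1\tDifferentiated_cells\n' >> "$WORK/metadata.tsv"
printf 'SRR20622172\tmTEC1\tMurine_tracheal_epithelial_cell\n' >> "$WORK/metadata.tsv"
printf 'SRR20622173\tmTEC2\tMurine_tracheal_epithelial_cell\n' >> "$WORK/metadata.tsv"

# Sample names from the count table header (column 3 onward), one per line, sorted
head -n 1 "$WORK/counts.tsv" | tr '\t' '\n' | tail -n +3 | sort > "$WORK/count_samples.txt"

# Sample names from the first metadata column (header skipped), sorted
tail -n +2 "$WORK/metadata.tsv" | cut -f 1 | sort > "$WORK/meta_samples.txt"

# comm -3 lists names that appear in only one of the two files; no output means they agree
MISMATCH="$(comm -3 "$WORK/count_samples.txt" "$WORK/meta_samples.txt")"
if [ -z "$MISMATCH" ]; then
  echo "sample names match"
else
  echo "mismatched sample names: $MISMATCH"
fi
```

The row order of the sample table does not need to match the column order of the count table for this check, because both lists are sorted before they are compared.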

Copy this file to your ‘data’ folder as well.

cp /work/training/2024/rnaseq/runs/run3_RNAseq/results/star_salmon/metadata.xlsx $HOME/workshop/2024/rnaseq/DE_analysis_workshop/data/
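As an optional alternative to the two separate cp commands, both files can be copied and checked in one go. This is a minimal sketch rather than part of the workshop material: copy_inputs is a hypothetical helper name, and the SRC and DEST defaults are the paths taken from the cp commands in this section.

```shell
# copy_inputs SRC DEST: copy the count table and the sample table into DEST,
# then confirm that each file arrived and is not empty.
copy_inputs() {
  src="$1"
  dest="$2"
  mkdir -p "$dest" || return 1
  for f in salmon.merged.gene_counts.tsv metadata.xlsx; do
    cp "$src/$f" "$dest/" || return 1
    [ -s "$dest/$f" ] || { echo "missing or empty: $dest/$f" >&2; return 1; }
  done
  ls -l "$dest"
}

# Default paths taken from the cp commands in this section; override if yours differ
SRC="${SRC:-/work/training/2024/rnaseq/runs/run3_RNAseq/results/star_salmon}"
DEST="${DEST:-$HOME/workshop/2024/rnaseq/DE_analysis_workshop/data}"

# Only attempt the copy where the results folder is actually available
if [ -d "$SRC" ]; then
  copy_inputs "$SRC" "$DEST"
fi
```

If either file is missing at the source, the function stops and returns a non-zero status instead of continuing silently.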

e. Open RStudio and create a new R script (‘File’ → ‘New File’ → ‘R Script’). Now select ‘File’ → ‘Save’ and save the script in the analysis workshop folder you created in step a (NOT in the ‘data’ folder). Give the script file a name (e.g. DESeq2.R).


2024-2: 5a.1 Preparing your data for DE
