Overview of today’s session:
During Session 2 a basic generic RNA-seq pipeline has been run, without specifying additional parameters that can ensure the removal of sequence biases to have a more precise estimation of gene expression (feature counts). In this session, we will do the following tasks:
inspect the results from Session 2
run an advanced RNA-seq pipeline to measure the expression of genes
(optional) run statistical analysis to identify differentially expressed genes
Task 1: Evaluation of RNA-seq results using a basic (generic) nextflow pipeline
The nextflow/RNA-seq pipeline automatically generates two output folders:
results - contains the main outputs generated by each of the pipeline steps. Most users only need to look at the files contained in this folder.
work - this folder contains all the files generated during the running of the pipeline including intermediate files. Most users do not need to look at the content in this folder, unless the pipeline did not run properly.
To view the results from the completed pipeline, enter the run folder (i.e., run1_star_salmon)
Code Block |
---|
#access the run folder for your samples. For example:
cd run1_star_salmon
#then access the results folder
cd results |
In the results folder you will find the following sub-folders:
Code Block |
---|
fastqc/
trimgalore/
multiqc/
star_salmon/
pipeline_info/ |
FASTQC Report - assessing the quality of input reads
Connect to the work folder via HPC-FS (See session 2). Browse to the fastqc output folder: run1_star_salmon → results → fastqc. Then click on the HTML reports for each file to assess the quality of raw data. You may also copy the files to your laptop by simply drag-and-drop to a relevant folder.
The main items to verify are denoted below.
Per base sequence quality:
Inspect the overall quality of the generated data per nucleotide position.
Reads with a quality score above 20 (Q20) are 90.0% accurate, and those with >= Q30 are 99.9% accurate.
For most applications, it is recommended to set a quality trimming score of 30. Note, by default the pipeline will remove poor quality reads and bases below Q20.
Per base sequence content:
Determine if biases in the distribution of A, T, C, and G nucleotides are present on either the 5'-end and 3'-end of the reads
Recommendation: remove the first 10 nucleotides from the 5'-end (hexamer primer bias during PCR amplification) and 2 nucleotides from the 3'-end of reads (these bases can interfere with the proper mapping of reads onto reference genomes/transcriptomes).
Check other items reported in the FASTQC report such as level of duplication, highly abundant sequences, and presence of adapter sequences.
MultiQC Report - provides an overview of the quality, trimming, mapping, PCA, and many informative statistics of all files in the experiment in a single report.
Connect to the work folder via HPC-FS (See session 2). Browse to the fastqc output folder: run1_star_salmon → results → multiqc.
Task 2: Run the nextflow nf-core/rnaseq pipeline by including advanced filtering parameters
Requirements:
index.csv → a file that provides a list of sample IDs and their associated FASTQ files (read 1 and read 2)
launch.pbs → a script to submit the job to the HPC cluster
Example index.csv file for nf-core/rnaseq version 3.3:
Code Block |
---|
group,fastq_1,fastq_2,strandedness
control_r1,/work/kenna_team/data/raw_data/SRR1039508_1.fastq.gz,/work/kenna_team/data/raw_data/SRR1039508_2.fastq.gz,unstranded
dex_r1,/work/kenna_team/data/raw_data/SRR1039509_1.fastq.gz,/work/kenna_team/data/raw_data/SRR1039509_2.fastq.gz,unstranded
control_r2,/work/kenna_team/data/raw_data/SRR1039512_1.fastq.gz,/work/kenna_team/data/raw_data/SRR1039512_2.fastq.gz,unstranded
dex_r2,/work/kenna_team/data/raw_data/SRR1039513_1.fastq.gz,/work/kenna_team/data/raw_data/SRR1039513_2.fastq.gz,unstranded
control_r3,/work/kenna_team/data/raw_data/SRR1039516_1.fastq.gz,/work/kenna_team/data/raw_data/SRR1039516_2.fastq.gz,unstranded
dex_r3,/work/kenna_team/data/raw_data/SRR1039517_1.fastq.gz,/work/kenna_team/data/raw_data/SRR1039517_2.fastq.gz,unstranded
control_r4,/work/kenna_team/data/raw_data/SRR1039520_1.fastq.gz,/work/kenna_team/data/raw_data/SRR1039520_2.fastq.gz,unstranded
dex_r4,/work/kenna_team/data/raw_data/SRR1039521_1.fastq.gz,/work/kenna_team/data/raw_data/SRR1039521_2.fastq.gz,unstranded
|
Example of a launch.pbs script with advanced parameter options:
Code Block |
---|
#!/bin/bash -l
#PBS -N nfrnaseq
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00
#Use the current directory to run the workflow
cd $PBS_O_WORKDIR
module load java
NXF_OPTS='-Xms1g -Xmx4g'
#run the nextflow RNA-seq pipeline:
nextflow run nf-core/rnaseq -profile singularity -r 3.3 --aligner star_salmon --input index.csv --genome GRCh38 --clip_r1 10 --clip_r2 10 --three_prime_clip_r1 2 --three_prime_clip_r2 2 --save_trimmed -resume
#allow access to others in the group
chmod -R g+rwX results
chmod -R g+rwX work |
Session 3 exercises:
Run the nf-core/rnaseq pipeline using the Airway smooth muscle public data (PMID: 24926665. GEO: GSE52778) - aligner option set to ‘star_salmon’
Same as above but aligner option set to ‘star_rsem’
Create a new working folder:
Code Block |
---|
mkdir session3
cd session3
mkdir run1_star_salmon
cd run1_star_salmon
|
Copy index.csv and launch.pbs files to the newly created folder
Code Block |
---|
cp /work/kenna_team/scripts/star_salmon/session3/* . |
Check that files were copied into the new working folder
Code Block |
---|
ls -a
./ ../ index.csv launch.pbs
#verify the content of index.csv
cat index.csv
#also check the PBS Pro submission script
cat launch.pbs
|
Run the workflow:
Code Block |
---|
qsub launch.pbs |
Monitor the progress of the workflow:
Code Block |
---|
qjobs |
or
Code Block |
---|
qstats -u userID |
Repeat the above process for ‘star_rsem’
The only variation is copying the index.csv and launch.pbs script. As follows:
Code Block |
---|
cp /work/kenna_team/scripts/star_rsem/session3/* . |
Task 3 - Differential gene expression analysis
Differential expression analysis using https://maayanlab.cloud/biojupies/analyze
From session 2, we will use the feature counts generated using the ‘star_salmon’ parameter setting. Either copy the relevant files (see below) to your laptop from the HPC or download the files below.
File | File name | File | ||||
---|---|---|---|---|---|---|
Gene feature counts | salmon_merged.gene_counts.tsv |
| ||||
Transcript feature counts | salmon_merged.transcript_counts.tsv |
|
Reading:
https://www.nature.com/articles/srep25533
Create index file using Unix command line:
Code Block |
---|
#Raw data folder
cd /work/kenna_team/data/raw_data
#STEP1: get list of read 1 files and save it into a file1.txt file
ls *_1* > file1.txt
#STEP2: get list of read 2 files and save it into a file2.txt file
ls *_2* > file2.txt
#(Optional) Alternative to the above steps, when color in terminal adds non ASCII character to file names
ls --color=never *_1* > file1.txt
ls --color=never *_2* > file2.txt
#STEP3: add path to FASTQ files into the file1.txt and file2.txt files
cat file1.txt | awk '{print "/work/kenna_team/data/raw_data/" $1}' > file1.mod
cat file2.txt | awk '{print "/work/kenna_team/data/raw_data/" $1}' > file2.mod
#STEP4: merge modified file1.mod and file2.mod into the same file
paste file1.mod file2.mod > combine_file1_2.mod
#STEP5: create the names of samples manually using vi or nano. Alternatively, can do this using excel.
vi sample_names.txt
#STEP6: combine sample_names.txt and combine_file1_2.mod files
paste sample_names.txt combine_file1_2.mod > combine_sampleName_file1_2.mod
#STEP7: finalise the index.csv file
cat combine_sampleName_file1_2.mod | awk '{print $1 "," $2 "," $3 ",unstranded" }' > index.csv |
Follow up task:
Once the session 3 run_start_salmon has completed, run task 3 above with the newly generated feature counts. Do you observe the same findings?