Session 3: Advanced RNA-seq pipeline

Overview of today’s session:

In Session 2 we ran a basic, generic RNA-seq pipeline without the additional parameters that remove sequence biases and give a more precise estimate of gene expression (feature counts). In this session, we will do the following tasks:

- inspect the results from Session 2
- run an advanced RNA-seq pipeline to measure the expression of genes
- (optional) run statistical analysis to identify differentially expressed genes

Task 1: Evaluation of the RNA-seq results from the basic (generic) Nextflow pipeline

The nextflow/RNA-seq pipeline automatically generates two output folders:

- results: contains the main outputs generated by each pipeline step. Most users only need to look at the files in this folder.
- work: contains all files generated while the pipeline runs, including intermediate files. Most users do not need to look at this folder unless the pipeline did not run properly.

To view the results from the completed pipeline, enter the run folder (e.g., run1_star_salmon):

#access the run folder for your samples. For example:
cd run1_star_salmon

#then access the results folder
cd results

In the results folder you will find the following sub-folders:

fastqc/
trimgalore/
multiqc/
star_salmon/
pipeline_info/
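
As a quick sanity check, the sketch below reports which of the sub-folders listed above are present. The function name check_results_dirs is hypothetical (it is not part of the pipeline), and the folder list is simply the one shown above.

```shell
# Hypothetical helper (not part of the pipeline): report which of the
# expected sub-folders exist inside a results folder.
check_results_dirs() {
    results_dir="${1:-.}"   # defaults to the current folder
    missing=0
    for d in fastqc trimgalore multiqc star_salmon pipeline_info; do
        if [ -d "$results_dir/$d" ]; then
            echo "found:   $d/"
        else
            echo "missing: $d/"
            missing=1
        fi
    done
    # Exit status is 0 only when every expected sub-folder was found
    return "$missing"
}

# Example (run from inside the results folder):
# check_results_dirs .
```

A missing folder usually means the corresponding pipeline step did not finish, in which case the work folder is the place to look next.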

FastQC Report - assessing the quality of input reads

Connect to the work folder via HPC-FS (see Session 2). Browse to the fastqc output folder: run1_star_salmon → results → fastqc. Then click on the HTML report for each file to assess the quality of the raw data. You may also copy the files to your laptop by dragging and dropping them into a folder of your choice.
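
If you prefer the terminal, the reports can be listed before opening them via HPC-FS. This is a minimal sketch: list_html_reports is a hypothetical helper, and it assumes only that the reports are HTML files somewhere under the given folder.

```shell
# Hypothetical helper: list every HTML report found under a folder
# (searches sub-folders as well), in sorted order.
list_html_reports() {
    find "${1:-.}" -type f -name '*.html' | sort
}

# Example (path taken from the run folder used in this session):
# list_html_reports run1_star_salmon/results/fastqc
```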

The main items to check are listed below.

Per base sequence quality:
- Inspect the overall quality of the generated data per nucleotide position.
- Bases with a quality score of 20 (Q20) are 99% accurate (1 error in 100), and those with a score of 30 (Q30) or above are at least 99.9% accurate (1 error in 1,000).
- For most applications, a quality trimming score of 30 is recommended. Note that, by default, the pipeline removes poor-quality reads and bases below Q20.
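
Phred quality scores relate to the base-call error probability as P = 10^(-Q/10). The sketch below tabulates the resulting accuracy for a few scores; phred_accuracy is a hypothetical helper name, used here only for illustration.

```shell
# Hypothetical helper: convert a Phred quality score Q into the
# per-base accuracy (%), using error probability P = 10^(-Q/10).
phred_accuracy() {
    awk -v q="$1" 'BEGIN { printf "%.2f\n", (1 - 10^(-q / 10)) * 100 }'
}

# Print the accuracy for three common thresholds
for q in 10 20 30; do
    echo "Q$q: $(phred_accuracy "$q")% accurate"
done
```

Each step of 10 in the score reduces the error rate tenfold, which is why trimming at Q30 is considerably stricter than trimming at Q20.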
Per base sequence content:
- Determine whether the distribution of A, T, C, and G nucleotides is biased at either the 5'-end or the 3'-end of the reads.
- Recommendation: remove the first 10 nucleotides from the 5'-end (random hexamer priming bias introduced during cDNA synthesis) and 2 nucleotides from the 3'-end of the reads (these bases can interfere with the proper mapping of reads onto reference genomes/transcriptomes).
Check the other items in the FastQC report, such as the level of duplication, highly abundant (overrepresented) sequences, and the presence of adapter sequences.
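
To make the clipping recommendation concrete, the sketch below hard-clips a single made-up read with awk. It only illustrates the effect on one sequence; in a real run the trimming tool does this for every read (Trim Galore, for instance, provides the --clip_R1 and --three_prime_clip_R1 options for this purpose). The function name clip_read and the example sequence are hypothetical.

```shell
# Hypothetical helper: remove bases from the 5'-end and the 3'-end of
# a read sequence (illustration only, not how the pipeline is invoked).
clip_read() {
    # $1 = sequence, $2 = bases to clip at 5'-end, $3 = bases to clip at 3'-end
    printf '%s\n' "$1" |
        awk -v five="$2" -v three="$3" \
            '{ print substr($0, five + 1, length($0) - five - three) }'
}

# Made-up 30-nt read: clip 10 nt from the 5'-end and 2 nt from the 3'-end
clip_read "AAAAAAAAAACCCCCCCCCCGGGGGGGGTT" 10 2
```

The 30-nt example read is reduced to 18 nt, so very short reads may drop below the minimum length kept by the trimming step.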

MultiQC Report - a single report that summarises quality, trimming, mapping, PCA, and other informative statistics for all files in the experiment.

Connect to the work folder via HPC-FS (see Session 2). Browse to the multiqc output folder: run1_star_salmon → results → multiqc.