Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Overview of today’s session:

...

  • inspect the results from Session 2

  • run an advanced RNA-seq pipeline to measure the expression of genes

  • (optional) run statistical analysis to identify differentially expressed genes

Task 1: Evaluation of RNA-seq results using a basic (generic) nextflow pipeline

The nextflow/RNA-seq pipeline automatically generates two output folders:

...

Code Block
fastqc/
trimgalore/
multiqc/
star_salmon/
pipeline_info/

Fastqc FASTQC Report - assessing the quality of input reads

For example; Read1:

View file
namecontrol_r1_1_fastqc.html

Read 2:

...

Connect to the work folder via HPC-FS (See session 2). Browse to the fastqc output folder: run1_star_salmonresultsfastqc. Then click on the HTML reports for each file to assess the quality of raw data. You may also copy the files to your laptop by simply drag-and-drop to a relevant folder.

The main items to verify are denoted below.

  • Per base sequence quality:

    • Inspect the overall quality of the generated data per nucleotide position.

    • Reads with a quality score above 20 (Q20) are 90.0% accurate, and those with >= Q30 are 99.9% accurate.

    • For most applications, it is recommended to set a quality trimming score of 30. Note, by default the pipeline will remove poor quality reads and bases below Q20.

  • Per base sequence content:

    • Determine if biases in the distribution of A, T, C, and G nucleotides are present on either the 5'-end and 3'-end of the reads

    • Recommendation: remove the first 10 nucleotides from the 5'-end (hexamer primer bias during PCR amplification) and 2 nucleotides from the 3'-end of reads (these bases can interfere with the proper mapping of reads onto reference genomes/transcriptomes).

  • Check other items reported in the FASTQC report such as level of duplication, highly abundant sequences, and presence of adapter sequences.

MultiQC Report - provides an overview of the quality, trimming, mapping, PCA, and many informative statistics of all files in the experiment in a single report.

Connect to the work folder via HPC-FS (See session 2). Browse to the fastqc output folder: run1_star_salmonresultsmultiqc.

Task 2: Run the nextflow nf-core/rnaseq pipeline by including advanced filtering parameters

Requirements:

  • index.csv → a file that provides a list of sample IDs and their associated FASTQ files (read 1 and read 2)

  • launch.pbs → a script to submit the job to the HPC cluster

Example index.csv file for nf-core/rnaseq version 3.3:

Code Block
group,fastq_1,fastq_2,strandedness
control_r1,/work/kenna_team/data/raw_data/SRR1039508_1.fastq.gz,/work/kenna_team/data/raw_data/SRR1039508_2.fastq.gz,unstranded
dex_r1,/work/kenna_team/data/raw_data/SRR1039509_1.fastq.gz,/work/kenna_team/data/raw_data/SRR1039509_2.fastq.gz,unstranded
control_r2,/work/kenna_team/data/raw_data/SRR1039512_1.fastq.gz,/work/kenna_team/data/raw_data/SRR1039512_2.fastq.gz,unstranded
dex_r2,/work/kenna_team/data/raw_data/SRR1039513_1.fastq.gz,/work/kenna_team/data/raw_data/SRR1039513_2.fastq.gz,unstranded
control_r3,/work/kenna_team/data/raw_data/SRR1039516_1.fastq.gz,/work/kenna_team/data/raw_data/SRR1039516_2.fastq.gz,unstranded
dex_r3,/work/kenna_team/data/raw_data/SRR1039517_1.fastq.gz,/work/kenna_team/data/raw_data/SRR1039517_2.fastq.gz,unstranded
control_r4,/work/kenna_team/data/raw_data/SRR1039520_1.fastq.gz,/work/kenna_team/data/raw_data/SRR1039520_2.fastq.gz,unstranded
dex_r4,/work/kenna_team/data/raw_data/SRR1039521_1.fastq.gz,/work/kenna_team/data/raw_data/SRR1039521_2.fastq.gz,unstranded

Example of a launch.pbs script with advanced parameter options:

Code Block
#!/bin/bash -l
#PBS -N nfrnaseq
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#Use the current directory to run the workflow
cd $PBS_O_WORKDIR

module load java
NXF_OPTS='-Xms1g -Xmx4g'

#run the nextflow RNA-seq pipeline:
nextflow run nf-core/rnaseq -profile singularity -r 3.3 --aligner star_salmon --input index.csv --genome GRCh38 --clip_r1 10 --clip_r2 10 --three_prime_clip_r1 2 --three_prime_clip_r2 2 --save_trimmed -resume

#allow access to others in the group
chmod -R g+rwX results
chmod -R g+rwX work

Session 3 exercises:

  1. Run the nf-core/rnaseq pipeline using the Airway smooth muscle public data (PMID: 24926665. GEO: GSE52778) - aligner option set to ‘star_salmon

  2. Same as above but aligner option set to ‘star_rsem

Create a new working folder:

Code Block
mkdir session3
cd session3

mkdir run1_star_salmon
cd run1_star_salmon

Copy index.csv and launch.pbs files to the newly created folder

Code Block
cp /work/kenna_team/scripts/star_salmon/session3/* .

Check that files were copied into the new working folder

Code Block
ls -a
./ ../ index.csv  launch.pbs

#verify the content of index.csv
cat index.csv

#also check the PBS Pro submission script
cat launch.pbs

Run the workflow:

Code Block
qsub launch.pbs

Monitor the progress of the workflow:

Code Block
qjobs

or

Code Block
qstats -u userID

Repeat the above process for ‘star_rsem’

The only variation is copying the index.csv and launch.pbs script. As follows:

Code Block
cp /work/kenna_team/scripts/star_rsem/session3/* .

Task 3 - Differential gene expression analysis

Differential expression analysis using https://maayanlab.cloud/biojupies/analyze

From session 2, we will use the feature counts generated using the ‘star_salmon’ parameter setting. Either copy the relevant files (see below) to your laptop from the HPC or download the files below.

File

File name

File

Gene feature counts

salmon_merged.gene_counts.tsv

View file
namesalmon.merged.gene_counts.tsv

Transcript feature counts

salmon_merged.transcript_counts.tsv

View file
namesalmon.merged.transcript_counts.tsv

Reading:

https://www.nature.com/articles/srep25533

Create index file using Unix command line:

Code Block
#Raw data folder
cd /work/kenna_team/data/raw_data

#STEP1: get list of read 1 files and save it into a file1.txt file 
ls *_1* > file1.txt

#STEP2: get list of read 2 files and save it into a file2.txt file  
ls *_2* > file2.txt
 
#(Optional) Alternative to the above steps, when color in terminal adds non ASCII character to file names
ls --color=never *_1* > file1.txt
ls --color=never *_2* > file2.txt
 
#STEP3: add path to FASTQ files into the file1.txt and file2.txt files
cat file1.txt | awk '{print "/work/kenna_team/data/raw_data/" $1}' > file1.mod

cat file2.txt | awk '{print "/work/kenna_team/data/raw_data/" $1}' > file2.mod
 
#STEP4: merge modified file1.mod and file2.mod into the same file
paste file1.mod file2.mod > combine_file1_2.mod
 
#STEP5: create the names of samples manually using vi or nano. Alternatively, can do this using excel.
vi sample_names.txt
 
#STEP6: combine sample_names.txt and combine_file1_2.mod files
paste sample_names.txt combine_file1_2.mod > combine_sampleName_file1_2.mod       
 
#STEP7: finalise the index.csv file
cat combine_sampleName_file1_2.mod | awk '{print $1 "," $2 "," $3 ",unstranded" }' > index.csv

Follow up task:

Once the session 3 run_start_salmon has completed, run task 3 above with the newly generated feature counts. Do you observe the same findings?