Version 3.11.2

This page guides QUT users through installing and running the Nextflow nf-core/rnaseq workflow on the HPC.

Pre-requisites

Basic Unix command line knowledge (examples: https://researchcomputing.princeton.edu/education/external-online-resources/linux ; https://swcarpentry.github.io/shell-novice/ )
- https://sandbox.bio/
Familiarity with one of the Unix text editors (for example, Vi/Vim or Nano):
- VIM ( https://bioinformatics.uconn.edu/vim-guide/ ; https://missing.csail.mit.edu/2020/editors/ )
- Nano ( https://engineering.purdue.edu/ECN/Support/KB/Docs/BasictutorialforNanou ; https://www.howtogeek.com/howto/42980/the-beginners-guide-to-nano-the-linux-command-line-text-editor/ )
An account on QUT’s HPC. Apply for a new HPC account here.
R tutorials:
- https://girke.bioinformatics.ucr.edu/GEN242/tutorials/

Install Nextflow

The nf-core/rnaseq workflow requires Nextflow to be installed in your account on the HPC. Find details on how to install and test Nextflow here. You will also need to prepare a nextflow.config file and a PBS Pro submission script for Nextflow pipelines.

Additional information is available here: https://nf-co.re/usage/installation

Additional details on the workflow can be found at:

Overview: https://nf-co.re/rnaseq/3.11.2

Usage: https://nf-co.re/rnaseq/3.11.2/usage

GitHub: https://github.com/nf-core/rnaseq

Pipeline Summary

The pipeline is built using Nextflow and processes data using the following steps:

Preprocessing
- cat - Merge re-sequenced FastQ files
- FastQC - Raw read QC
- UMI-tools extract - UMI barcode extraction
- TrimGalore - Adapter and quality trimming
- fastp - Adapter and quality trimming
- BBSplit - Removal of genome contaminants
- SortMeRNA - Removal of ribosomal RNA
Alignment and quantification
- STAR and Salmon - Fast splice-aware genome alignment and transcriptome quantification
- STAR via RSEM - Alignment and quantification of expression levels
- HISAT2 - Memory-efficient splice-aware alignment to a reference
Alignment post-processing
- SAMtools - Sort and index alignments
- UMI-tools dedup - UMI-based deduplication
- picard MarkDuplicates - Duplicate read marking
Other steps
- StringTie - Transcript assembly and quantification
- BEDTools and bedGraphToBigWig - Create bigWig coverage files
Quality control
- RSeQC - Various RNA-seq QC metrics
- Qualimap - Various RNA-seq QC metrics
- dupRadar - Assessment of technical / biological read duplication
- Preseq - Estimation of library complexity
- featureCounts - Read counting relative to gene biotype
- DESeq2 - PCA plot and sample pairwise distance heatmap and dendrogram
- MultiQC - Present QC for raw reads, alignment, read counting and sample similarity
Pseudo-alignment and quantification
- Salmon - Wicked fast gene and isoform quantification relative to the transcriptome
Workflow reporting and genomes
- Reference genome files - Saving reference genome indices/files
- Pipeline information - Report metrics generated during the workflow execution

Getting Started

Download and run the workflow using the minimal test data provided by nf-core/rnaseq. We recommend using ‘singularity’ as the profile on QUT’s HPC. The ‘conda’ profile is another option. Note: the ‘docker’ profile is unavailable on the HPC.

nextflow run nf-core/rnaseq -profile test,singularity --outdir results -r 3.11.2

Preparing a ‘samplesheet.csv’ file

Prepare a sample sheet file that specifies the input files to be used. To do this, we use an nf-core script to generate the ‘samplesheet.csv’ file as follows (setting strandedness to auto allows the pipeline to determine the strandedness of your RNA-seq data automatically):

#load python 3.10
module load python/3.10.8-gcccore-12.2.0

#download script and make executable
wget -L https://raw.githubusercontent.com/nf-core/rnaseq/master/bin/fastq_dir_to_samplesheet.py
chmod +x fastq_dir_to_samplesheet.py

#generate the samplesheet.csv file
./fastq_dir_to_samplesheet.py /path/to/directory/containing/fastq_files/ samplesheet.csv \
    --strandedness auto \
    --read1_extension _R1.fastq.gz \
    --read2_extension _R2.fastq.gz
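
Before generating the samplesheet, it can help to confirm that every forward-read file has a matching reverse-read file. The function below is a minimal sketch, not part of nf-core: the name find_unpaired_r1 is hypothetical, and it assumes the _R1.fastq.gz / _R2.fastq.gz naming used in the command above.

```shell
# Hypothetical helper (not part of nf-core): list R1 files that lack a matching R2 file.
# Assumes paired files are named <prefix>_R1.fastq.gz and <prefix>_R2.fastq.gz.
find_unpaired_r1() {
    local dir="$1" r1 r2
    for r1 in "$dir"/*_R1.fastq.gz; do
        # If the glob matched nothing, the literal pattern is returned; skip it
        [ -e "$r1" ] || continue
        # Derive the expected R2 name by swapping the suffix
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        [ -e "$r2" ] || echo "$r1"
    done
}

# Example: find_unpaired_r1 /path/to/directory/containing/fastq_files
```

If the function prints nothing, every R1 file in the directory has a partner.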

Example of 'samplesheet.csv' required for nf-core/rnaseq pipeline version 3.11.2:

sample,fastq_1,fastq_2,strandedness
control_1,/path/to/directory/containing/fastq_files/control-1_R1.fastq.gz,/path/to/directory/containing/fastq_files/control-1_R2.fastq.gz,auto
control_2,/path/to/directory/containing/fastq_files/control-2_R1.fastq.gz,/path/to/directory/containing/fastq_files/control-2_R2.fastq.gz,auto
control_3,/path/to/directory/containing/fastq_files/control-3_R1.fastq.gz,/path/to/directory/containing/fastq_files/control-3_R2.fastq.gz,auto
infected_1,/path/to/directory/containing/fastq_files/infected-1_R1.fastq.gz,/path/to/directory/containing/fastq_files/infected-1_R2.fastq.gz,auto
infected_2,/path/to/directory/containing/fastq_files/infected-2_R1.fastq.gz,/path/to/directory/containing/fastq_files/infected-2_R2.fastq.gz,auto
infected_3,/path/to/directory/containing/fastq_files/infected-3_R1.fastq.gz,/path/to/directory/containing/fastq_files/infected-3_R2.fastq.gz,auto
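
A samplesheet edited by hand is prone to copy-and-paste slips, such as the same FASTQ file being listed for two different samples. The sketch below flags any FASTQ path that appears more than once. The function name find_duplicate_fastqs is hypothetical, and it assumes the four-column comma-separated layout shown above with no quoted fields.

```shell
# Hypothetical helper: print every FASTQ path that occurs more than once in a samplesheet.
# Assumes the header sample,fastq_1,fastq_2,strandedness and plain comma-separated fields.
find_duplicate_fastqs() {
    local sheet="$1"
    # Skip the header, emit the non-empty fastq_1 and fastq_2 values one per line,
    # then keep only the values that are repeated
    awk -F',' 'NR > 1 { if ($2 != "") print $2; if ($3 != "") print $3 }' "$sheet" \
        | sort | uniq -d
}

# Example: find_duplicate_fastqs samplesheet.csv
```

No output means each FASTQ file is referenced exactly once.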

Preparing to run on the HPC

To run this on the HPC, a PBS submission script needs to be created. For example, create a file called launch.pbs using a text editor of your choice (e.g., vi or nano), then copy and paste the code below:

#!/bin/bash -l
#PBS -N nfrna2
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

#load java and set up memory settings to run nextflow
#(export the variable so that the nextflow process can see it)
module load java
export NXF_OPTS='-Xms1g -Xmx4g'

#run the rnaseq pipeline
nextflow run nf-core/rnaseq \
      -profile singularity \
      -r 3.11.2 \
      --input samplesheet.csv \
      --genome GRCh38 \
      --outdir results \
      --aligner star_salmon

This script will run version 3.11.2 of the nf-core/rnaseq pipeline on the RNA-seq data listed in the ‘samplesheet.csv’ file. Apart from the input samplesheet, the only truly compulsory pipeline parameter is the output directory. However, you must specify the ‘singularity’ profile to run it on this HPC. In addition, you need to select a reference genome. We recommend using one from the AWS iGenomes repository if available (you can find the list of available genomes in this config file), but other reference genome options are available too. Finally, in version 3.11.2, setting the ‘aligner’ parameter is unnecessary unless you want to use an option other than ‘star_salmon’ (the default). However, specifying it is not a mistake.

To submit the script to PBS, follow the instructions at the bottom of the page (section Submitting the job).

Before you do, however, consider first running the pipeline in a mode that only assesses the reads, and then adjusting the Trim Galore options (described below).

Reads preprocessing

We recommend running the Nextflow nf-core/rnaseq pipeline once and then inspecting the fastqc results folder to check whether sequence biases are present at the 5' and 3' ends of the reads. In version 3.11.2, there is no option to run only the quality control processes. Instead, it is possible to force the pipeline to skip read trimming and alignment and to quantify the data using pseudo-aligned reads (pink path on the image above), which significantly reduces the run time of this first pass. To do so, add the --skip_trimming and --skip_alignment flags to your nextflow run nf-core/rnaseq command and select which method should perform pseudo-alignment with --pseudo_aligner:

nextflow run nf-core/rnaseq \
      -profile singularity \
      -r 3.11.2 \
      --input samplesheet.csv \
      --outdir results \
      --genome GRCh38 \
      --skip_trimming \
      --skip_alignment \
      --pseudo_aligner salmon

Then, we can use the PBS script below to tell the pipeline to remove a defined number of bases from the 5'-end (--clip_r1 or --clip_r2) or the 3'-end (--three_prime_clip_r1 or --three_prime_clip_r2) of the reads. We can also choose to remove ribosomal RNA, as these sequences are non-informative (more details about this and other read filtering options are in the guide).

You can experiment with different clipping options. To do this, use the nextflow run nf-core/rnaseq command with --skip_alignment, as you did when only assessing the quality of the reads, but this time do not use the --skip_trimming flag. For example, if the FastQC report suggests that you only need to clip 10 bases from the 5' end, modify the nextflow run nf-core/rnaseq command in the PBS script as follows:

nextflow run nf-core/rnaseq --input samplesheet.csv \
        --outdir results \
        -r 3.11.2 \
        --genome GRCh38 \
        -profile singularity \
        --extra_trimgalore_args "--clip_r1 10 --clip_r2 10 " \
        --skip_alignment \
        --pseudo_aligner salmon
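
When trying several clipping settings, it is easy to mistype the quoted option string. The sketch below assembles the value for --extra_trimgalore_args from four numbers, using only the option names mentioned above. The function name build_clip_args is hypothetical, and it assumes each argument is a non-negative whole number.

```shell
# Hypothetical helper: build the value for --extra_trimgalore_args from four clip lengths.
# Arguments, in order: 5' clip for read 1, 5' clip for read 2,
#                      3' clip for read 1, 3' clip for read 2.
# A value of 0, or an omitted argument, leaves that option out.
build_clip_args() {
    local args="" name value
    for name in --clip_r1 --clip_r2 --three_prime_clip_r1 --three_prime_clip_r2; do
        # Take the next positional argument, defaulting to 0 when none is left
        value="${1:-0}"
        if [ "$#" -gt 0 ]; then
            shift
        fi
        if [ "$value" -gt 0 ]; then
            args="$args $name $value"
        fi
    done
    # Drop the leading space before printing
    echo "${args# }"
}

# Example:
#   nextflow run nf-core/rnaseq ... --extra_trimgalore_args "$(build_clip_args 10 10)"
```

Calling build_clip_args 10 10 yields the same option string as the command above, without the trailing space.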

Adjusting the Trim Galore options

Once the initial trimming is done, check whether any further clipping is needed, then run the nf-core/rnaseq pipeline with all steps enabled. For example:

nextflow run nf-core/rnaseq --input samplesheet.csv \
        --outdir results \
        -r 3.11.2 \
        --genome GRCh38 \
        -profile singularity \
        --aligner star_salmon \
        --extra_trimgalore_args "--clip_r1 12 --clip_r2 12 --three_prime_clip_r1 2 --three_prime_clip_r2 2 "

Submitting the job

Once you have created the folder for the run, the samplesheet.csv file, nextflow.config, and launch.pbs, you are ready to submit.
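
A short pre-submission check can confirm that the three files listed above are in place. This is a minimal sketch: the function name check_run_files is hypothetical, and it assumes all three files sit directly in the run folder.

```shell
# Hypothetical helper: verify that the files needed for a run exist in a folder.
# Prints each missing file to standard error and returns non-zero if any is absent.
check_run_files() {
    local dir="${1:-.}" missing=0 f
    for f in samplesheet.csv nextflow.config launch.pbs; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: only submit when everything is present
#   check_run_files . && qsub launch.pbs
```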

Submit the run with this command (on Lyra):

qsub launch.pbs

Monitoring the Run

You can use the command

qstat -u $USER

Alternatively, use the command

qjobs

to check on the jobs you are running. Nextflow will launch additional jobs during the run.

You can also check the .nextflow.log file for details on what is going on.
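
The log can be long, so filtering it for warnings and errors is often the quickest way to see what went wrong. The sketch below assumes that each log line carries its level (WARN or ERROR) as a separate word, as in "... [main] ERROR ...". The function name show_log_problems is hypothetical.

```shell
# Hypothetical helper: print the most recent WARN and ERROR lines from a Nextflow log.
# Usage: show_log_problems [logfile] [max_lines]
# Defaults: .nextflow.log in the current folder, last 20 matching lines.
show_log_problems() {
    local log="${1:-.nextflow.log}" count="${2:-20}"
    # Match the level as a whole word surrounded by spaces, then keep the latest matches
    grep -E ' (WARN|ERROR) ' "$log" | tail -n "$count"
}

# Example: show_log_problems .nextflow.log 50
```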

Finally, if you have configured a connection to Nextflow Tower, you can log in there and check on your run.

Troubleshooting

If the run was interrupted, a particular step did not complete, or you want to modify a parameter for a particular step, you do not need to re-run every process: add the -resume flag and Nextflow will resume the workflow, reusing the results of steps that have already completed.

nextflow run nf-core/rnaseq --input samplesheet.csv \
        --outdir results \
        -r 3.11.2 \
        --genome GRCh38 \
        -profile singularity \
        --aligner star_salmon \
        --extra_trimgalore_args "--clip_r1 12 --clip_r2 12 --three_prime_clip_r1 2 --three_prime_clip_r2 2 " \
        -resume