Task 3: Run nf-core/rnaseq pipeline

Aim:

Demonstrate how to run the nextflow nf-core/rnaseq pipeline in the HPC cluster. Initially, running a test and then processing real-life data.

Create a folder for running the nf-RNA-seq pipeline

Create a new folder under your ‘myname’ folder called nextflow, and also create a subfolder called ‘run1’ for the first test of the pipeline:

mkdir myname/nextflow
mkdir myname/nextflow/run1

Go to the subfolder to run the RNAseq test

cd myname/nextflow/run1

Case 1: Running a test

Download and run the workflow using minimal data provided by nf-core/rnaseq. We recommend using singularity as the profile for QUT’s HPC. Another profile option can be ‘conda.’ Note: the profile option ‘docker’ is unavailable on the HPC.

nextflow run nf-core/rnaseq -profile test,singularity --outdir results -r 3.12.0

To execute the above command in the HPC cluster prepare a PBS Pro submission script as follows:

#!/bin/bash -l
#PBS -N nfrna2
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

#load java and set up memory settings to run nextflow
module load java
NXF_OPTS='-Xms1g -Xmx4g'

nextflow run nf-core/rnaseq -profile test,singularity --outdir results -r 3.12.0

Submitting the job

Once you have created the folder for the run, the samplesheet.csv file, nextflow.config, and launch.pbs, you are ready to submit.

Submit the run with this command (On Lyra)

qsub launch.pbs

Monitoring the Run

You can use the command

qstat -u $USER

Alternatively, use the command

qjobs

to check on the jobs you are running. Nextflow will launch additional jobs during the run.

You can also check the .nextflow.log file for details on what is going on.

Case 2: Run RNA-seq QC check

Preparing a ‘samplesheet.csv’ file

Prepare a sample sheet file that specifies the input files to be used. To do this, we use an nf-core script to generate the ‘samplesheet.csv’ file as follows (setting strandedness to auto allows the pipeline to determine the strandedness of your RNA-seq data automatically):

Load module Python 3.10

module load python/3.10.8-gcccore-12.2.0

Download the script for creating the ‘samplesheet.csv’ metadata file.

wget -L https://raw.githubusercontent.com/nf-core/rnaseq/master/bin/fastq_dir_to_samplesheet.py
chmod +x fastq_dir_to_samplesheet.py

If you do not know already, determine the path to where the FASTQ files were downloaded/are located.

cd /myname/data

#then run the following command
pwd

Using a text editor (i.e., VIM), edit the following code with the appropriate path to the files:

#generate the samplesheet.csv file
./fastq_dir_to_samplesheet.py /path/to/directory/containing/fastq_files/ samplesheet.csv \
    --strandedness auto \
    --read1_extension _R1.fastq.gz \
    --read2_extension _R2.fastq.gz

Example of 'samplesheet.csv' required for nf-core/rnaseq pipeline version 3.12.0:

sample,fastq_1,fastq_2,strandedness
control_1,/path/to/directory/containing/fastq_files/control-1_R1.fastq.gz,/path/to/directory/containing/fastq_files/control-1_R2.fastq.gz,auto
control_2,/path/to/directory/containing/fastq_files/control-2_R1.fastq.gz,/path/to/directory/containing/fastq_files/control-2_R2.fastq.gz,auto
control_3,/path/to/directory/containing/fastq_files/control-3_R1.fastq.gz,/path/to/directory/containing/fastq_files/control-3_R2.fastq.gz,auto
infected_1,/path/to/directory/containing/fastq_files/infected-1_R1.fastq.gz,/path/to/directory/containing/fastq_files/infected-1_R2.fastq.gz,auto
infected_2,/path/to/directory/containing/fastq_files/infected-1_R1.fastq.gz,/path/to/directory/containing/fastq_files/infected-2_R2.fastq.gz,auto
infected_3,/path/to/directory/containing/fastq_files/infected-1_R1.fastq.gz,/path/to/directory/containing/fastq_files/infected-3_R2.fastq.gz,auto

Prepare the following ‘launch_QC.pbs’ script:

#!/bin/bash -l
#PBS -N nfrnaseq_QC
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

#load java and set up memory settings to run nextflow
module load java
NXF_OPTS='-Xms1g -Xmx4g'

nextflow run nf-core/rnaseq \
      -profile singularity \
      -r 3.12.0 \
      --input samplesheet.csv \
      --outdir results \
      --genome GRCh38 \
      --skip_trimming \
      --skip_alignment \
      --skip_pseudo_alignment

We recommend running the nextflow nf-core/rnaseq pipeline once and then assessing the fastqc results folder to assess if sequence biases are present in the 5'-end and 3'-end ends of the sequences. Version 3.12.0 allows running the pipeline to do quality assessment only, without any alignment, read counting or trimming. To execute that option, add the following flags to your nextflow run nf-core/rnaseq command: --skip_trimming, --skip_alignment and --skip_pseudo_alignment.

Submitting the job

Once you have created the folder for the run, the samplesheet.csv file, nextflow.config, and launch.pbs, you are ready to submit.

Submit the run with this command

qsub launch.pbs

Monitoring the Run

You can use the command

qstat -u $USER

Alternatively, use the command

qjobs

to check on the jobs, you are running. Nextflow will launch additional jobs during the run.

You can also check the .nextflow.log file for details on what is going on.

Assessing QC report

Evaluate the nucleotide distributions in the 5'-end and 3'-end of the sequenced reads (Read1 and Read 2). Define how many nucleotides should be trimmed from each region of the sequences. This will inform the parameter setting for Case 3 below.

Case 3: Run RNA-seq pipeline

Adjusting the Trim Galore options

When the initial trimming is done, verify if any more clipping needs to be done and run the nf-core/rnaseq pipeline that will perform all the steps. For example:

nextflow run nf-core/rnaseq --input samplesheet.csv \
        --outdir results \
        -r 3.12.0 \
        --genome GRCh38 \
        -profile singularity \
        --aligner star_salmon \
        --extra_trimgalore_args "--clip_r1 10 --clip_r2 10 --three_prime_clip_r1 1 --three_prime_clip_r2 1 "

Submitting the job

Once you have created the folder for the run, the samplesheet.csv file, nextflow.config, and launch.pbs, you are ready to submit.

Submit the run with this command (On Lyra)

qsub launch.pbs

Monitoring the Run

You can use the command

qstat -u $USER

Alternatively, use the command

qjobs

to check on the jobs, you are running. Nextflow will launch additional jobs during the run.

You can also check the .nextflow.log file for details on what is going on.