3. Running pipelines

Fetching pipeline code

The pull command allows you to download the latest version of a project from a GitHub repository or to update it if that repository has previously been downloaded in your home directory.

nextflow pull nf-core/<pipeline>

Please note that Nextflow will also automatically fetch the pipeline code the first time you run the command below:

nextflow run nf-core/<pipeline>

For reproducibility, it is good practice to explicitly specify the pipeline version that you wish to use with the -revision (or -r) flag.

In the example below we are pulling version 3.12.0 of the rnaseq pipeline:

nextflow pull nf-core/rnaseq -revision 3.12.0

Downloaded pipeline projects are stored in the folder $HOME/.nextflow/assets.
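
To see which pipelines are already cached, the built-in nextflow list command prints them. The sketch below does something similar by inspecting the assets folder directly. It assumes the default owner/repository folder layout, and the function name list_cached_pipelines is illustrative only, not part of Nextflow. The NXF_ASSETS environment variable is honoured if it is set.

```shell
# Assumed default location; NXF_ASSETS overrides it when set.
assets_dir="${NXF_ASSETS:-$HOME/.nextflow/assets}"

# Print each cached project as <owner>/<repository>.
# (list_cached_pipelines is a hypothetical helper name.)
list_cached_pipelines() {
    local dir="$1" project
    for project in "$dir"/*/*/; do
        [ -d "$project" ] || continue   # the glob matched nothing
        project="${project%/}"          # drop the trailing slash
        echo "${project#"$dir"/}"       # drop the assets folder prefix
    done
}

list_cached_pipelines "$assets_dir"
```

Nothing is printed if no pipeline has been downloaded yet.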

Software requirements for pipelines

The software dependencies of a Nextflow pipeline are provided through Docker, Singularity or Conda. Nextflow itself handles downloading the containers and creating the Conda environments. You choose which one to use with the -profile {docker,singularity,conda} option when you run Nextflow.

At QUT, we use Singularity, so we specify: -profile singularity.
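
Putting the two options together, a run that pins the version and selects the container engine can be assembled as below. This is a minimal sketch: the variable names are illustrative, the version is the one pulled above, and the test profile is added so that no input data is needed. The command is only printed, not executed.

```shell
# Illustrative variable names; adjust the values to your own pipeline.
pipeline="nf-core/rnaseq"
revision="3.12.0"            # pinned version, for reproducibility
profile="test,singularity"   # profiles are comma-separated, without spaces

run_cmd="nextflow run $pipeline -r $revision -profile $profile --outdir results"
echo "$run_cmd"
```

Keeping the version in one variable makes it easy to record exactly which release produced a given set of results.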

Install and test that the pipeline installed successfully

Pipelines generally include test code that can be run to make sure installation was successful.

From the command line

By running the Nextflow pipeline on the command line, you can follow the progress of the analysis in real time.

As a first exercise we will download and run the nf-core/smrnaseq pipeline, which is a bioinformatics best-practice analysis pipeline for small RNA-Seq. We will use the test data provided by the developers to ensure the pipeline installed successfully. This control dataset contains 8 samples.

Run the following commands:
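
Before launching, it can help to confirm that the required tools are available in your session. The helper below (check_tools is a hypothetical name, not part of Nextflow) reports which commands are on your PATH. The tool list is an assumption based on this workshop's setup; on some systems the tools must first be loaded with module load.

```shell
# Report whether each named command can be found on the PATH.
# Returns non-zero if at least one is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Assumed tool list for this session.
check_tools nextflow singularity java || echo "Load or install the missing tools before running the pipeline."
```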

cd $HOME/workshop/2024-2/session3
mkdir smrnaseq_cl
cd smrnaseq_cl
export NXF_OPTS='-Xms1g -Xmx4g'
nextflow pull file:///work/training/smrnaseq
nextflow run file:///work/training/smrnaseq -profile test,singularity --outdir results -r 2.3.1

Line 1: Move to the directory created for this workshop.
Line 2: Make a temporary folder called smrnaseq_cl in which Nextflow will test the smrnaseq pipeline.
Line 3: Change directory to the newly created folder smrnaseq_cl.
Line 4: In some cases, the Nextflow Java virtual machine can request a large amount of memory. This line caps the Java heap at 4 GB to limit this.
Line 5: Download the pipeline code.
Line 6: Run the test code, using version 2.3.1 of the pipeline.
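
The export keyword on line 4 matters: without it, the variable stays local to your shell and programs started from that shell, such as Nextflow, never see it. A small demonstration, using a child shell as a stand-in for Nextflow:

```shell
unset NXF_OPTS
NXF_OPTS='-Xms1g -Xmx4g'                        # set, but not exported
before=$(sh -c 'echo "${NXF_OPTS:-not set}"')   # what a child process sees

export NXF_OPTS                                 # now passed on to child processes
after=$(sh -c 'echo "${NXF_OPTS:-not set}"')

echo "without export: $before"   # without export: not set
echo "with export:    $after"    # with export:    -Xms1g -Xmx4g
```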

This will download the smrnaseq pipeline and then run the test code. It should take ~20-30 minutes to run to completion.

Nextflow will first download the pipeline:

It will then display the version of the pipeline which was downloaded: version 2.3.1.

It will also list all the parameters that differ from the pipeline default.

Before running a process, it will download the required singularity images and required reference and input files for testing.

In the screenshot below, all the jobs which will be run are listed.

We can see that 7 jobs have started:

FASTQ_FASTQC_UMITOOLS_FASTP:FASTQC_RAW: 3 jobs are running, the latest job that started is for sample Clone9_N1
FASTQ_FASTQC_UMITOOLS_FASTP:FASTP: 3 jobs are running, the latest job that started is for sample Clone9_N1
INDEX_GENOME (genome.fa): 1 job has started

At the bottom you can see that 6 files (including test fastq.gz input files and reference files) have also been downloaded.

The jobs have been submitted to the PBS queue.

You can check the full list of jobs that have been submitted at any point in time by opening a separate terminal and using the command:

qstat -u $USER

It will display the jobs that have been submitted to the PBS queue.

For example, in the first screenshot, 3 jobs are queued.

And in the second screenshot, 6 jobs are running and 2 are queued.
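
To summarise the queue without reading every row, the job states can be tallied. This sketch assumes the tabular layout printed by qstat -u on PBS Professional, where the state letter (Q for queued, R for running) is in the tenth column. Check the column position on your system, as it can differ between schedulers and versions. The function name count_job_states is illustrative.

```shell
# Read qstat output on standard input and print one "<state> <count>" line per state.
# Assumption: the job state is the 10th whitespace-separated column.
count_job_states() {
    # Job rows begin with a numeric job ID; header rows are skipped.
    awk '$1 ~ /^[0-9]/ { n[$10]++ } END { for (s in n) print s, n[s] }' | sort
}

# Only query the scheduler when qstat is actually available.
if command -v qstat >/dev/null 2>&1; then
    qstat -u "$USER" | count_job_states
fi
```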

Not familiar with checking PBS job status? Please refer to the Checking on the Job Status section in the Intro to HPC.

Going back to the terminal from which you launched the Nextflow analysis, you can check the nextflow log to see how the analysis is progressing.

For example, in the screenshot below, taken halfway through the Nextflow analysis, several processes have run to completion for all 8 samples tested.

For instance, the process FASTQ_FASTQC_UMITOOLS_FASTP:FASTQC_RAW appears as 100% completed (i.e. 8 of 8 samples).

The process NFCORE_SMRNASEQ:MIRNA_QUANT:BOWTIE_MAP_MATURE is running for sample Control_N2_mature. This is the first sample of the batch to go through this process.

The process NFCORE_SMRNASEQ:MIRNA_QUANT:BOWTIE_MAP_SEQCLUSTER has already completed for 2 samples and it is running for a third sample Clone9_N3_seqcluster.

This is the output you should get when your Nextflow job has run to completion.

At the bottom, the message ‘Pipeline completed successfully’ will be printed along with the duration, the CPU hours and the number of jobs that ran to completion.

You will see that Nextflow created 2 folders (results and work) if you run the command:

ls

You can inspect the results which have been output by typing:

ls results

You will see that the pipeline has placed results under different folders matching the steps/processes that were run:

bbsplit  edger  fastp  fastqc  genome  index  mirtop  mirtrace  multiqc  pipeline_info  salmon  samtools  star_salmon  trimgalore  unmapped

You can browse a couple of results folders to check what sort of outputs were generated by the pipeline.
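
One quick way to find outputs worth opening first is to list the HTML reports, such as those typically written to the multiqc and fastqc folders. A minimal sketch, assuming you are in the folder that contains results (find_html_reports is an illustrative helper name):

```shell
# List every HTML report below a results folder, sorted by path.
find_html_reports() {
    local dir="${1:-results}"
    [ -d "$dir" ] || { echo "no such folder: $dir" >&2; return 1; }
    find "$dir" -type f -name '*.html' | sort
}

# Only search when the results folder exists in the current directory.
if [ -d results ]; then
    find_html_reports results
fi
```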

Tip: If you are having trouble running the nf-core/smrnaseq pipeline, some pre-computed results are provided under the folder /work/training/nextflow_intro/smrnaseq_cl.

You will learn more about how to run the nf-core/smrnaseq pipeline in session 6.

Launching Nextflow using a PBS script

Launching the Nextflow pipeline from the command line let us follow what the pipeline does in real time. However, you must keep the terminal from which you launched the analysis open until the analysis is done.

Now that you have learnt how to run Nextflow from the command line, we will use a PBS script to download the nf-core/rnaseq pipeline and test it using the developers' test data. This is the method we recommend for running Nextflow pipelines on the HPC.

This time we will download and run the nf-core/rnaseq, which is a bioinformatics pipeline that can be used to analyse RNA sequencing data obtained from organisms with a reference genome and annotation. We will also test it using the test data provided by the developers.

Move back into your home directory and create a separate rnaseq_pbs folder:

mkdir -p $HOME/workshop/2024-2/session3/rnaseq_pbs
cd $HOME/workshop/2024-2/session3/rnaseq_pbs

Create the script file rnaseq_test.sh by running the following command:

cat <<EOF > rnaseq_test.sh
#!/bin/bash -l
#PBS -N nfrnaseq_test
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=6:00:00

cd \$PBS_O_WORKDIR
module load java
export NXF_OPTS='-Xms1g -Xmx4g'
nextflow run nf-core/rnaseq -r 3.14.0 -profile test,singularity --outdir results
EOF

Line 3: Set your PBS job name to nfrnaseq_test.
Line 4: Specify the memory and CPU resources that you want to allocate to your job.
Line 5: Specify that you want to allocate 6 hours for your job to run to completion.
Line 7: Change directory to $PBS_O_WORKDIR, which is a special environment variable created by PBS. This will be the folder where you ran the qsub command.
Line 8: Load java.
Line 9: In some cases, the Nextflow Java virtual machine can request a large amount of memory. This line caps the Java heap at 4 GB to limit this.
Line 10: Run the nf-core/rnaseq pipeline using the test data provided.
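
A note on the backslash in line 7: because the heredoc delimiter EOF is unquoted, the shell would otherwise expand $PBS_O_WORKDIR while writing the file, at a point when the variable is not yet set. Escaping the dollar sign keeps it literal. Quoting the delimiter ('EOF') is an alternative that disables expansion for the whole block. The sketch below shows that both forms produce the same text:

```shell
# Unquoted delimiter: variables are expanded unless the dollar sign is escaped.
unquoted=$(cat <<EOF
cd \$PBS_O_WORKDIR
EOF
)

# Quoted delimiter: nothing inside the block is expanded.
quoted=$(cat <<'EOF'
cd $PBS_O_WORKDIR
EOF
)

echo "$unquoted"   # cd $PBS_O_WORKDIR
echo "$quoted"     # cd $PBS_O_WORKDIR
```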

You can check the content of the PBS script you just created using the command:

cat rnaseq_test.sh

Make the script executable and then submit your job to the PBS queue by running the following commands:

chmod +x rnaseq_test.sh
qsub rnaseq_test.sh

Once again you can monitor your jobs using the qstat -u $USER command.

The test should take ~30 minutes to run.

Once completed, you can check the content of the folder using the command ls.

You will see a results folder, along with 2 PBS log files: nfrnaseq_test.e[pbs_job_id] and nfrnaseq_test.o[pbs_job_id].

Not familiar with checking the output of a PBS job? Review the Checking the Output section of Submitting PBS Jobs part 2 from Intro to HPC.

The nfrnaseq_test.o* file will provide a log of the Nextflow processes. If you scroll to the bottom, you can check whether the pipeline ran successfully. You should see something similar to this:

-[nf-core/rnaseq] Pipeline completed successfully -
Completed at: 23-Sep-2024 17:29:26
Duration    : 32m 11s
CPU hours   : 0.7
Succeeded   : 194

PBS Job 10374429.pbs
CPU time  : 00:02:15
Wall time : 00:32:35
Mem usage : 1032552kb
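
The same check can be scripted. The sketch below searches the PBS output log for the success message shown above. The helper name pipeline_succeeded is illustrative, and the file pattern assumes the job name used in this session.

```shell
# Succeed (exit status 0) if the log contains the nf-core success message.
pipeline_succeeded() {
    grep -q 'Pipeline completed successfully' "$1"
}

# Check every output log of the nfrnaseq_test job in the current folder.
for log in nfrnaseq_test.o*; do
    [ -f "$log" ] || continue   # the pattern matched no file
    if pipeline_succeeded "$log"; then
        echo "$log: pipeline completed successfully"
    else
        echo "$log: success message not found, inspect the log"
    fi
done
```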

If you are having trouble running the nf-core/rnaseq pipeline, some pre-computed results are provided under the folder /work/training/nextflow_intro/rnaseq_pbs.

You will learn more about how to run the nf-core/rnaseq pipeline in session 4.

More details about options which can be used with the nextflow command can be found here.