Nextflow is a pipeline engine that can take advantage of the batch nature of the HPC environment to run bioinformatics workflows quickly and efficiently.
For more information about Nextflow, please visit Nextflow - A DSL for parallel and scalable computational pipelines.
Installing Nextflow
Nextflow is meant to run from your home folder on a Linux machine such as the HPC.
The following commands install Nextflow:
module load java
curl -s https://get.nextflow.io | bash
mv nextflow $HOME/bin
#verify Nextflow is installed
mkdir $HOME/nextflow && cd $HOME/nextflow
nextflow run hello
Line 1: The module load command ensures Java is available (Nextflow requires Java).
Line 2: This command downloads and assembles the parts of Nextflow; this step might take some time.
Line 3: When finished, the nextflow binary will be in the current folder, so it should be moved to your "bin" folder so it can be found later.
Line 5: Nextflow creates files when it runs, so make a folder to store these files.
Line 6: Verify Nextflow is working.
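If the nextflow command is not found after the move, $HOME/bin may not be on your PATH. A minimal sketch for adding it, assuming bash is your login shell:

mkdir -p $HOME/bin                                       # make sure the bin folder exists
echo 'export PATH="$HOME/bin:$PATH"' >> $HOME/.bashrc    # put it on your PATH at every login
source $HOME/.bashrc                                     # apply the change to the current session
which nextflow                                           # should now print the path to the binary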
Nextflow’s Default Configuration
Once Nextflow is installed, there are some settings that should be applied to take advantage of the HPC environment. Nextflow has a hierarchy of configuration files; the base configuration, which is applied to every workflow you run, is here:
$HOME/.nextflow/config
A sample file might look like:
(Replace $HOME with the actual path to your home folder)
process {
    executor = 'pbspro'
    beforeScript = {
        """
        source $HOME/.bashrc
        source $HOME/.profile
        """
    }
    scratch = true
    cleanup = false
}

singularity {
    cacheDir = '$HOME/.nextflow/NXF_SINGULARITY_CACHEDIR'
    autoMounts = true
}

conda {
    cacheDir = '$HOME/.nextflow/NXF_CONDA_CACHEDIR'
}
What do all these lines mean?
The process section contains the defaults that ensure Nextflow can submit jobs to the PBS scheduler on the HPC.
The singularity and conda sections: when Nextflow runs a pipeline, the software it needs is typically delivered by a conda environment or a Singularity container. Nextflow builds the environment or downloads the container into the pipeline's work directory, so every fresh run incurs a delay while this environment is built. These settings create the environment in a central location instead, speeding up subsequent runs. Nextflow does not understand the $HOME variable, so please replace it with the full path to your home folder; you can find this after you log on with the command 'pwd'.
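For example, one hypothetical way to make that substitution, assuming the config file already exists at $HOME/.nextflow/config:

pwd                                                 # run just after logging on; prints your home folder path
sed -i "s|\$HOME|$HOME|g" $HOME/.nextflow/config    # replace the literal text $HOME with that full path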
Create the cache directories defined in the .nextflow/config above in your home space:
mkdir .nextflow/NXF_SINGULARITY_CACHEDIR
mkdir .nextflow/NXF_CONDA_CACHEDIR
Cut and paste the following text into your Linux command line and hit 'enter'. This will create the $HOME/.nextflow/config file for you, with $HOME expanded to the full path of your home folder, so that Nextflow can run correctly.
[[ -d $HOME/.nextflow ]] || mkdir -p $HOME/.nextflow
cat <<EOF > $HOME/.nextflow/config
singularity {
    cacheDir = '$HOME/.nextflow/NXF_SINGULARITY_CACHEDIR'
    autoMounts = true
}
conda {
    cacheDir = '$HOME/.nextflow/NXF_CONDA_CACHEDIR'
}
process {
    executor = 'pbspro'
    scratch = true
    cleanup = false
}
EOF
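You can then confirm the file was written and that $HOME was expanded to your full home path:

cat $HOME/.nextflow/config    # the cacheDir and other paths should now show your full home folder path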
Preparing Data
Typically, the pipeline you want to run will expect its input data to be prepared in a particular way. You can check the pipeline's help or website for a guide. Help is typically accessed with:
nextflow run nf-core/rnaseq --help
Some pipelines may only need file names, while others want a CSV sample sheet with file names, paths, and other information.
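As an illustration only, a sample sheet for the nf-core/rnaseq pipeline could be created as below; the file name and FASTQ paths are hypothetical placeholders, and the exact columns required are described in that pipeline's documentation:

cat <<EOF > samplesheet.csv
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,/path/to/CONTROL_REP1_R1.fastq.gz,/path/to/CONTROL_REP1_R2.fastq.gz,auto
TREATMENT_REP1,/path/to/TREATMENT_REP1_R1.fastq.gz,/path/to/TREATMENT_REP1_R2.fastq.gz,auto
EOF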
Running Nextflow
When you run Nextflow, it is a good idea to create a folder for the run; this keeps all the files separate and easy to manage.
When Nextflow runs, it creates a work folder, where all the temporary and in-progress files are stored, and a results folder, where the output of the pipeline run is typically stored.
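A hypothetical run folder and its layout after a completed run might look like the following (the folder name is a placeholder, and the exact output location depends on the pipeline):

mkdir -p $HOME/nextflow/my_rnaseq_run && cd $HOME/nextflow/my_rnaseq_run   # one folder per run
# After the pipeline finishes you would typically see:
#   work/           temporary and in-progress task folders (safe to remove once the run succeeds)
#   results/        the pipeline's final output (often set with the pipeline's --outdir option)
#   .nextflow.log   the log for the most recent run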
Once you have prepared your input data for the pipeline you are ready to run the pipeline.
Nextflow is run with the following command (after changing to the run folder):
nextflow run {pipeline name} {options}
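For example, a hypothetical invocation of the nf-core/rnaseq pipeline with a sample sheet might look like this (the sample sheet name and output folder are placeholders; check the pipeline's documentation for its required options):

nextflow run nf-core/rnaseq \
    -profile singularity \
    --input samplesheet.csv \
    --outdir results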
However, it is good practice, and much safer, to run Nextflow itself as a job submitted to the HPC. A job file might look like:
#!/bin/bash -l
#PBS -N MyNextflowRun
#PBS -l select=1:ncpus=2:mem=6gb
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR
module load java
export NXF_OPTS='-Xms1g -Xmx4g'
nextflow run nf-core/rnaseq
What do these lines mean?
Lines 1-5 are typical PBS system commands: the job is named MyNextflowRun, 2 CPUs and 6 GB of RAM are requested, the job may run for up to 24 hours, and line 5 changes into the folder the job was submitted from.
Line 6 ensures the Java environment is available (Nextflow needs Java to run).
Line 7 tells Nextflow how much RAM the Java runtime may use.
Line 8 runs Nextflow.
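Assuming the job file above is saved as, say, nextflow_run.sub (a hypothetical name), it can be submitted and monitored with the usual PBS commands:

qsub nextflow_run.sub    # submit the job; PBS prints the job ID
qstat -u $USER           # check the status of your jobs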
To see the output of Nextflow while it is running as a job, you can use Nextflow Tower.
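A minimal sketch for enabling Tower monitoring, assuming you have a Tower account and have generated an access token:

export TOWER_ACCESS_TOKEN=<your token>    # token generated from your Tower account
nextflow run nf-core/rnaseq -with-tower   # adds live monitoring for this run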