What is Nextflow?

Analysing data involves a sequence of tasks, referred to as a workflow or a pipeline. These workflows typically require executing multiple software packages, sometimes running on different computing environments, such as a desktop or a compute cluster. Traditionally, these workflows have been joined together in scripts using general-purpose programming languages such as Bash or Python. However, as workflows become larger and more complex, managing the programming logic and software becomes difficult. Nextflow is a workflow management system designed to handle this complexity.

Installing Nextflow

  1. Connect to your Lyra account. Nextflow is meant to run from your home folder on a Linux machine like the HPC.

ssh [username]@lyra.qut.edu.au

  2. Before we start using the HPC, let’s start an interactive session:

If you are not familiar with launching interactive jobs and submitting PBS jobs, please review the Submitting PBS Jobs part 1 section of the Intro to HPC.

qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=1:mem=4gb

The interactive session might take a few minutes to start.

You will see this message first:

Followed by:

You can check that your interactive window is active by running the command:

qstat -u [username]

  3. Nextflow also requires Java 11 or later to be installed. To load Java, run the following command:

module load java

Not familiar with the module function? Please review the Modules section of the Intro to HPC.
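
To confirm that the loaded Java meets the "11 or later" requirement, the version can be checked from the command line. The following is a minimal sketch, not part of the official setup: the helper name java_major is hypothetical, and it assumes that java -version prints a banner containing the version in double quotes (as OpenJDK builds do).

```shell
# Extract the major version from a Java version string.
# Handles the legacy "1.8.0_392" scheme as well as "11.0.21" or "17".
java_major() {
  local v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;      # legacy scheme: drop the leading "1."
  esac
  echo "${v%%[!0-9]*}"       # keep only the leading digits
}

if command -v java >/dev/null 2>&1; then
  # java prints its version banner on stderr, so redirect it to stdout.
  version="$(java -version 2>&1 | awk -F '"' '/version/ {print $2; exit}')"
  major="$(java_major "$version")"
  if [ "${major:-0}" -ge 11 ]; then
    echo "Java $version detected: OK for Nextflow"
  else
    echo "Java $version detected: Nextflow needs Java 11 or later"
  fi
else
  echo "java not found: run 'module load java' first"
fi
```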

  4. Finally, we will create a folder to hold all the exercises and code from today:

mkdir -p $HOME/workshop/2024-2/session3
cd $HOME/workshop/2024-2/session3

Installing Nextflow for the first time

Important: If you have already installed Nextflow on the HPC, skip this section and go directly to the next section, Updating Nextflow.

To install Nextflow for the first time, copy and paste the following block of code into your terminal (e.g. the PuTTY session that is already connected to the HPC) and press 'enter':

curl -s https://get.nextflow.io | bash
mkdir -p $HOME/bin    # make sure the destination folder exists before moving
mv nextflow $HOME/bin/
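
The nextflow launcher is moved into $HOME/bin, so the nextflow command is only found if that folder is on your PATH. Whether this is already the case on your account is an assumption worth checking. A small sketch (the helper name path_contains is hypothetical):

```shell
# Succeeds if directory $1 appears as an entry in the colon-separated list $2.
path_contains() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if path_contains "$HOME/bin" "$PATH"; then
  echo "$HOME/bin is on PATH"
else
  echo "$HOME/bin is not on PATH; for the current session you could run:"
  echo "  export PATH=\"\$HOME/bin:\$PATH\""
fi
```

Wrapping both strings in colons means only whole PATH entries match, so a folder such as $HOME/binaries is not mistaken for $HOME/bin.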

Updating Nextflow

If you have previously installed Nextflow on the HPC, update it to the latest version by running:

nextflow self-update

Check that your Nextflow installation worked

To verify that Nextflow is installed properly, you can run the following command:

nextflow info

We will also run your first Nextflow pipeline, called Hello, locally:

mkdir -p $HOME/workshop/2024-2/session3/nftemp && cd $HOME/workshop/2024-2/session3/nftemp
nextflow run hello

You should see something like this:

[Image: terminal output of the Hello pipeline run (image-20230919-021023.png)]

If you got this output, well done! You have run your first Nextflow pipeline successfully.

Troubleshooting:

  • Please note that if you have run the Hello pipeline before, you might need to update it to the latest version for it to run properly. To do so, you will need to pull the latest code first:

nextflow pull hello
nextflow run hello

  • If you see the following error message:

WARN: Cannot read project manifest -- Cause: Remote resource not found ...

It is likely that there is a typo in the command you provided (e.g. in the pipeline name): the message is telling you that Nextflow cannot find a pipeline under that name. Check your spelling and run the command again.
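
For the first troubleshooting case above, it can help to know whether a copy of hello is already stored locally. A minimal sketch, assuming Nextflow's default asset location ($NXF_HOME/assets, with NXF_HOME defaulting to $HOME/.nextflow, unless NXF_ASSETS is set) and that the short name hello resolves to the nextflow-io/hello project; the helper name pipeline_is_cached is hypothetical:

```shell
# Folder where Nextflow keeps pulled pipelines (default layout assumed).
assets_dir="${NXF_ASSETS:-${NXF_HOME:-$HOME/.nextflow}/assets}"

# Succeeds if an "owner/name" project folder exists under assets_dir.
pipeline_is_cached() {
  [ -d "$assets_dir/$1" ]
}

if pipeline_is_cached "nextflow-io/hello"; then
  echo "hello is cached locally; 'nextflow pull hello' will refresh it"
else
  echo "hello has not been downloaded yet; 'nextflow run hello' will fetch it"
fi
```

The nextflow list command reports the same information from Nextflow's side by listing the pipelines it has downloaded.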

Now that you have managed to run the hello pipeline, go back to your home directory and remove the test folder:

cd $HOME
rm -rf $HOME/workshop/2024-2/session3/nftemp

Nextflow’s base configuration

A key Nextflow feature is the ability to decouple the workflow implementation, which describes the flow of data and operations to perform on that data, from the configuration settings required by the underlying execution platform.

This enables the workflow to be portable, allowing it to run on different computational platforms such as an institutional HPC or cloud infrastructure, without needing to modify the workflow implementation.

For instance, a user can configure Nextflow so it runs the pipelines locally (i.e. on the computer where Nextflow is launched), which can be useful for developing and testing a pipeline script on your computer. This is the default setting in Nextflow.

process {
  executor = 'local'
}

You can also configure Nextflow to submit jobs to a cluster resource manager such as PBS Pro, which is the setting we will use on the HPC:

process {
  executor = 'pbspro'
}
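
The two executor settings above can also sit side by side as profiles, so the same pipeline can be switched between local and PBS Pro execution at launch time. The following is a minimal sketch rather than the workshop's configuration: the profile names local_test and hpc are hypothetical, and the file is written to a throwaway folder created with mktemp so that no existing configuration is touched.

```shell
# Work in a throwaway folder so no existing configuration is overwritten.
demo_dir="$(mktemp -d)"

# The quoted 'EOF' keeps the text literal (no shell expansion inside the config).
cat > "$demo_dir/nextflow.config" <<'EOF'
profiles {
    local_test {
        process.executor = 'local'
    }
    hpc {
        process.executor = 'pbspro'
    }
}
EOF

# Show the file that was written.
cat "$demo_dir/nextflow.config"
```

Launched from that folder, nextflow run hello -profile hpc would select the PBS Pro executor, while -profile local_test would keep execution on the machine where Nextflow is started.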

The base configuration that is applied to every Nextflow workflow you run is located in $HOME/.nextflow/config.

Once you have installed Nextflow on Lyra, there are some settings that should be applied to your $HOME/.nextflow/config to take advantage of the HPC environment at QUT.
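
The next step writes a fresh $HOME/.nextflow/config, which replaces any settings already stored in that file. If you have used Nextflow before, you may want to keep a copy first. A minimal sketch (the helper name backup_config and the .bak timestamp naming are illustrative choices, not a Nextflow convention):

```shell
# Copy an existing file to a timestamped backup and print the backup's path.
# Does nothing (and prints nothing) if the file does not exist.
backup_config() {
  local cfg="$1"
  if [ -f "$cfg" ]; then
    local backup="$cfg.bak.$(date +%Y%m%d-%H%M%S)"
    cp -p "$cfg" "$backup"
    echo "$backup"
  fi
}

backup_config "$HOME/.nextflow/config"
```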

To create a suitable config file for use on the QUT HPC, copy and paste the following text into your Linux command line and hit ‘enter’. This will make the necessary changes to your local account so that Nextflow can run correctly:

[[ -d $HOME/.nextflow ]] || mkdir -p $HOME/.nextflow

cat <<EOF > $HOME/.nextflow/config
singularity {
    cacheDir = '$HOME/.nextflow/NXF_SINGULARITY_CACHEDIR'
    autoMounts = true
}
conda {
    cacheDir = '$HOME/.nextflow/NXF_CONDA_CACHEDIR'
}
process {
  executor = 'pbspro'
  scratch = false
}
cleanup = false
includeConfig '/work/datasets/reference/nextflow/qutgenome.config'
EOF
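
Because the heredoc above uses an unquoted EOF, the shell replaces $HOME with your actual home folder as the file is written. This matters because single-quoted strings in a Nextflow config are not interpolated, so a literal $HOME left in the file would not point at your home folder. A quick check, as a sketch (the helper name has_unexpanded_home is hypothetical):

```shell
cfg="$HOME/.nextflow/config"

# Succeeds if the file still contains the literal text $HOME.
has_unexpanded_home() {
  grep -qF '$HOME' "$1"
}

if [ -f "$cfg" ]; then
  if has_unexpanded_home "$cfg"; then
    echo "Warning: $cfg still contains a literal \$HOME"
  else
    # Print the cache folder settings, which should show absolute paths.
    grep -n 'cacheDir' "$cfg" || echo "no cacheDir entries found in $cfg"
  fi
else
  echo "$cfg does not exist yet"
fi
```

Running nextflow config afterwards prints the configuration that Nextflow resolves for the current folder, which is another way to confirm that the settings are being picked up.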

More in-depth information on Nextflow configuration is available here: https://www.nextflow.io/docs/latest/config.html.