Content Comparison


...

What is Nextflow?

Analysing data involves a sequence of tasks which is referred to as a workflow or a pipeline. These workflows typically require executing multiple software packages, sometimes running on different computing environments, such as a desktop or a compute cluster. Traditionally these workflows have been joined together in scripts using general purpose programming languages such as Bash or Python. However, as workflows become larger and more complex, the management of the programming logic and software becomes difficult.

Workflow Management Systems (WfMS) such as Snakemake, Galaxy, and Nextflow have been developed specifically to manage computational data-analysis workflows in fields such as bioinformatics, imaging, physics, and chemistry. These systems contain multiple features that simplify the development, monitoring, execution and sharing of pipelines, such as:

...

Run time management

...

Software management

...

Portability & Interoperability

...

Re-entrancy

...

Nextflow is free and open-source pipeline management software that enables scalable and reproducible scientific workflows. It allows the adaptation of pipelines written in the most common scripting languages.
Key features of Nextflow that simplify the development, monitoring, execution and sharing of pipelines:
- Reproducible → version control and the use of containers ensure the reproducibility of Nextflow pipelines
- Portable → compute agnostic (e.g., HPC, cloud, desktop)
- Time and resource management
- Scalable → run from a single sample to thousands of samples
- Continuous checkpoints & re-entrancy → allows you to resume a pipeline from the last successfully executed step
- Minimal digital literacy → accessible to anyone
- Active global community → more and more Nextflow pipelines are available (e.g., https://nf-co.re/pipelines)

...

Nextflow is a pipeline engine that can take advantage of the batch nature of the HPC environment to efficiently and quickly run bioinformatic workflows.
For more information about Nextflow, please visit the Nextflow website (https://www.nextflow.io): Nextflow - A DSL for parallel and scalable computational pipelines.


Installing Nextflow

Connect to your Lyra account. Nextflow is meant to run from your home folder on a Linux machine like the HPC.

...

Code Block
ssh [username]@lyra.qut.edu.au

Before we start using the HPC, let’s start an interactive session:

Info
Not familiar with launching interactive jobs or submitting PBS jobs? Please review the Submitting PBS Jobs part 1 section of the Intro to HPC.

Code Block
qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=1:mem=4gb

The session might take a few minutes to start.

You will see this message first:

...

Followed by:

...

You can check that your interactive window is active by running the command:

Code Block
qstat -u [username]

...
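As a complementary check from inside the session itself, the following is a minimal sketch. It assumes that PBS Pro exports the PBS_ENVIRONMENT and PBS_JOBID variables into the job's shell, with PBS_ENVIRONMENT set to PBS_INTERACTIVE for sessions started with qsub -I. The function name in_interactive_job is hypothetical and used only for illustration.

```shell
# Report whether the current shell is running inside a PBS interactive job.
# Assumption: PBS sets PBS_ENVIRONMENT=PBS_INTERACTIVE for `qsub -I` sessions.
in_interactive_job() {
    [ "${PBS_ENVIRONMENT:-}" = "PBS_INTERACTIVE" ]
}

if in_interactive_job; then
    echo "Inside interactive job ${PBS_JOBID:-unknown}"
else
    echo "Not inside an interactive PBS job (possibly still on the login node)"
fi
```

If the second message appears, the interactive session has probably not started yet or has already ended.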

Nextflow also requires Java 11 or later to be installed. To load Java, run the following command:

Code Block
module load java
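To confirm that the loaded Java meets the version requirement, a small check along the following lines may help. This is a sketch only: the helper name java_major_version is hypothetical, and it assumes that `java -version` prints a line containing `version "<number>"`, which is the usual format for OpenJDK and Oracle builds.

```shell
# Extract the major Java version from `java -version` output.
# Handles both the legacy "1.8.0_292" scheme and the modern "17.0.2" scheme.
java_major_version() {
    local version
    # Pull the quoted version string, e.g. 17.0.2 or 1.8.0_292
    version=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    case "$version" in
        1.*) version=${version#1.} ;;   # legacy scheme: 1.8.x means Java 8
    esac
    # Keep only the leading component before the first '.', '_' or '-'
    printf '%s\n' "${version%%[._-]*}"
}

if command -v java >/dev/null 2>&1; then
    # `java -version` writes to stderr, so redirect it to stdout
    major=$(java_major_version "$(java -version 2>&1)")
    if [ "${major:-0}" -ge 11 ]; then
        echo "Java $major found: OK for Nextflow"
    else
        echo "Java $major found: Nextflow needs Java 11 or later"
    fi
else
    echo "java not found: run 'module load java' first"
fi
```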

You should be in your home directory. If unsure, run the following command:

Code Block
cd ~

Info
Not familiar with the module function? Please review the Modules section of the Intro to HPC.

Finally we will create a folder which will contain all the exercises and code from today:

Code Block
mkdir -p $HOME/workshop/2024-2/session3
cd $HOME/workshop/2024-2/session3

Installing Nextflow for the first time

...

To install Nextflow for the first time, copy and paste the following block of code into your terminal (e.g., a PuTTY session that is already connected to the HPC) and hit 'enter':

Code Block
curl -s https://get.nextflow.io | bash
mv nextflow $HOME/bin

Line 1: This command downloads and assembles the parts of Nextflow. This step might take some time.
Line 2: When finished, the nextflow binary will be in the current folder, so it should be moved to your “bin” folder so it can be found later.
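Moving the binary only helps if $HOME/bin exists and is listed in your PATH. The following is a minimal sketch for checking the second condition; the function name dir_on_path is hypothetical, and whether $HOME/bin is on PATH by default on the HPC is an assumption that should be verified on your own account.

```shell
# Return success if the given directory is one of the entries in PATH.
dir_on_path() {
    case ":${PATH}:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

if dir_on_path "$HOME/bin"; then
    echo "$HOME/bin is on PATH"
else
    echo "$HOME/bin is not on PATH; add it, for example with:"
    echo "  export PATH=\"\$HOME/bin:\$PATH\""
fi
```

If $HOME/bin does not exist yet, creating it with `mkdir -p $HOME/bin` before the move avoids the nextflow file being renamed to "bin".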

...

Updating Nextflow

If you have previously installed Nextflow on the HPC, update it to the latest version by running:

Code Block
nextflow self-update

Check that your Nextflow installation worked

To verify that Nextflow is installed properly, you can run the following command:

Code Block
nextflow info

We will now run your first Nextflow pipeline locally, which is called Hello:

Code Block
mkdir -p $HOME/workshop/2024-2/session3/nftemp && cd $HOME/workshop/2024-2/session3/nftemp
nextflow run hello

Line 1: Make a temporary folder called nftemp for Nextflow to create files when it runs the hello pipeline; change directory to this newly created folder.
Line 2: Verify Nextflow is working.

You should see something like this:

...

If you got this output, well done! You have run your first Nextflow pipeline successfully.

Note

Troubleshooting:

Please note that if you have run the Hello pipeline before, you might need to update it to the latest version for it to run properly. To do so, you will need to pull the latest code first:

Code Block
nextflow pull hello
nextflow run hello

If you see the following error message:

Code Block
WARN: Cannot read project manifest – Cause: Remote resource not found ...

It is likely there is a typo in the command you provided (e.g. in the pipeline name), and the error message is telling you that Nextflow is unable to find a pipeline under that name. Check your spelling and resubmit.

Now that you have managed to run the hello pipeline, go back to your home directory and clean the test folder.

Code Block
cd $HOME
rm -rf $HOME/workshop/2024-2/session3/nftemp

Nextflow’s base configuration

A key Nextflow feature is the ability to decouple the workflow implementation, which describes the flow of data and operations to perform on that data, from the configuration settings required by the underlying execution platform.

This enables the workflow to be portable, allowing it to run on different computational platforms such as an institutional HPC or cloud infrastructure, without needing to modify the workflow implementation.

For instance, a user can configure Nextflow so it runs pipelines locally (i.e. on the computer where Nextflow is launched), which can be useful for developing and testing a pipeline script on your computer. This is the default setting in Nextflow.

Code Block
process { executor = 'local' }

You can also configure Nextflow to run on a cluster managed by a resource manager such as PBS Pro, which is the setting we will use on the HPC:

Code Block
process { executor = 'pbspro' }

...

Code Block

[[ -d $HOME/.nextflow ]] || mkdir -p $HOME/.nextflow

cat <<EOF > $HOME/.nextflow/config
singularity {
    cacheDir = '$HOME/.nextflow/NXF_SINGULARITY_CACHEDIR'
    autoMounts = true
}
conda {
    cacheDir = '$HOME/.nextflow/NXF_CONDA_CACHEDIR'
}
process {
  executor = 'pbspro'
  scratch = false
  cleanup = false
}
includeConfig '/work/datasets/reference/nextflow/qutgenome.config'
EOF

Line 1: Check whether a .nextflow directory already exists in your home directory, and create it if it does not.
Lines 3-17: Using the cat command, write the configuration text to $HOME/.nextflow/config, replacing any existing file. The text specifies, among other settings, the cache locations for Singularity and Conda.
What are the parameters you are setting?
Lines 4-7 set the directory where remote Singularity images are stored and direct Nextflow to automatically mount host paths in the executed container.
Lines 8-10 set the directory where Conda environments are stored.
Lines 11-15 set default directives for processes in your pipeline. Note that the executor is set to pbspro on line 12.
Line 16 includes a configuration file that provides the local paths to genome files required for pipelines such as nf-core/rnaseq.
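Because the cat redirection replaces any existing $HOME/.nextflow/config, you may want to keep a copy of an earlier file before running the block above. A minimal sketch follows; the function name backup_if_exists and the .bak.<timestamp> suffix are illustrative choices, not part of Nextflow.

```shell
# Copy an existing file to a timestamped backup; do nothing if it is absent.
# Prints the backup path when a copy is made.
backup_if_exists() {
    local target=$1
    if [ -f "$target" ]; then
        local backup="${target}.bak.$(date +%Y%m%d%H%M%S)"
        cp -p "$target" "$backup" && printf '%s\n' "$backup"
    fi
}

# Back up the current Nextflow configuration, if there is one
backup_if_exists "$HOME/.nextflow/config"
```

Running this before the cat command leaves the previous settings recoverable if the new configuration does not behave as expected.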

Info
More in depth information on Nextflow configuration is described here: https://www.nextflow.io/docs/latest/config.html.

Version	Old Version 11	New Version Current
Changes made by	Marie-Emilie Gauthier	Marie-Emilie Gauthier
Saved on	Sept 23, 2024	Sept 29, 2024