
Workflow management systems

Analysing data involves a sequence of tasks, referred to as a workflow or a pipeline. These workflows typically require executing multiple software packages, sometimes in different computing environments, such as a desktop or a compute cluster. Traditionally, these workflows have been joined together in scripts written in general-purpose programming languages such as Bash or Python. However, as workflows become larger and more complex, managing the programming logic and software becomes difficult.

Workflow Management Systems (WfMS) such as Snakemake, Galaxy, and Nextflow have been developed specifically to manage computational data-analysis workflows in fields such as bioinformatics, imaging, physics, and chemistry. These systems contain multiple features that simplify the development, monitoring, execution and sharing of pipelines, such as:

  • Run time management

  • Software management

  • Portability & Interoperability

  • Reproducibility

  • Re-entrancy
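To make the re-entrancy point concrete, the following is a minimal sketch of what a hand-written Bash pipeline has to do to skip steps that have already finished. The helper name run_step, the sample data, and the temporary folder are all hypothetical and only serve the illustration; a workflow management system provides this behaviour (and more robust checks) for you.

```shell
# Illustrative data in a throwaway folder (assumption: not part of any real pipeline).
workdir=$(mktemp -d)
printf 'sample_b\nsample_a\nsample_a\n' > "$workdir/raw.txt"

# Hypothetical helper: run a command only if its output file is missing or empty.
run_step() {
    local output="$1"
    shift
    if [ -s "$output" ]; then
        echo "skip: $output already exists"
    else
        "$@" > "$output"
        echo "done: $output"
    fi
}

# Two chained steps: sort the samples, then remove duplicates.
run_step "$workdir/sorted.txt" sort "$workdir/raw.txt"
run_step "$workdir/unique.txt" uniq "$workdir/sorted.txt"

# Re-running a finished step is skipped instead of repeated.
run_step "$workdir/sorted.txt" sort "$workdir/raw.txt"
```

Note that this naive check treats any non-empty output file as complete, even if the step crashed halfway. Handling such cases by hand is part of what becomes difficult as workflows grow.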


What is Nextflow?

  • Nextflow is free and open-source pipeline management software that enables scalable and reproducible scientific workflows. It allows the adaptation of pipelines written in the most common scripting languages.

  • Key features of Nextflow:

    • Reproducible → version control and the use of containers ensure the reproducibility of Nextflow pipelines

    • Portable → compute agnostic (e.g., HPC, cloud, desktop)

    • Scalable → run from a single sample to thousands of samples

    • Continuous checkpoints & re-entrancy → allows you to resume a pipeline from the last successfully executed step

    • Minimal digital literacy required → accessible to users with little programming experience

    • Active global community → more and more Nextflow pipelines are available (e.g., https://nf-co.re/pipelines)

...

Nextflow is a pipeline engine that can take advantage of the batch nature of the HPC environment to run bioinformatic workflows quickly and efficiently.

For more information about Nextflow, please visit the Nextflow website, "Nextflow - A DSL for parallel and scalable computational pipelines".

...

To install Nextflow for the first time, copy and paste the following block of code into your terminal (e.g., a PuTTY session that is already connected to the HPC) and press Enter:

Code Block
curl -s https://get.nextflow.io | bash
mkdir -p $HOME/bin && mv nextflow $HOME/bin/
  • Line 1: This command downloads and assembles the parts of Nextflow. This step might take some time.

  • Line 2: When finished, the nextflow binary will be in the current folder, so it should be moved to your “bin” folder ($HOME/bin) so that it can be found later.
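Moving the binary only helps if the shell searches $HOME/bin. The following is a minimal sketch for checking this, assuming a Bash-compatible shell; the helper name dir_in_path is hypothetical, and whether $HOME/bin is already on your PATH depends on how your account is set up.

```shell
# Hypothetical helper: succeed if the given directory is listed in PATH.
dir_in_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if ! dir_in_path "$HOME/bin"; then
    # Assumption: this only affects the current session. To make it permanent,
    # add the same export line to your shell start-up file (e.g., ~/.bashrc).
    export PATH="$HOME/bin:$PATH"
fi

# Print the location of the binary if the shell can find it.
command -v nextflow || echo "nextflow not found on PATH"
```

If the last command prints a path ending in bin/nextflow, the installation can be found from any folder.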

Updating Nextflow

If you have installed Nextflow before on the HPC then you will have to run:

...

Code Block
mkdir $HOME/nftemp && cd $HOME/nftemp
nextflow run hello
  • Line 1: Make a temporary folder called nftemp in which Nextflow can create files when it runs the hello pipeline, then change directory to this newly created folder.

  • Line 2: Verify Nextflow is working.

You should see something like this:

...

If you got this output, well done! You have run your first Nextflow pipeline successfully.

Troubleshooting: Please note that if you have run the hello pipeline before, your cached copy may be out of date. In that case, you might need to run the following command first and then run the pipeline again:

Code Block
nextflow pull hello

Now go back to your home directory and clean the test folder.

...

Code Block
[[ -d $HOME/.nextflow ]] || mkdir -p $HOME/.nextflow

cat <<EOF > $HOME/.nextflow/config
singularity {
    cacheDir = '$HOME/.nextflow/NXF_SINGULARITY_CACHEDIR'
    autoMounts = true
}
conda {
    cacheDir = '$HOME/.nextflow/NXF_CONDA_CACHEDIR'
}
process {
  executor = 'pbspro'
  scratch = false
  cleanup = false
}
includeConfig '/work/datasets/reference/nextflow/qutgenome.config'
EOF
  • Line 1: Check whether a .nextflow folder already exists in your home directory. Create it if it does not exist.

  • Line 2-15: Using the cat command, write the text into the $HOME/.nextflow/config file, which specifies the cache locations for Singularity and Conda, among other settings.

  • What are the parameters you are setting?

  • Line 3-6 set the directory where remote Singularity images are stored and direct Nextflow to automatically mount host paths in the executed container.

  • Line 7-9 set the directory where Conda environments are stored.

  • Line 10-14 set default directives for processes in your pipeline. Note that the executor is set to pbspro on line 11.

  • Line 15 provides the local path to a configuration file describing the genome files required for pipelines such as nf-core/rnaseq.
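One detail of the cat command above is easy to miss: because the EOF delimiter is not quoted, the shell replaces $HOME with your actual home directory before the text is written, so the config file ends up containing absolute paths. The following sketch contrasts the two behaviours using throwaway files in a temporary folder; the file names expanded.config and literal.config are hypothetical.

```shell
tmpdir=$(mktemp -d)

# Unquoted delimiter (as used above): $HOME is expanded before writing.
cat <<EOF > "$tmpdir/expanded.config"
cacheDir = '$HOME/.nextflow/NXF_CONDA_CACHEDIR'
EOF

# Quoted delimiter: the text is written literally, including the characters $HOME.
cat <<'EOF' > "$tmpdir/literal.config"
cacheDir = '$HOME/.nextflow/NXF_CONDA_CACHEDIR'
EOF

# Compare the two results.
cat "$tmpdir/expanded.config"
cat "$tmpdir/literal.config"
```

The first file shows your real home directory path, while the second still shows the literal text $HOME. After running the configuration step, you can inspect $HOME/.nextflow/config in the same way with cat to confirm that the paths were expanded.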

More in-depth information on Nextflow configuration is available at: https://www.nextflow.io/docs/latest/config.html.
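Because the configuration step redirects with ">", it overwrites any existing $HOME/.nextflow/config. If you have customised that file before, you may want to keep a copy first. The following is a minimal sketch; the helper name backup_if_exists and the .bak.<timestamp> naming scheme are assumptions, not a site convention.

```shell
# Hypothetical helper: copy a file to <file>.bak.<timestamp> if it exists,
# and print the name of the backup that was created.
backup_if_exists() {
    local target="$1"
    if [ -f "$target" ]; then
        local backup="$target.bak.$(date +%Y%m%d%H%M%S)"
        cp -p "$target" "$backup"
        echo "$backup"
    fi
}

# Prints nothing if there is no existing config file to back up.
backup_if_exists "$HOME/.nextflow/config"
```

Run this before the cat command that writes the new configuration, so that earlier settings can be restored or merged by hand if needed.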

...