Prerequisites

You will need a working knowledge of basic Linux/Unix commands to participate effectively in this workshop. If you do not have this, please attend the [Introduction to HPC] training first.

Getting started with Nextflow

What is a workflow and what are workflow management systems?
Why should I use a workflow management system?
What is Nextflow?
What are the main features of Nextflow?
What are the main components of a Nextflow script?

What is Nextflow?

Nextflow is free and open-source workflow management software that enables scalable and reproducible scientific workflows. It allows the adaptation of pipelines written in the most common scripting languages.
Key features of Nextflow:
- Reproducible → version control and use of containers ensure the reproducibility of Nextflow pipelines
- Portable → compute agnostic (e.g., HPC, cloud, desktop)
- Scalable → run from a single sample to thousands of samples
- Minimal digital literacy required → accessible to anyone
- Active global community → more and more Nextflow pipelines are available (e.g., https://nf-co.re/pipelines)

Nextflow is a pipeline engine that can take advantage of the batch nature of the HPC environment to run bioinformatics workflows quickly and efficiently.

For more information about Nextflow, please visit the Nextflow website: https://www.nextflow.io

Installing Nextflow

Nextflow is meant to run from your home folder on a Linux machine like the HPC.

First connect to your Lyra account.

Before we start using the HPC, let’s start an interactive session:

qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=1:mem=4gb

You should be in your home directory. If you are unsure, run the following command:

cd ~

To install Nextflow, copy and paste the following block of code into your terminal (e.g., a PuTTY session that is already connected to the HPC) and hit 'enter':

module load java
curl -s https://get.nextflow.io | bash
mkdir -p $HOME/bin && mv nextflow $HOME/bin/

Line 1: The module load command ensures Java is available.
Line 2: This command downloads and assembles the parts of Nextflow. This step might take some time.
Line 3: When finished, the nextflow binary will be in the current folder, so it should be moved to your “bin” folder where it can be found later.
The commands in the following steps make a temporary folder for Nextflow to create files in when it runs, verify that Nextflow is working, and then clean up.
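Moving the binary only helps if $HOME/bin is on your PATH. The following is a small sketch for checking this, not part of the original instructions; the helper name dir_on_path is made up for illustration.

```shell
# Hypothetical helper: succeed if directory $1 is one of the entries in PATH.
dir_on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if dir_on_path "$HOME/bin"; then
    echo "$HOME/bin is on PATH"
else
    echo "$HOME/bin is not on PATH"
fi

# command -v prints the location of nextflow if the shell can find it.
command -v nextflow || echo "nextflow was not found on PATH"
```

If the folder is reported as missing from PATH, logging out and back in, or adding it to PATH in your shell start-up file, may be needed before the nextflow command is found.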

To verify that Nextflow is installed properly, you can run a simple Nextflow pipeline called hello locally:

mkdir $HOME/nftemp && cd $HOME/nftemp
nextflow run hello

You should see something like this:

If you got this output, well done! You have run your first Nextflow pipeline successfully.

Now go back to your home directory and clean the test folder.

cd $HOME
rm -rf nftemp

Nextflow configuration

A key Nextflow feature is the ability to decouple the workflow implementation, which describes the flow of data and operations to perform on that data, from the configuration settings required by the underlying execution platform. This enables the workflow to be portable, allowing it to run on different computational platforms such as an institutional HPC or cloud infrastructure, without needing to modify the workflow implementation.

For instance, a user can configure Nextflow to run pipelines locally (i.e. on the computer where Nextflow is launched), which can be useful for developing and testing a pipeline script on your computer:

// default Nextflow settings
process {
  executor = 'local'
}

or configure Nextflow to run on a cluster such as a PBS Pro resource manager:

process {
  executor = 'pbspro'
}

Nextflow configuration is described in detail here: https://www.nextflow.io/docs/latest/config.html

The base configuration that is applied to every workflow you run is located in $HOME/.nextflow/config.
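The two executor settings above could also live side by side in one file as configuration profiles, selected at launch with -profile. This is a sketch under assumptions: the profile names local_dev and hpc and the scratch directory are hypothetical and not part of the QUT setup.

```shell
# Write an example nextflow.config into a scratch directory so nothing real is overwritten.
demo_dir="$(mktemp -d)"

# The quoted 'EOF' delimiter stops the shell from expanding anything inside the block.
cat <<'EOF' > "$demo_dir/nextflow.config"
profiles {
    // hypothetical profile for testing on the machine where Nextflow is launched
    local_dev {
        process.executor = 'local'
    }
    // hypothetical profile for submitting each process as a PBS Pro job
    hpc {
        process.executor = 'pbspro'
    }
}
EOF

cat "$demo_dir/nextflow.config"
```

A pipeline launched from that directory with -profile hpc would then submit its processes through PBS Pro, while -profile local_dev would run them on the launching machine, with no change to the workflow itself.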

Nextflow’s Default Configuration

Once you have installed Nextflow on Lyra, there are some settings that should be applied to take advantage of the HPC environment at QUT.

You can create a suitable config file for use on the QUT HPC by copying and pasting the following text into your Linux command line and hitting ‘enter’. This will make the necessary changes to your local account so that Nextflow can run correctly:

[[ -d $HOME/.nextflow ]] || mkdir -p $HOME/.nextflow
cat <<EOF > $HOME/.nextflow/config
singularity {
    cacheDir = '$HOME/.nextflow/NXF_SINGULARITY_CACHEDIR'
    autoMounts = true
}
conda {
    cacheDir = '$HOME/.nextflow/NXF_CONDA_CACHEDIR'
}
process {
  executor = 'pbspro'
  scratch = false
  cleanup = false
}
EOF

Line 1: Check whether a .nextflow directory already exists in your home directory, and create it if it does not.
Lines 2-15: Write the text into the file $HOME/.nextflow/config, which specifies the cache locations for Singularity and Conda.
What are the parameters?
Lines 3-6 set the directory where remote Singularity images are stored and direct Nextflow to automatically mount host paths in the executed container.
Lines 7-9 set the directory where Conda environments are stored.
Lines 10-14 set default directives for processes in your pipeline. Note that the executor is set to pbspro on line 11.
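One detail of the command above is worth noting: because the EOF delimiter is unquoted, the shell replaces $HOME with your actual home directory before the text is written, so the config file ends up containing absolute paths. The sketch below illustrates the difference on scratch files; the file names are arbitrary.

```shell
demo_dir="$(mktemp -d)"

# Unquoted delimiter: $HOME is expanded by the shell while writing the file.
cat <<EOF > "$demo_dir/expanded.txt"
cacheDir = '$HOME/.nextflow/NXF_SINGULARITY_CACHEDIR'
EOF

# Quoted delimiter: the text is written literally, dollar sign included.
cat <<'EOF' > "$demo_dir/literal.txt"
cacheDir = '$HOME/.nextflow/NXF_SINGULARITY_CACHEDIR'
EOF

cat "$demo_dir/expanded.txt" "$demo_dir/literal.txt"
```

To confirm what was written to the real file, run cat $HOME/.nextflow/config; the nextflow config command prints the configuration that Nextflow resolves from it.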

Nextflow pipeline repositories

nf-core

What is nf-core?

nf-core is a community-led project to develop a set of best-practice pipelines built using the Nextflow workflow management system. Pipelines are governed by a set of guidelines, enforced by community code reviews and automatic code testing. The diagram below showcases the key aspects of nf-core and is divided into three sections:

The Deploy section includes features like Stable pipelines, Centralized configs, List and update pipelines, and Download for offline use.
The Participate section highlights Documentation, Slack workspace, Twitter updates, and Hackathons.
The Develop section emphasizes the Starter template, Code guidelines, CI code linting and tests, and Helper tools.

What are nf-core pipelines?

nf-core pipelines are an organised collection of Nextflow scripts, other non-Nextflow scripts (written in any language), configuration files, software specifications, and documentation hosted on GitHub. There is generally a single pipeline for a given data and analysis type; for example, there is a single pipeline for bulk RNA-Seq. All nf-core pipelines are open source.

Searching for available nf-core pipelines

Go to https://nf-co.re/pipelines

Narrow the search by typing a relevant term, for example ‘rna-seq’:

Pipelines can be sorted by Latest release, Name or Stars:

Examples of pipelines used at QUT:

nf-core/ampliseq
nf-core/smrnaseq
nf-core/rnaseq
nf-core/sarek

nf-core support

For support with nf-core, see https://nf-co.re/join. For instance, there is a very active Slack community for nf-core users.

epi2me workflows

EPI2ME Labs maintains a collection of bioinformatics workflows tailored to Oxford Nanopore Technologies long-read sequencing data. They are curated and actively maintained by experts in long-read sequence analysis.


Examples of pipelines used at QUT:

wf-metagenomics

Running pipelines

Fetching pipeline code

The pull command allows you to download the latest version of a project from a GitHub repository, or to update it if that repository has already been downloaded.

nextflow pull nf-core/<pipeline>

Please do not run the command below. Just note that Nextflow will also fetch the pipeline code automatically the first time you run a pipeline with it:

nextflow run nf-core/<pipeline>

For reproducibility, it is good to explicitly reference the pipeline version number that you wish to use with the -revision/-r flag.

In the example below, we pull version 3.12.0 of the rnaseq pipeline:

nextflow pull nf-core/rnaseq -revision 3.12.0

We can see from the output we have the latest release.

Downloaded pipeline projects are stored in the folder $HOME/.nextflow/assets.
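To see which pipelines have been fetched, you can look inside the assets folder directly. A minimal sketch, assuming the usual <organisation>/<pipeline> layout under $HOME/.nextflow/assets; the function name list_assets is made up for illustration.

```shell
# Hypothetical helper: print one <organisation>/<pipeline> line per downloaded project.
# Takes an optional assets directory; defaults to the location described above.
list_assets() {
    assets_dir="${1:-$HOME/.nextflow/assets}"
    for project in "$assets_dir"/*/*/; do
        # Skip the unexpanded pattern when nothing has been downloaded yet.
        [ -d "$project" ] || continue
        project="${project%/}"
        org="${project%/*}"
        echo "${org##*/}/${project##*/}"
    done
}

list_assets
```

Nextflow's own nextflow list command reports the downloaded projects as well, so the helper is mainly useful for showing where the files live.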

Software requirements for pipelines

Nextflow pipeline software dependencies are specified using either Docker, Singularity or Conda. Nextflow itself handles downloading the containers and creating the Conda environments. The choice is set using the -profile {docker,singularity,conda} parameter when you run Nextflow. At QUT, we use Singularity, so we would specify: -profile singularity.

Test that the pipeline installed successfully

Pipelines generally include test code that can be run to make sure installation was successful.

From the command line

Run the following command from your home directory:

cd
nextflow run nf-core/smrnaseq -profile test,singularity --outdir results -r 2.1.0

This will download the smrnaseq pipeline and then run the test code. It should take ~20-30 minutes to run to completion.

It will first display the version of the pipeline which was downloaded: version 2.1.0

It will then list all the parameters that differ from the pipeline default.

Before running a process, it will download the required Singularity image.

When the Nextflow pipeline is run on the command line, the progress of the analysis is displayed in real time.

In the screenshot below, all the jobs which will be run are listed.

We can see that 4 jobs have run to completion:

INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv)
MIRNA_QUANT:PARSE_MATURE
MIRNA_QUANT:PARSE_HAIRPIN
GENOME_QUANT:INDEX_GENOME (genome.fa)

One Singularity image is being pulled.

This is a screenshot taken halfway through the analysis:

A message will appear when your job has run to completion.

Launching Nextflow using a PBS script
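
As a starting point, the test run shown earlier could be wrapped in a PBS job script along the following lines. This is a sketch under assumptions rather than a tested QUT template: the job name, the resource requests and the file name launch_nf.pbs are placeholders, with the select and walltime syntax borrowed from the interactive qsub command used earlier.

```shell
# Write a job script to the current directory; nothing is submitted by this block.
cat <<'EOF' > launch_nf.pbs
#!/bin/bash
#PBS -N nf_smrnaseq_test
#PBS -l select=1:ncpus=1:mem=4gb
#PBS -l walltime=10:00:00

# PBS starts jobs in the home directory; move to where qsub was run.
cd "$PBS_O_WORKDIR"

# Java is needed by Nextflow, as in the installation step.
module load java

nextflow run nf-core/smrnaseq -profile test,singularity --outdir results -r 2.1.0
EOF

echo "Submit with: qsub launch_nf.pbs"
```

The launcher job only coordinates the run. With executor = 'pbspro' in your config, each process is submitted as its own PBS job, so modest resources for the launcher are usually enough.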

Input files

Examples of samplesheet.csv

Parameters

Finding list of parameters available

Specifying parameters on the command line

Nextflow caching

Structure of work folder

Resume option

Nextflow pipeline outputs and PBS outputs

Results folder

Nextflow log, metrics and reports

PBS output

Troubleshooting

Common error messages when starting with Nextflow

Where to from now?

Carpentries course on Nextflow: https://carpentries-incubator.github.io/workflows-nextflow/instructor/01-getting-started-with-nextflow.html

Introduction to Nextflow