This instructional material was originally developed by Maely Gauthier in 2024 as part of the QUT eResearch infrastructure. It is free to distribute, but we require that you acknowledge QUT eResearch in any outputs (e.g. training, presentation slides, publications) that result from using this training material.
Some sections of this course were adapted from the Carpentry course: https://carpentries-incubator.github.io/workflows-nextflow/.
Aims
What will be covered during the workshop:
Prerequisites
1. Getting started with Nextflow
What is Nextflow?
Installing Nextflow
Nextflow’s base configuration
2. Nextflow pipeline repositories
nf-core
What is nf-core?
What are nf-core pipelines?
Searching for available nf-core pipelines
nf-core support
EPI2ME workflows
Prerequisites
You will require a basic knowledge of Linux/Unix commands to participate effectively in this workshop. If you do not have this background, please attend the following training first: [Introduction to HPC].
5. Parameters
Finding the list of available parameters
For nf-core pipelines, the tools implemented and the range of parameters available are generally described in the Usage section of the pipeline's documentation. Some parameters are required, others are optional.
Let’s have a look at the nf-core/rnaseq pipeline:
All the parameters available will also be listed under the Parameters section:
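You can also print the available parameters directly from the command line. As a sketch (using the pipeline version referenced in this workshop):

```bash
# Print the pipeline help text, which summarises the available parameters
nextflow run nf-core/rnaseq -r 3.14.0 --help
```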
Exercise 1
Using the Usage and Parameters sections, find out how many aligner options are available for the nf-core/rnaseq pipeline version 3.14.0.
Specifying parameters on the command line
Parameters are generally specified on the command line interface (CLI). Note that options for Nextflow itself (e.g. `-profile`, `-resume`) take a single dash, while pipeline parameters (e.g. `--input`, `--outdir`) take a double dash.
```bash
nextflow run nf-core/rnaseq -profile singularity -resume \
    --input samplesheet.csv \
    --outdir results \
    --genome GRCm38 \
    --aligner star_salmon \
    --extra_trimgalore_args "--quality 30 --clip_r1 10 --clip_r2 10 --three_prime_clip_r1 1 --three_prime_clip_r2 1 "
```
6. Nextflow caching
One of the core features of Nextflow is the ability to cache task executions and re-use them in subsequent runs to minimize duplicate work. Resumability is useful both for recovering from errors and for iteratively developing a pipeline.
Resume option
You can enable resumability in Nextflow with the `-resume` flag when launching a pipeline with `nextflow run`.
All task executions are automatically saved to the task cache, regardless of the `-resume` option (so that you always have the option to resume later).
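A typical error-recovery workflow looks like this (the pipeline and parameters below are illustrative):

```bash
# First run: suppose a task fails partway through the pipeline
nextflow run nf-core/rnaseq -profile singularity --input samplesheet.csv --outdir results

# After fixing the problem, relaunch with -resume:
# completed tasks are restored from the cache and only the remaining tasks run
nextflow run nf-core/rnaseq -profile singularity --input samplesheet.csv --outdir results -resume
```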
Structure of the work folder
When Nextflow runs, it assigns a unique ID to each task. This unique ID is used to create a separate execution directory, within the `work` directory, where the task is executed and its results are stored. A task's unique ID is generated as a 128-bit hash number.
When we resume a workflow, Nextflow uses this unique ID to check if:

- The working directory exists
- It contains a valid command exit status
- It contains the expected output files.
If these conditions are satisfied, the task execution is skipped and the previously computed outputs are applied.
When a task requires recomputation, i.e. when the conditions above are not fulfilled, the downstream tasks are automatically invalidated.
Therefore, if you modify some parts of your script or alter the input data and re-run with `-resume`, Nextflow will only execute the processes that have actually changed.
The execution of the unchanged processes will be skipped and the cached results used instead.
This helps a lot when testing or modifying part of your pipeline, as you do not have to re-execute it from scratch.
By default, the pipeline results are cached in the `work` directory, inside the directory where the pipeline is launched.
We can use the Bash `tree` command to list the contents of the work directory. Note: by default `tree` does not print hidden files (those beginning with a dot `.`). Use the `-a` option to view all files.
```bash
tree -a work
```
Example of work directory:
```
work/
├── 12
│   └── 5489f3c7dbd521c0e43f43b4c1f352
│       ├── .command.begin
│       ├── .command.err
│       ├── .command.log
│       ├── .command.out
│       ├── .command.run
│       ├── .command.sh
│       ├── .exitcode
│       └── temp33_1_2.fq.gz -> /home/training/data/yeast/reads/temp33_1_2.fq.gz
├── 3b
│   └── a3fb24ad3242e4cc8e5aa0c24d174b
│       ├── .command.begin
│       ├── .command.err
│       ├── .command.log
│       ├── .command.out
│       ├── .command.run
│       ├── .command.sh
│       ├── .exitcode
│       └── temp33_2_1.fq.gz -> /home/training/data/yeast/reads/temp33_2_1.fq.gz
├── 4c
│   └── 125b5e5a5ee144fa25dd9bccd467e9
│       ├── .command.begin
│       ├── .command.err
│       ├── .command.log
│       ├── .command.out
│       ├── .command.run
│       ├── .command.sh
│       ├── .exitcode
│       └── temp33_3_1.fq.gz -> /home/training/data/yeast/reads/temp33_3_1.fq.gz
├── 54
│   └── eb9d72e9ac24af8183de569ab0b977
│       ├── .command.begin
│       ├── .command.err
│       ├── .command.log
│       ├── .command.out
│       ├── .command.run
│       ├── .command.sh
│       ├── .exitcode
│       └── temp33_2_2.fq.gz -> /home/training/data/yeast/reads/temp33_2_2.fq.gz
├── e9
│   └── 31f28c291481342cc45d4e176a200a
│       ├── .command.begin
│       ├── .command.err
│       ├── .command.log
│       ├── .command.out
│       ├── .command.run
│       ├── .command.sh
│       ├── .exitcode
│       └── temp33_1_1.fq.gz -> /home/training/data/yeast/reads/temp33_1_1.fq.gz
└── fa
    └── cd3e49b63eadd6248aa357083763c1
        ├── .command.begin
        ├── .command.err
        ├── .command.log
        ├── .command.out
        ├── .command.run
        ├── .command.sh
        ├── .exitcode
        └── temp33_3_2.fq.gz -> /home/training/data/yeast/reads/temp33_3_2.fq.gz
```
Task execution directory
Within the `work` directory there are multiple task execution directories, one for each time a process is executed. These task directories are identified by the process execution hash. For example, the task directory `fa/cd3e49b63eadd6248aa357083763c1` would be the location for the process identified by the hash `fa/cd3e49`.
The task execution directory contains:
- `.command.sh`: The command script.
- `.command.run`: A Bash script generated by Nextflow to execute the `.command.sh` script, handling the necessary environment setup and command execution details.
- `.command.out`: The complete job standard output.
- `.command.err`: The complete job standard error.
- `.command.log`: The wrapper execution output.
- `.command.begin`: A file created as soon as the job is launched.
- `.exitcode`: A file containing the task exit code.
- Any task input files (symlinks)
- Any task output files
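These files are very useful for debugging a failed task. For example, using the task directory from the listing above:

```bash
# Show the exact command Nextflow ran for this task
cat work/fa/cd3e49b63eadd6248aa357083763c1/.command.sh

# Check whether the task succeeded (0 means success)
cat work/fa/cd3e49b63eadd6248aa357083763c1/.exitcode
```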
Specifying another work directory
Depending on your script, this work folder can take up a lot of disk space. You can specify another work directory using the command line option `-w`. Note: using a different work directory means that any jobs will need to re-run from the beginning.
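For example (the scratch path below is purely illustrative; use a location appropriate to your system):

```bash
# Store intermediate task directories on a scratch filesystem
nextflow run nf-core/rnaseq -profile singularity --input samplesheet.csv --outdir results \
    -w /scratch/$USER/nxf_work
```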
Clean the work directory
If you are sure you won't need to resume your pipeline execution, you can clean the work folder using the `nextflow clean` command. It is good practice to do so regularly.
```bash
nextflow clean [run_name|session_id] [options]
```
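A minimal sketch of common usage (run names can be found with `nextflow log`):

```bash
# List previous runs and their names
nextflow log

# Dry run: show which files would be removed for a given run
nextflow clean -n <run_name>

# Actually delete the work files for that run
nextflow clean -f <run_name>
```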
7. Nextflow pipeline outputs and PBS outputs
Results folder
The results are output to the folder specified in the nextflow.config file by the `outdir` parameter. It is generally set to `results`.
```groovy
// nextflow.config
params {
    outdir = 'results'
}
```
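Values set under `params` in nextflow.config are defaults: a parameter passed on the command line takes precedence. For example:

```bash
# Override the default outdir defined in nextflow.config
nextflow run nf-core/rnaseq -profile singularity --input samplesheet.csv --outdir my_results
```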
Nextflow log, metrics and reports
By default, Nextflow creates a log file called .nextflow.log in the directory where the pipeline is launched. This file is hidden, but you can see it using the command:
```bash
ls -a
```
If you rerun the pipeline in the same folder, the previous .nextflow.log will be renamed .nextflow.log.1 and a new .nextflow.log will be generated.
You can change the default location by specifying a different path with the `-log` option:

```bash
nextflow -log ~/code/nextflow.log run <pipeline_name>
```
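Nextflow can also generate execution reports and metrics when you launch a pipeline. A minimal sketch (the output file names here are arbitrary):

```bash
# Produce an HTML execution report, a trace file and a timeline for the run
nextflow run nf-core/rnaseq -profile singularity --input samplesheet.csv --outdir results \
    -with-report report.html \
    -with-trace trace.txt \
    -with-timeline timeline.html
```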
PBS output
8. Where to from now?
Nextflow offers free Fundamentals Training: https://training.nextflow.io/basic_training/
The Carpentries incubator also offers a Nextflow course: https://carpentries-incubator.github.io/workflows-nextflow/instructor/01-getting-started-with-nextflow.html