Prerequisites
You will require a basic knowledge of Linux/Unix commands to be able to participate effectively in this workshop. If you do not have this knowledge, please attend the following training: [Introduction to HPC].
Getting started with Nextflow
What is a workflow and what are workflow management systems?
Why should I use a workflow management system?
What is Nextflow?
What are the main features of Nextflow?
What are the main components of a Nextflow script?
What is Nextflow?
Nextflow is a free and open-source pipeline management software that enables scalable and reproducible scientific workflows. It allows the adaptation of pipelines written in the most common scripting languages.
Key features of Nextflow:
Reproducible → version control and use of containers ensure the reproducibility of Nextflow pipelines
Portable → compute agnostic (e.g., HPC, cloud, desktop)
Scalable → run from a single to thousands of samples
Minimal digital literacy → accessible to anyone
Active global community → more and more Nextflow pipelines are available (e.g., https://nf-co.re/pipelines)
...
Nextflow is a pipeline engine that can take advantage of the batch nature of the HPC environment to run bioinformatics workflows quickly and efficiently.
For more information about Nextflow, please visit Nextflow - A DSL for parallel and scalable computational pipelines
Installing Nextflow
Nextflow is meant to run from your home folder on a Linux machine like the HPC.
First connect to your Lyra account.
Before we start using the HPC, let’s start an interactive session:
Code Block |
---|
qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=1:mem=4gb |
You should be in your home directory. If unsure, you can run the following command:
Code Block |
---|
cd ~ |
To install Nextflow, copy and paste the following block of code into your terminal (i.e., the PuTTY window that is already connected to the HPC) and hit 'enter':
Code Block |
---|
module load java
curl -s https://get.nextflow.io | bash
mkdir -p $HOME/bin && mv nextflow $HOME/bin/ |
Line 1: The module load command is necessary to ensure Java is available.
Line 2: This command downloads and assembles the parts of Nextflow. This step might take some time.
Line 3: When finished, the nextflow binary will be in the current folder, so it should be moved to your “bin” folder so it can be found later.
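Line 3 only helps if your “bin” folder is on your PATH. The snippet below is a minimal sketch of how you could check this; `on_path` is a hypothetical helper written for this example, not a Nextflow or HPC command, and adding the folder for the current session only is an assumption about what you want.

```shell
# Return success if the directory given as $1 is listed in PATH.
# (on_path is a hypothetical helper, not part of Nextflow.)
on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

if on_path "$HOME/bin"; then
    echo "$HOME/bin is on your PATH"
else
    # Assumption: adding the folder for the current session is enough;
    # put the same line in ~/.bashrc to make it permanent.
    export PATH="$HOME/bin:$PATH"
    echo "Added $HOME/bin to PATH for this session"
fi

# Report where the shell finds nextflow, if anywhere.
command -v nextflow || echo "nextflow not found on PATH"
```

If the last command prints a path ending in `bin/nextflow`, the shell can find the binary from any folder.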
To verify that Nextflow is installed properly, you can run a simple Nextflow pipeline called Hello locally:
Code Block |
---|
mkdir $HOME/nftemp && cd $HOME/nftemp
nextflow run hello |
Line 1: Make a temporary folder for Nextflow to create files when it runs.
Line 2: Verify Nextflow is working.
You should see something like this:
...
If you got this output, well done! You have run your first Nextflow pipeline successfully.
Now go back to your home directory and clean the test folder.
Code Block |
---|
cd $HOME
rm -rf nftemp |
Nextflow’s base configuration
A key Nextflow feature is the ability to decouple the workflow implementation, which describes the flow of data and operations to perform on that data, from the configuration settings required by the underlying execution platform. This enables the workflow to be portable, allowing it to run on different computational platforms such as an institutional HPC or cloud infrastructure, without needing to modify the workflow implementation.
For instance, a user can configure Nextflow so it runs the pipelines locally (i.e. on the computer where Nextflow is launched), which can be useful for developing and testing a pipeline script on your computer:
Code Block |
---|
// default Nextflow settings
process {
executor = 'local'
} |
or configure Nextflow to run on a cluster such as a PBS Pro resource manager:
Code Block |
---|
process {
executor = 'pbspro'
} |
The base configuration that is applied to every Nextflow workflow you run is located in $HOME/.nextflow/config.
Once you have installed Nextflow on Lyra, there are some settings that should be applied to your $HOME/.nextflow/config to take advantage of the HPC environment at QUT.
To create a suitable config file for use on the QUT HPC, copy and paste the following text into your Linux command line and hit ‘enter’. This will make the necessary changes to your local account so that Nextflow can run correctly:
Code Block |
---|
[[ -d $HOME/.nextflow ]] || mkdir -p $HOME/.nextflow
cat <<EOF > $HOME/.nextflow/config
singularity {
cacheDir = '$HOME/.nextflow/NXF_SINGULARITY_CACHEDIR'
autoMounts = true
}
conda {
cacheDir = '$HOME/.nextflow/NXF_CONDA_CACHEDIR'
}
process {
executor = 'pbspro'
scratch = false
cleanup = false
}
includeConfig '/work/datasets/reference/nextflow/qutgenome.config'
EOF |
Line 1: Check if a .nextflow directory already exists in your home directory, and create it if it does not.
Lines 2-15: Using the cat command, write the text into the .nextflow/config file (replacing any existing content). It specifies the cache locations for Singularity and Conda.
What are the parameters you are setting?
Lines 3-6 set the directory where remote Singularity images are stored and direct Nextflow to automatically mount host paths in the executed container.
Lines 7-9 set the directory where Conda environments are stored.
Lines 10-14 set default directives for processes in your pipeline. Note that the executor is set to pbspro on line 11.
Line 15 includes a local config file that provides the paths to genome files required for pipelines such as nf-core/rnaseq.
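Because the `cat ... > $HOME/.nextflow/config` redirect overwrites whatever is already in that file, you may want to keep a copy of an existing config first. The following is a minimal sketch; `backup_config` is a hypothetical helper written for this example, and the timestamped `.bak` naming is an assumption.

```shell
# Copy an existing config to a timestamped backup before it is overwritten.
# $1: path to the config file.
# (backup_config is a hypothetical helper, not part of Nextflow.)
backup_config() {
    config="$1"
    if [ -f "$config" ]; then
        backup="$config.$(date +%Y%m%d%H%M%S).bak"
        cp "$config" "$backup" && echo "Backed up to $backup"
    else
        echo "No existing config at $config"
    fi
}

backup_config "$HOME/.nextflow/config"
```

Run this before the block that creates the config file. If no config exists yet, it only prints a message.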
More in-depth information on Nextflow configuration is available here: https://www.nextflow.io/docs/latest/config.html.
This instructional material was originally developed by Maely Gauthier in 2024 as part of the QUT eResearch infrastructure. It is free to distribute but we just require that you acknowledge eResearch for any outputs (e.g. training, presentation slides, publications) that might result from using this training material.
Some sections of this course were adapted from the Carpentry course: https://carpentries-incubator.github.io/workflows-nextflow/.
Aims
Learn what Nextflow is
Install and configure Nextflow
Find pipelines on repositories (e.g. nf-core and epi2me)
Run pipelines using either the command line or a PBS script
Understand input and parameter specifications
Understand the concept of caching and the resume function
Understand how Nextflow pipelines output results
What will be covered during the workshop
1. Getting started with Nextflow
What is Nextflow?
Installing Nextflow
Nextflow’s base configuration
2. Nextflow pipeline repositories
nf-core
What is nf-core?
...
nf-core is a community-led project to develop a set of best-practice pipelines built using the Nextflow workflow management system. Pipelines are governed by a set of guidelines, enforced by community code reviews and automatic code testing. The diagram below showcases the key aspects of nf-core and is divided into three sections:
The Deploy section includes features like Stable pipelines, Centralized configs, List and update pipelines, and Download for offline use.
The Participate section highlights Documentation, Slack workspace, Twitter updates, and Hackathons.
The Develop section emphasizes the Starter template, Code guidelines, CI code linting and tests, and Helper tools.
...
What are nf-core pipelines?
...
Searching for available nf-core
...
Go to https://nf-co.re/pipelines
Narrow search by typing relevant term, for example ‘rna-seq’:
...
Pipelines can be sorted by Latest release, Name or Stars:
...
pipelines
...
nf-core/ampliseq
nf-core/rnaseq
- nf-core/sarek
nf-core
...
support
...
epi2me workflows
EPI2ME Labs maintains a collection of bioinformatics workflows tailored to Oxford Nanopore Technologies long-read sequencing data. They are curated and actively maintained by experts in long-read sequence analysis.
Examples of pipelines used at QUT:
3. Running pipelines
Fetching pipeline
...
The pull command allows you to download the latest version of a project from a GitHub repository, or to update it if that repository has previously been downloaded to your home directory.
Please note that Nextflow will also automatically fetch the pipeline code when you run the command below for the first time:
Code Block |
---|
nextflow run nf-core/<pipeline> |
For reproducibility, it is good practice to explicitly reference the pipeline version number that you wish to use with the -revision/-r flag.
In the example below we are pulling the rnaseq pipeline version 3.12.0:
Code Block |
---|
nextflow pull nf-core/rnaseq -revision 3.12.0 |
We can see from the output we have the latest release.
Downloaded pipeline projects are stored in the folder $HOME/.nextflow/assets.
code
...
nextflow pull nf-core/<pipeline>
Software requirements for pipelines
Nextflow pipeline software dependencies are specified using either Docker, Singularity or Conda. Nextflow handles the downloading of containers and the creation of Conda environments. This is set using the -profile {docker,singularity,conda} parameter when you run Nextflow.
At QUT, we use Singularity, so we would specify: -profile singularity.
...
Install and test that the pipeline installed successfully
...
From the command line
Run the following command from your home directory:
Code Block |
---|
cd
nextflow run nf-core/smrnaseq -profile test,singularity --outdir results -r 2.1.0 |
This will download the smrnaseq pipeline and then run the test code. It should take ~20-30 minutes to run to completion.
It will first display the version of the pipeline which was downloaded: version 2.1.0
It will then list all the parameters that differ from the pipeline default.
...
Before running a process, it will download the required singularity image.
When you run the Nextflow pipeline on the command line, the progress of the analysis is displayed in real time.
In the screenshot below, all the jobs which will be run are listed.
We can see that 4 jobs have run to completion:
INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv)
MIRNA_QUANT:PARSE_MATURE
MIRNA_QUANT:PARSE_HAIRPIN
GENOME_QUANT:INDEX_GENOME (genome.fa)
One singularity image is being pulled
...
This is a screenshot taken half way through the analysis:
...
This is the output you should get when your Nextflow job has run to completion.
At the bottom, the message ‘Pipeline completed successfully’ will be printed along with the duration, the CPU hours and the number of jobs that ran to completion.
...
Launching Nextflow using a PBS script
[Get them to run a PBS script themselves]
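As a starting point, the sketch below writes a small PBS launch script for the smrnaseq test run shown earlier. The job name, the file name `launch_nextflow.pbs` and the resource requests (2 CPUs, 4 GB, 24 hours) are assumptions chosen for illustration, not QUT recommendations. Adjust them to your own job.

```shell
# Write a PBS launch script to the current folder.
# Assumptions: file name, job name and resource values are illustrative only.
cat <<'EOF' > launch_nextflow.pbs
#!/bin/bash -l
#PBS -N nf_smrnaseq
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

# Run from the folder where the job was submitted.
cd "$PBS_O_WORKDIR"

# Java is needed by Nextflow (same module as in the installation step).
module load java

nextflow run nf-core/smrnaseq -profile test,singularity --outdir results -r 2.1.0
EOF

echo "Submit with: qsub launch_nextflow.pbs"
```

With the executor set to pbspro in your base configuration, this job only runs the Nextflow head process, which in turn submits each pipeline task as its own PBS job. The walltime therefore needs to cover the whole pipeline run.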
Input specifications
Samplesheet input
Nextflow pipelines generally need an input file, often referred to as a samplesheet, which contains information about the samples you would like to analyse.
The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first columns to match those required by the pipeline.
The minimum information required will vary and will be specified in the usage section of the pipeline that you want to run.
When running Nextflow, use this parameter to specify the samplesheet location: --input '[path to samplesheet file]'
The samplesheet has to be a comma-separated file with a minimum set of columns (which will vary depending on the pipeline you want to run), and a header row.
Examples of samplesheets
For the nf-core/smrnaseq pipeline, the samplesheet has to be a comma-separated file with the following 2 columns.
...
Column | Description |
---|---|
sample | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (_). |
fastq_1 | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension “.fastq.gz” or “.fq.gz”. |
Column names have to be specified in a header row as shown in the samplesheet example below:
sample,fastq_1
Clone1_N1,s3://ngi-igenomes/test-data/smrnaseq/C1-N1-R1_S4_L001_R1_001.fastq.gz
Clone1_N3,s3://ngi-igenomes/test-data/smrnaseq/C1-N3-R1_S6_L001_R1_001.fastq.gz
Clone9_N1,s3://ngi-igenomes/test-data/smrnaseq/C9-N1-R1_S7_L001_R1_001.fastq.gz
Clone9_N2,s3://ngi-igenomes/test-data/smrnaseq/C9-N2-R1_S8_L001_R1_001.fastq.gz
Clone9_N3,s3://ngi-igenomes/test-data/smrnaseq/C9-N3-R1_S9_L001_R1_001.fastq.gz
Control_N1,s3://ngi-igenomes/test-data/smrnaseq/Ctl-N1-R1_S1_L001_R1_001.fastq.gz
Control_N2,s3://ngi-igenomes/test-data/smrnaseq/Ctl-N2-R1_S2_L001_R1_001.fastq.gz
Control_N3,s3://ngi-igenomes/test-data/smrnaseq/Ctl-N3-R1_S3_L001_R1_001.fastq.gz
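Since the first columns must match what the pipeline expects, it can help to check the header row before launching a run. The following is a minimal sketch; `check_header` is a hypothetical helper written for this example, and `samplesheet_demo.csv` is a cut-down copy of the samplesheet above.

```shell
# Succeed if the first line of the samplesheet ($1) starts with the
# expected comma-separated column names ($2).
# (check_header is a hypothetical helper, not part of Nextflow or nf-core.)
check_header() {
    header=$(head -n 1 "$1" | tr -d '\r')
    case "$header" in
        "$2"|"$2",*) echo "OK: header is '$header'" ;;
        *) echo "Unexpected header: '$header' (expected '$2' first)"; return 1 ;;
    esac
}

# Demo file with the header and first row of the smrnaseq example.
cat <<'EOF' > samplesheet_demo.csv
sample,fastq_1
Clone1_N1,s3://ngi-igenomes/test-data/smrnaseq/C1-N1-R1_S4_L001_R1_001.fastq.gz
EOF

check_header samplesheet_demo.csv "sample,fastq_1"
```

The `tr -d '\r'` step strips Windows line endings, which are a common cause of header mismatches when a samplesheet has been edited in a spreadsheet program.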
For the nf-core/rnaseq pipeline, the samplesheet has to be a comma-separated file with the following 4 columns:
...
Column | Description |
---|---|
sample | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (_). |
fastq_1 | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension “.fastq.gz” or “.fq.gz”. |
fastq_2 | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension “.fastq.gz” or “.fq.gz”. |
strandedness | Sample strand-specificity. Must be one of unstranded, forward, reverse or auto. |
Column names have to be specified in a header row as shown in the samplesheet example below:
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto
Please note that in this example, the same sample (CONTROL_REP1) was sequenced across 3 lanes. The nf-core/rnaseq pipeline will concatenate the raw reads before performing any downstream analysis.
Exercise 1
The following samplesheet file for the nf-core/rnaseq pipeline consisting of both single- and paired-end data is ready for analysis.
How many samples does it have in total?
How many are single-end and how many are paired-end?
Code Block | ||
---|---|---|
| ||
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,forward
CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz,forward
CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz,forward
TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz,,reverse
TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz,,reverse
TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz,,reverse
TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz,,reverse |
Expand | ||
---|---|---|
| ||
There are 6 samples in total, as |
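To check the answer yourself, you could count the unique sample names from the command line. The sketch below assumes the samplesheet has been saved as `samplesheet_ex1.csv` (an arbitrary file name) and that each sample is either entirely paired-end or entirely single-end; `count_samples` is a hypothetical helper written for this example.

```shell
# Write the Exercise 1 samplesheet to a file (the file name is arbitrary).
cat <<'EOF' > samplesheet_ex1.csv
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,forward
CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz,forward
CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz,forward
TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz,,reverse
TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz,,reverse
TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz,,reverse
TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz,,reverse
EOF

# Count unique sample names, split by whether the fastq_2 column is filled in.
# Assumption: a given sample is either all paired-end or all single-end.
count_samples() {
    awk -F',' '
        NR > 1 && $3 != "" { paired[$1] = 1 }
        NR > 1 && $3 == "" { single[$1] = 1 }
        END {
            p = 0; s = 0
            for (k in paired) p++
            for (k in single) s++
            printf "total=%d paired=%d single=%d\n", p + s, p, s
        }' "$1"
}

count_samples samplesheet_ex1.csv
# prints: total=6 paired=3 single=3
```

The file has 7 data rows but only 6 unique sample names, because TREATMENT_REP3 appears on two rows (lanes L003 and L004).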
Exercise 2
Find the minimum columns required in the samplesheet to run nf-core/ampliseq.
Expand | ||
---|---|---|
| ||
You will need to go to the usage page of nf-core/ampliseq, which can be found at https://nf-co.re/ampliseq/2.9.0/docs/usage#samplesheet-input (make sure you are using the latest version of the pipeline). The input specification section will specify that the samplesheet must minimally contain 2 columns: |
Input folder
Some pipelines, like nf-core/ampliseq, let you directly specify the path to the folder that contains your input FASTQ files, as an alternative to using a samplesheet.
For example:
Code Block | ||
---|---|---|
| ||
--input_folder 'path/to/data/' |
File names must follow a specific pattern. The default is /*_R{1,2}_001.fastq.gz, but this can be adjusted with the --extension parameter.
For example, the following files in the folder data would be processed as sample1 and sample2:
Code Block | ||
---|---|---|
| ||
data
|-sample1_1_L001_R1_001.fastq.gz
|-sample1_1_L001_R2_001.fastq.gz
|-sample2_1_L001_R1_001.fastq.gz
|-sample2_1_L001_R2_001.fastq.gz |
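Before launching a run, you could check which files in a folder match the default pattern. This is a minimal sketch using plain shell pattern matching; it does not reproduce how the pipeline itself parses file names. `list_matching_fastq` is a hypothetical helper, and `demo_data` with its empty files exists only for the demonstration.

```shell
# Print each file in folder $1 whose name ends in _R1_001.fastq.gz or
# _R2_001.fastq.gz (the default pattern described above).
# (list_matching_fastq is a hypothetical helper, not part of nf-core/ampliseq.)
list_matching_fastq() {
    for f in "$1"/*; do
        case "$f" in
            *_R1_001.fastq.gz|*_R2_001.fastq.gz) basename "$f" ;;
        esac
    done
}

# Demo folder with empty placeholder files; the last one does not match.
mkdir -p demo_data
touch demo_data/sample1_1_L001_R1_001.fastq.gz \
      demo_data/sample1_1_L001_R2_001.fastq.gz \
      demo_data/sample2_1_L001_R1_001.fastq.gz \
      demo_data/sample2_1_L001_R2_001.fastq.gz \
      demo_data/sample3.fastq.gz

list_matching_fastq demo_data
```

Files that are not listed, such as `sample3.fastq.gz` here, would need to be renamed, or the pattern adjusted with the --extension parameter.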
All sequencing data should originate from one sequencing run, because processing relies on run-specific error models that are unreliable when data from several sequencing runs are mixed. Sequencing data originating from multiple sequencing runs additionally requires the parameter --multiple_sequencing_runs and a specific folder structure, for example:
Code Block | ||
---|---|---|
| ||
data
|-runA
|  |-sample1_1_L001_R1_001.fastq.gz
|  |-sample1_1_L001_R2_001.fastq.gz
|  |-sample2_1_L001_R1_001.fastq.gz
|  |-sample2_1_L001_R2_001.fastq.gz
|
|-runB
   |-sample3_1_L001_R1_001.fastq.gz
   |-sample3_1_L001_R2_001.fastq.gz
   |-sample4_1_L001_R1_001.fastq.gz
   |-sample4_1_L001_R2_001.fastq.gz |
Where sample1 and sample2 were sequenced in one sequencing run and sample3 and sample4 in another sequencing run.
Parameters
Finding list of parameters available
For the nf-core pipelines, the tools implemented and the range of parameters available are generally described in the Usage section. Some of the parameters will be required, others optional.
Let’s have a look at the nf-core/rnaseq pipeline:
...
All the parameters available will also be listed under the Parameters section:
...
Exercise 1
Using the usage and parameters sections, find how many aligner options are available for the nf-core/rnaseq pipeline version 3.14.0.
Expand | ||
---|---|---|
| ||
There are 3 aligner algorithms available: 'star_salmon', 'star_rsem' and 'hisat2'. |
Specifying parameters on the command line
Parameters are generally specified on the CLI (i.e. command line interface).
Code Block |
---|
nextflow run nf-core/rnaseq -profile singularity -resume \
--input samplesheet.csv \
--outdir results \
--genome GRCm38 \
--aligner star_salmon \
--extra_trimgalore_args "--quality 30 --clip_r1 10 --clip_r2 10 --three_prime_clip_r1 1 --three_prime_clip_r2 1 " |
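When a command line grows this long, the same pipeline parameters can be kept in a file and passed to Nextflow with the `-params-file` option, which accepts YAML or JSON. The sketch below writes such a file; the name `params.yaml` is an assumption, and only four of the parameters from the command above are included to keep it short.

```shell
# Write the pipeline parameters (the double-dash options) to a YAML file.
# Assumption: the file name params.yaml is arbitrary.
cat <<'EOF' > params.yaml
input: samplesheet.csv
outdir: results
genome: GRCm38
aligner: star_salmon
EOF

# The run command then becomes (shown as a comment so nothing is launched):
#   nextflow run nf-core/rnaseq -profile singularity -resume -params-file params.yaml
cat params.yaml
```

Note that only pipeline parameters go in the file. Nextflow's own single-dash options such as -profile and -resume stay on the command line.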
Nextflow caching
One of the core features of Nextflow is the ability to cache task executions and re-use them in subsequent runs to minimize duplicate work. Resumability is useful both for recovering from errors and for iteratively developing a pipeline.
You can enable resumability in Nextflow with the -resume flag when launching a pipeline with nextflow run.
All task executions are automatically saved to the task cache, regardless of the -resume option (so that you always have the option to resume later).
Structure of work folder
Resume option
Nextflow pipeline outputs and PBS outputs
Results folder
Nextflow log, metrics and reports
PBS output
Troubleshooting
Common error messages when starting with Nextflow
Where to from now?
Nextflow offers free Fundamentals Training: https://training.nextflow.io/basic_training/
...
Launching Nextflow using a PBS script
4. Input specifications
Samplesheet input
Examples of samplesheets
Exercise 1
Exercise 2
Input folder
5. Parameters
Finding list of parameters available
Exercise 1
Specifying parameters on the command line
6. Nextflow caching
Resume option
Structure of work folder
Task execution directory
Specifying another work directory
Clean the work directory
7. Nextflow pipeline outputs
Results folder
Nextflow log, metrics and reports
8. Where to from now?
Prerequisites
You will require a basic knowledge of Linux/Unix commands to be able to participate effectively in this workshop. For this workshop we assume participants have either attended the first 2 workshops, reviewed the materials provided in these workshops (if unable to attend) and are comfortable with it, or are already using the HPC.
You can watch some videos that go over the basics: https://mediahub.qut.edu.au/media/t/0_d0bsv333
Initial requirements
To be able to run these exercises, you’ll need:
An HPC account
PuTTY installed on your local computer
Access to your HPC home directory from your PC
Instructions for getting an HPC account are here: https://qutvirtual4.qut.edu.au/group/staff/research/conducting/facilities/advanced-research-computing-storage/supercomputing/getting-started-with-hpc
You’ll need PuTTY on your PC to access the HPC.
You can download PuTTY from here: https://the.earth.li/~sgtatham/putty/latest/w64/putty.exe
Then add the HPC (Lyra) address: lyra.qut.edu.au and then click ‘open’.
...
Setup Windows File Explorer to access your HPC home account. Follow the instructions here: