...
Some sections of this course were adapted from the Carpentries Incubator course: https://carpentries-incubator.github.io/workflows-nextflow/.
Prerequisites
You will require a basic knowledge of Linux/Unix commands to participate effectively in this workshop. If you don't have this background, please attend the following training first: [Introduction to HPC].
1. Getting started with Nextflow
What is a workflow and what are workflow management systems?
Why should I use a workflow management system?
What is Nextflow?
What are the main features of Nextflow?
What are the main components of a Nextflow script?
What is Nextflow?
Nextflow is free and open-source workflow management software that enables scalable and reproducible scientific workflows. It allows existing tools and scripts, written in the most common scripting languages, to be combined into pipelines.
Key features of Nextflow:
Reproducible → version control and use of containers ensure the reproducibility of Nextflow pipelines
Portable → compute agnostic (i.e., HPC, cloud, desktop)
Scalable → run from a single sample to thousands of samples
Minimal digital literacy → accessible to anyone
Active global community → more and more Nextflow pipelines are available (e.g., https://nf-co.re/pipelines)
...
For more information about Nextflow, please visit Nextflow - A DSL for parallel and scalable computational pipelines
Installing Nextflow
Note: if you have already downloaded Nextflow, you will only need to update it. If you already have a Nextflow config file, make a copy of it before creating the new one below.
Nextflow is meant to run from your home folder on a Linux machine like the HPC.
...
To install Nextflow, copy and paste the following block of code into your terminal (e.g., a PuTTY session connected to the HPC) and hit 'enter':
```bash
module load java
curl -s https://get.nextflow.io | bash
mv nextflow $HOME/bin
```
Line 1: The module load command ensures Java is available.
Line 2: This command downloads and assembles the parts of Nextflow; this step might take some time.
Line 3: When finished, the nextflow binary will be in the current folder, so it should be moved to your "bin" folder so it can be found later.
To verify that Nextflow is installed properly, you can run the following commands:
...
```bash
mkdir $HOME/nftemp && cd $HOME/nftemp
nextflow run hello
```
Line 1: Make a temporary folder for Nextflow to create files when it runs.
Line 2: Verify Nextflow is working.
You should see something like this:
...
Once you have confirmed the test run works, clean up the temporary folder:

```bash
cd $HOME
rm -rf nftemp
```
Nextflow’s base configuration
A key Nextflow feature is the ability to decouple the workflow implementation, which describes the flow of data and operations to perform on that data, from the configuration settings required by the underlying execution platform. This enables the workflow to be portable, allowing it to run on different computational platforms such as an institutional HPC or cloud infrastructure, without needing to modify the workflow implementation.
...
```bash
[[ -d $HOME/.nextflow ]] || mkdir -p $HOME/.nextflow
cat <<EOF > $HOME/.nextflow/config
singularity {
    cacheDir = '$HOME/.nextflow/NXF_SINGULARITY_CACHEDIR'
    autoMounts = true
}
conda {
    cacheDir = '$HOME/.nextflow/NXF_CONDA_CACHEDIR'
}
process {
    executor = 'pbspro'
    scratch = false
    cleanup = false
}
includeConfig '/work/datasets/reference/nextflow/qutgenome.config'
EOF
```
Line 1: Check whether a `.nextflow` directory already exists in your home directory, and create it if it does not.
Lines 2-16: Using the `cat` command, write the configuration into the newly created `.nextflow/config` file, which specifies the cache locations for your Singularity images and Conda environments.
What are the parameters you are setting?
Lines 3-6 set the directory where remote Singularity images are stored and direct Nextflow to automatically mount host paths in the executed container.
Lines 7-9 set the directory where Conda environments are stored.
Lines 10-14 set default directives for processes in your pipeline. Note that the executor is set to pbspro on line 11.
Line 15 provides the local path to genome files required for pipelines such as nf-core/rnaseq.
More in-depth information on Nextflow configuration is available here: https://www.nextflow.io/docs/latest/config.html.
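Settings can also be added or overridden for an individual run; for example, an extra configuration file (here a hypothetical `custom.config`) can be supplied with the `-c` option:

```bash
nextflow run nf-core/rnaseq -profile singularity \
    -c custom.config \
    --input samplesheet.csv \
    --outdir results
```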
2. Nextflow pipeline repositories
nf-core
What is nf-core?
nf-core is a community-led project to develop a set of best-practice pipelines built using the Nextflow workflow management system. Pipelines are governed by a set of guidelines, enforced by community code reviews and automatic code testing. The diagram below showcases the key aspects of nf-core and is divided into three sections:
the Deploy section includes features like Stable pipelines, Centralized configs, List and update pipelines, and Download for offline use.
the Participate section highlights Documentation, Slack workspace, Twitter updates, and Hackathons.
the Develop section emphasizes the Starter template, Code guidelines, CI code linting and tests, and Helper tools.
...
What are nf-core pipelines?
nf-core pipelines are an organised collection of Nextflow scripts, other non-Nextflow scripts (written in any language), configuration files, software specifications, and documentation hosted on GitHub. There is generally a single pipeline for a given data and analysis type; for example, there is a single pipeline for bulk RNA-seq. All nf-core pipelines are open source.
Searching for available nf-core pipelines
Go to https://nf-co.re/pipelines
...
nf-core/ampliseq
nf-core/rnaseq
nf-core/sarek
nf-core support
For support with nf-core pipelines, see https://nf-co.re/join. For instance, there is a very active Slack community for nf-core users.
EPI2ME workflows
EPI2ME Labs maintains a collection of bioinformatics workflows tailored to Oxford Nanopore Technologies long-read sequencing data. They are curated and actively maintained by experts in long-read sequence analysis.
...
Examples of pipelines used at QUT:
3. Running pipelines
Fetching pipeline code
The `nextflow pull` command allows you to download the latest version of a project from a GitHub repository, or to update it if that repository has previously been downloaded in your home directory.
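For example, to download (or update) the nf-core/smrnaseq pipeline used later in this section:

```bash
nextflow pull nf-core/smrnaseq
```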
...
Downloaded pipeline projects are stored in the folder `$HOME/.nextflow/assets`.
Software requirements for pipelines
Nextflow pipeline software dependencies are specified using either Docker, Singularity or Conda. Nextflow itself handles the downloading of containers and the creation of Conda environments. This is set using the `-profile {docker,singularity,conda}` option when you run Nextflow. At QUT, we use Singularity, so we would specify `-profile singularity`.
Install and test that the pipeline installed successfully
Pipelines generally include test code that can be run to make sure installation was successful.
From the command line
By running the Nextflow pipeline on the command line, the progress of the analysis is captured in real time.
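For example, most nf-core pipelines ship a small `test` profile that fetches a minimal dataset, so a command-line test run looks something like this (a sketch; recent nf-core releases also require `--outdir` to be set):

```bash
nextflow run nf-core/smrnaseq -profile test,singularity --outdir test_results
```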
...
At the bottom, the message 'Pipeline completed successfully' will be printed, along with the duration, the CPU hours, and the number of jobs that ran to completion.
...
Launching Nextflow using a PBS script
Launching the Nextflow pipeline from the command line lets us see what the pipeline does in real time, but you have to keep the terminal from which you launched the analysis open until the analysis is done.
...
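As a minimal sketch of what such a submission script might contain (the job name, resource requests and walltime below are assumptions to adapt to your project), `smrnaseq_test.sh` could look like:

```bash
#!/bin/bash
#PBS -N smrnaseq_test
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=24:00:00

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# Make Java available, then launch the pipeline test run
module load java
nextflow run nf-core/smrnaseq -profile test,singularity --outdir test_results
```

Make the script executable and submit it to the PBS scheduler: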
```bash
chmod +x smrnaseq_test.sh
qsub smrnaseq_test.sh
```
4. Input specifications
Samplesheet input
Nextflow pipelines generally need an input file, often referred to as a samplesheet, which contains information about the samples you would like to analyse.
...
The samplesheet has to be a comma-separated file with a header row and a minimum set of columns (which will vary depending on the pipeline you want to run).
Examples of samplesheets
For the nf-core/smrnaseq pipeline, the samplesheet has to be a comma-separated file with the following 2 columns.
...
Column names have to be specified in a header row, as shown in the samplesheet example below:
...
```
sample,fastq_1
Clone1_N1,s3://ngi-igenomes/test-data/smrnaseq/C1-N1-R1_S4_L001_R1_001.fastq.gz
Clone1_N3,s3://ngi-igenomes/test-data/smrnaseq/C1-N3-R1_S6_L001_R1_001.fastq.gz
Clone9_N1,s3://ngi-igenomes/test-data/smrnaseq/C9-N1-R1_S7_L001_R1_001.fastq.gz
Clone9_N2,s3://ngi-igenomes/test-data/smrnaseq/C9-N2-R1_S8_L001_R1_001.fastq.gz
Clone9_N3,s3://ngi-igenomes/test-data/smrnaseq/C9-N3-R1_S9_L001_R1_001.fastq.gz
Control_N1,s3://ngi-igenomes/test-data/smrnaseq/Ctl-N1-R1_S1_L001_R1_001.fastq.gz
Control_N2,s3://ngi-igenomes/test-data/smrnaseq/Ctl-N2-R1_S2_L001_R1_001.fastq.gz
Control_N3,s3://ngi-igenomes/test-data/smrnaseq/Ctl-N3-R1_S3_L001_R1_001.fastq.gz
```
...
For the nf-core/rnaseq pipeline, the samplesheet has to be a comma-separated file with the following 4 columns:
...
Column names have to be specified in a header row, as shown in the samplesheet example below:
...
```
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto
```
...
Please note that in this example, the same sample (CONTROL_REP1) was sequenced across 3 lanes. The nf-core/rnaseq pipeline will concatenate the raw reads before performing any downstream analysis.
Exercise 1
The following samplesheet file for the nf-core/rnaseq pipeline, consisting of both single- and paired-end data, is ready for analysis. How many samples will be analysed?
...
Answer
There are 6 samples in total.
Exercise 2
Find the minimal columns required in the samplesheet to run nf-core/ampliseq.
Answer
You will need to go to the usage page of nf-core/ampliseq, which can be found at https://nf-co.re/ampliseq/2.9.0/docs/usage#samplesheet-input (make sure you are using the latest version of the pipeline). The input specification section lists the 2 columns the samplesheet must minimally contain.
Input folder
Some pipelines, like nf-core/ampliseq, let you directly specify the path to the folder that contains your input FASTQ files, as an alternative to using a samplesheet.
...
Where `sample1` and `sample2` were sequenced in one sequencing run and `sample3` and `sample4` in another sequencing run.
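For instance, nf-core/ampliseq can be pointed at a folder directly (a sketch; the parameter names and primer sequences below are assumptions to check against the pipeline documentation for your version):

```bash
nextflow run nf-core/ampliseq -profile singularity \
    --input_folder ./fastq_files \
    --FW_primer GTGYCAGCMGCCGCGGTAA \
    --RV_primer GGACTACNVGGGTWTCTAAT \
    --outdir results
```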
5. Parameters
Finding the list of available parameters
For the nf-core pipelines, the tools implemented and the range of parameters available are generally described in the Usage section. Some of the parameters will be required, others optional.
...
All the parameters available will also be listed under the Parameters section:
...
Exercise 1
Using the usage and parameters sections, find how many aligner options are available for the nf-core/rnaseq pipeline version 3.14.0.
Answer
There are 3 aligner algorithms available: 'star_salmon', 'star_rsem' and 'hisat2'.
Specifying parameters on the command line
Parameters are generally specified on the CLI (i.e. command line interface).
```bash
nextflow run nf-core/rnaseq -profile singularity -resume \
    --input samplesheet.csv \
    --outdir results \
    --genome GRCm38 \
    --aligner star_salmon \
    --extra_trimgalore_args "--quality 30 --clip_r1 10 --clip_r2 10 --three_prime_clip_r1 1 --three_prime_clip_r2 1 "
```
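Note the difference between the two types of options: single-dash options (e.g. `-profile`, `-resume`) are core Nextflow options, whereas double-dash options (e.g. `--input`, `--aligner`) are parameters defined by the pipeline itself.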
6. Nextflow caching
One of the core features of Nextflow is the ability to cache task executions and re-use them in subsequent runs to minimize duplicate work. Resumability is useful both for recovering from errors and for iteratively developing a pipeline.
Resume option
You can enable resumability in Nextflow with the `-resume` flag when launching a pipeline with `nextflow run`. All task executions are automatically saved to the task cache, regardless of the `-resume` option (so that you always have the option to resume later).
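For example, re-launching a pipeline with the same parameters plus `-resume` will reuse all cached task results:

```bash
nextflow run nf-core/rnaseq -profile singularity \
    --input samplesheet.csv \
    --outdir results \
    -resume
```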
Structure of work folder
When Nextflow runs, it assigns a unique ID to each task. This unique ID is used to create a separate execution directory, within the `work` directory, where the task is executed and its results are stored. A task's unique ID is generated as a 128-bit hash number.
...
We can use the Bash `tree` command to list the contents of the work directory. Note: by default, `tree` does not print hidden files (those beginning with a dot `.`). Use the `-a` option to view all files.
```bash
tree -a work
```
Example of work directory:
```
work/
├── 12
│   └── 5489f3c7dbd521c0e43f43b4c1f352
│       ├── .command.begin
│       ├── .command.err
│       ├── .command.log
│       ├── .command.out
│       ├── .command.run
│       ├── .command.sh
│       ├── .exitcode
│       └── temp33_1_2.fq.gz -> /home/training/data/yeast/reads/temp33_1_2.fq.gz
├── 3b
│   └── a3fb24ad3242e4cc8e5aa0c24d174b
│       ├── .command.begin
│       ├── .command.err
│       ├── .command.log
│       ├── .command.out
│       ├── .command.run
│       ├── .command.sh
│       ├── .exitcode
│       └── temp33_2_1.fq.gz -> /home/training/data/yeast/reads/temp33_2_1.fq.gz
├── 4c
│   └── 125b5e5a5ee144fa25dd9bccd467e9
│       ├── .command.begin
│       ├── .command.err
│       ├── .command.log
│       ├── .command.out
│       ├── .command.run
│       ├── .command.sh
│       ├── .exitcode
│       └── temp33_3_1.fq.gz -> /home/training/data/yeast/reads/temp33_3_1.fq.gz
├── 54
│   └── eb9d72e9ac24af8183de569ab0b977
│       ├── .command.begin
│       ├── .command.err
│       ├── .command.log
│       ├── .command.out
│       ├── .command.run
│       ├── .command.sh
│       ├── .exitcode
│       └── temp33_2_2.fq.gz -> /home/training/data/yeast/reads/temp33_2_2.fq.gz
├── e9
│   └── 31f28c291481342cc45d4e176a200a
│       ├── .command.begin
│       ├── .command.err
│       ├── .command.log
│       ├── .command.out
│       ├── .command.run
│       ├── .command.sh
│       ├── .exitcode
│       └── temp33_1_1.fq.gz -> /home/training/data/yeast/reads/temp33_1_1.fq.gz
└── fa
    └── cd3e49b63eadd6248aa357083763c1
        ├── .command.begin
        ├── .command.err
        ├── .command.log
        ├── .command.out
        ├── .command.run
        ├── .command.sh
        ├── .exitcode
        └── temp33_3_2.fq.gz -> /home/training/data/yeast/reads/temp33_3_2.fq.gz
```
Task execution directory
Within the `work` directory there are multiple task execution directories, one for each time a process is executed. These task directories are identified by the process execution hash. For example, the task directory `fa/cd3e49b63eadd6248aa357083763c1` would be the location for the process identified by the hash `fa/cd3e49`.
...
Each task execution directory contains:
`.command.sh`: The command script.
`.command.run`: A Bash script generated by Nextflow to execute `.command.sh`, handling the necessary environment setup and execution details.
`.command.out`: The complete job standard output.
`.command.err`: The complete job standard error.
`.command.log`: The wrapper execution output.
`.command.begin`: A file created as soon as the job is launched.
`.exitcode`: A file containing the task exit code.
Any task input files (symlinks)
Any task output files
Specifying another work directory
Depending on your script, this work folder can take up a lot of disk space. You can specify another work directory using the command-line option `-w`. Note: using a different work directory means any jobs will need to re-run from the beginning.
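For example (the scratch path here is a hypothetical location; choose one appropriate for your account):

```bash
nextflow run nf-core/rnaseq -profile singularity \
    --input samplesheet.csv \
    --outdir results \
    -w /scratch/$USER/nxf_work
```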
Clean the work directory
If you are sure you won't resume your pipeline execution, you can clean the work folder using the `nextflow clean` command. It is good practice to do so regularly.
```bash
nextflow clean [run_name|session_id] [options]
```
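For example, you can list previous runs with `nextflow log`, preview what would be removed, and then delete it (the run name `boring_euler` is a placeholder; use one reported by `nextflow log`):

```bash
nextflow log                    # list previous runs and the names Nextflow assigned them
nextflow clean -n boring_euler  # dry run: print the files that would be removed
nextflow clean -f boring_euler  # force removal of that run's work files
```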
7. Nextflow pipeline outputs and PBS outputs
Results folder
The results are output to the folder specified in the `nextflow.config` file under the `outdir` parameter. It is generally set to `results`.
```groovy
// nextflow.config
params {
    outdir = 'results'
}
```
Nextflow log, metrics and reports
By default, Nextflow creates a log file in the working directory called `.nextflow.log`. This file is hidden, but you can see it using the command:
...
You can also tell Nextflow to write its log to a different location with the `-log` option:

```bash
nextflow -log ~/code/nextflow.log run
```
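Beyond the log, Nextflow can also produce execution reports and metrics at run time using built-in options, for example:

```bash
nextflow run nf-core/rnaseq -profile singularity \
    --input samplesheet.csv \
    --outdir results \
    -with-report report.html \
    -with-timeline timeline.html \
    -with-trace
```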
PBS output
8. Where to from now?
Nextflow offers free Fundamentals Training: https://training.nextflow.io/basic_training/
...