...
Nextflow is a free and open-source pipeline management software that enables scalable and reproducible scientific workflows. It allows the adaptation of pipelines written in the most common scripting languages.
Key features of Nextflow:
Reproducible → version control and containers ensure the reproducibility of Nextflow pipelines
Portable → compute agnostic (e.g., HPC, cloud, desktop)
Scalable → run from a single sample to thousands of samples
Minimal digital literacy required → accessible to users with limited programming experience
Active global community → a growing number of Nextflow pipelines are available (e.g., https://nf-co.re/pipelines)
...
To install Nextflow, copy and paste the following block of code into your terminal (e.g., a PuTTY session that is already connected) and press Enter:
Code Block |
---|
module load java
curl -s https://get.nextflow.io | bash
mv nextflow $HOME/bin |
Line 1: The module load command ensures that Java is available.
Line 2: This command downloads and assembles the parts of Nextflow. This step might take some time.
Line 3: When finished, the nextflow binary will be in the current folder, so it should be moved to your "bin" folder so that it can be found later.
To verify that Nextflow is installed properly, you can run a simple Nextflow pipeline called Hello locally:
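The nextflow binary can only "be found later" if the folder it was moved to is listed in your PATH. The following is a minimal sketch of how that could be checked. The helper name `dir_on_path` is made up for illustration, and the suggested `export` line assumes a bash-style login shell.

```shell
# Sketch: confirm that the folder holding the nextflow binary is on PATH.
# BIN_DIR mirrors the "$HOME/bin" destination used in the install step.
BIN_DIR="$HOME/bin"

# Hypothetical helper: succeeds (exit status 0) when the given
# directory is one of the colon-separated PATH entries.
dir_on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if dir_on_path "$BIN_DIR"; then
    echo "$BIN_DIR is on PATH"
else
    echo "$BIN_DIR is not on PATH; consider adding: export PATH=\"\$HOME/bin:\$PATH\""
fi
```

If the folder is reported as missing from PATH, the shell will not find `nextflow` by name even though the file was moved successfully.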
Code Block |
---|
mkdir $HOME/nftemp && cd $HOME/nftemp
nextflow run hello |
Line 1: Make a temporary folder for Nextflow to create files when it runs.
Line 2: Verify Nextflow is working.
You should see something like this:
...
Code Block |
---|
[[ -d $HOME/.nextflow ]] || mkdir -p $HOME/.nextflow
cat <<EOF > $HOME/.nextflow/config
singularity {
    cacheDir = '$HOME/.nextflow/NXF_SINGULARITY_CACHEDIR'
    autoMounts = true
}
conda {
    cacheDir = '$HOME/.nextflow/NXF_CONDA_CACHEDIR'
}
process {
    executor = 'pbspro'
    scratch = false
    cleanup = false
}
includeConfig '/work/datasets/reference/nextflow/qutgenome.config'
EOF |
Line 1: Check whether a .nextflow directory already exists in your home directory. Create it if it does not exist.
Lines 2-15: Using the cat command, write the text into the newly created .nextflow/config file, which specifies the cache locations for Singularity and Conda.
What are the parameters you are setting?
Lines 3-6 set the directory where remote Singularity images are stored and direct Nextflow to automatically mount host paths in the executed container.
Lines 7-9 set the directory where Conda environments are stored.
Lines 10-14 set default directives for processes in your pipeline. Note that the executor is set to pbspro on line 11.
Line 15 includes a config file that provides the local paths to genome files required for pipelines such as nf-core/rnaseq.
More in-depth information on Nextflow configuration is available here: https://www.nextflow.io/docs/latest/config.html.
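One detail of the block above is worth spelling out: because the here-document delimiter (EOF) is unquoted, the shell expands `$HOME` before the text is written, so the config file ends up containing an absolute path rather than the literal text `$HOME`. The sketch below illustrates this with a throwaway directory. `DEMO_DIR` is a made-up name and the file it writes is not your real `.nextflow/config`.

```shell
# Sketch: an unquoted here-document (<<EOF) expands $HOME when the file is written.
# DEMO_DIR is a temporary stand-in, not the real ~/.nextflow folder.
DEMO_DIR="$(mktemp -d)"

cat <<EOF > "$DEMO_DIR/config"
conda {
    cacheDir = '$HOME/.nextflow/NXF_CONDA_CACHEDIR'
}
EOF

# The written file holds the expanded home path, not the text "$HOME".
cat "$DEMO_DIR/config"
```

Single quotes inside a here-document do not prevent this expansion; only quoting the delimiter itself (for example `<<'EOF'`) would keep `$HOME` literal.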
...
Column names have to be specified in a header row, as shown in the samplesheet example below:
...
sample,fastq_1
Clone1_N1,s3://ngi-igenomes/test-data/smrnaseq/C1-N1-R1_S4_L001_R1_001.fastq.gz
Clone1_N3,s3://ngi-igenomes/test-data/smrnaseq/C1-N3-R1_S6_L001_R1_001.fastq.gz
Clone9_N1,s3://ngi-igenomes/test-data/smrnaseq/C9-N1-R1_S7_L001_R1_001.fastq.gz
Clone9_N2,s3://ngi-igenomes/test-data/smrnaseq/C9-N2-R1_S8_L001_R1_001.fastq.gz
Clone9_N3,s3://ngi-igenomes/test-data/smrnaseq/C9-N3-R1_S9_L001_R1_001.fastq.gz
Control_N1,s3://ngi-igenomes/test-data/smrnaseq/Ctl-N1-R1_S1_L001_R1_001.fastq.gz
Control_N2,s3://ngi-igenomes/test-data/smrnaseq/Ctl-N2-R1_S2_L001_R1_001.fastq.gz
Control_N3,s3://ngi-igenomes/test-data/smrnaseq/Ctl-N3-R1_S3_L001_R1_001.fastq.gz
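Since the header row and column layout matter, it can help to sanity-check a samplesheet before launching a pipeline. The following is a rough sketch, not part of any nf-core tooling: the function name `check_samplesheet` and the demo file contents are invented for illustration. It only checks that the header matches the expected text and that every row has as many fields as the header.

```shell
# Sketch: basic sanity check of a samplesheet before launching a pipeline.
# The demo file and its sample/file names are made up for illustration.
DEMO_SHEET="$(mktemp)"
cat > "$DEMO_SHEET" <<'EOF'
sample,fastq_1
sampleA,sampleA_R1.fastq.gz
sampleB,sampleB_R1.fastq.gz
EOF

# Hypothetical helper: prints "ok" when the header equals the expected
# header and every row has the same number of fields as the header.
check_samplesheet() {
    awk -F',' -v expected="$2" '
        NR == 1 {
            if ($0 != expected) { print "bad header: " $0; bad = 1 }
            n = NF
            next
        }
        NF != n { print "line " NR ": expected " n " fields, found " NF; bad = 1 }
        END {
            if (!bad) print "ok"
            exit bad + 0
        }
    ' "$1"
}

check_samplesheet "$DEMO_SHEET" "sample,fastq_1"
```

The pipelines perform their own, stricter validation; a check like this only catches simple formatting mistakes early.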
...
For the nf-core/rnaseq pipeline, the samplesheet has to be a comma-separated file with the following 4 columns:
...
Column names have to be specified in a header row, as shown in the samplesheet example below:
...
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto
...
Please note that in this example, the same sample (CONTROL_REP1) was sequenced across 3 lanes. The nf-core/rnaseq pipeline will concatenate the raw reads before performing any downstream analysis.
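To see which samples span several lanes, the rows can be counted per sample name. This is a small sketch using awk on a copy of the example samplesheet above. The function name `lanes_per_sample` is made up, and the count reflects rows in the file only, not what the pipeline does internally.

```shell
# Sketch: count how many rows (lanes) each sample has in a samplesheet.
LANE_SHEET="$(mktemp)"
cat > "$LANE_SHEET" <<'EOF'
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto
EOF

# Hypothetical helper: skip the header, tally rows by the first column,
# and print "<sample> <row count>" sorted by sample name.
lanes_per_sample() {
    awk -F',' 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' "$1" | sort
}

lanes_per_sample "$LANE_SHEET"   # prints: CONTROL_REP1 3
```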
...
Where sample1 and sample2 were sequenced in one sequencing run and sample3 and sample4 in another sequencing run.
Parameters
Finding the list of available parameters
...
For the nf-core pipelines, the tools implemented and the range of parameters available are generally described in the Usage section. Some of the parameters will be required, others optional.
Let’s have a look at the nf-core/rnaseq pipeline:
...
All the parameters available will also be listed under the Parameters section:
...
Exercise 1
Using the Usage and Parameters sections, find out how many aligner options are available for the nf-core/rnaseq pipeline version 3.14.0.
Answer: There are 3 aligner options available: 'star_salmon', 'star_rsem' and 'hisat2'.
Specifying parameters on the command line
Parameters are generally specified on the CLI (i.e. command line interface).
Code Block |
---|
nextflow run nf-core/rnaseq -profile singularity -resume \
--input samplesheet.csv \
--outdir results \
--genome GRCm38 \
--aligner star_salmon \
--extra_trimgalore_args "--quality 30 --clip_r1 10 --clip_r2 10 --three_prime_clip_r1 1 --three_prime_clip_r2 1" |
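As an alternative to a long command line, pipeline parameters can be kept in a file and passed with Nextflow's `-params-file` option. The sketch below writes such a file using the parameter values from the command above. The file name `params.yaml` is an arbitrary choice, and the final command is left as a comment because it needs a working Nextflow installation.

```shell
# Sketch: keep pipeline parameters in a YAML file instead of on the command line.
# The keys match the double-dash parameters used in the command above.
cat > params.yaml <<'EOF'
input: samplesheet.csv
outdir: results
genome: GRCm38
aligner: star_salmon
EOF

# Pipeline parameters (double dash) now come from the file, while
# Nextflow options (single dash) stay on the command line.
# Run only where Nextflow is installed:
# nextflow run nf-core/rnaseq -profile singularity -resume -params-file params.yaml
```

Keeping the parameters in a file makes it easier to record exactly which settings were used for a given run.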
Nextflow caching
Structure of work folder
...