...

  • Nextflow is free and open-source pipeline management software that enables scalable and reproducible scientific workflows. It allows the adaptation of pipelines written in the most common scripting languages.

  • Key features of Nextflow:

    • Reproducible → version control and the use of containers ensure the reproducibility of Nextflow pipelines

    • Portable → compute agnostic (e.g., HPC, cloud, desktop)

    • Scalable → run from a single sample to thousands of samples

    • Minimal digital literacy required → accessible to anyone

    • Active global community → a growing number of Nextflow pipelines are available (e.g., https://nf-co.re/pipelines)

...

To install Nextflow, copy and paste the following block of code into your terminal (e.g., a PuTTY session that is already connected to the HPC) and press Enter:

Code Block
module load java
curl -s https://get.nextflow.io | bash
mkdir -p $HOME/bin && mv nextflow $HOME/bin/
  • Line 1: The module load command is necessary to ensure Java is available.

  • Line 2: This command downloads and assembles the parts of Nextflow. This step might take some time.

  • Line 3: When finished, the nextflow binary will be in the current folder, so it should be moved to your “bin” folder so it can be found later.
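
The binary is only "found later" if the “bin” folder is on your PATH. The following is a minimal sketch for checking this, assuming your shell does not already add $HOME/bin to PATH; the helper name add_to_path is hypothetical and not part of Nextflow.

```shell
# Hypothetical helper: add a directory to PATH only if it is not already there.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, nothing to do
        *) PATH="$1:$PATH" ;;     # prepend so this copy is found first
    esac
}

add_to_path "$HOME/bin"
export PATH

# Report where the shell finds nextflow, if anywhere.
command -v nextflow || echo "nextflow not found on PATH"
```

To make the change permanent you would typically add the export to your shell start-up file, but whether that is needed depends on how your account is set up.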

To verify that Nextflow is installed properly, you can run the following command:

...

Code Block
mkdir $HOME/nftemp && cd $HOME/nftemp
nextflow run hello
  • Line 1: Make a temporary folder for Nextflow to create files when it runs.

  • Line 2: Verify Nextflow is working.

You should see something like this:

...

Code Block
[[ -d $HOME/.nextflow ]] || mkdir -p $HOME/.nextflow

cat <<EOF > $HOME/.nextflow/config
singularity {
    cacheDir = '$HOME/.nextflow/NXF_SINGULARITY_CACHEDIR'
    autoMounts = true
}
conda {
    cacheDir = '$HOME/.nextflow/NXF_CONDA_CACHEDIR'
}
process {
  executor = 'pbspro'
  scratch = false
  cleanup = false
}
includeConfig '/work/datasets/reference/nextflow/qutgenome.config'
EOF
  • Line 1: Check whether a .nextflow directory already exists in your home directory, and create it if it does not.

  • Lines 2-15: Using the cat command, write the configuration text to the .nextflow/config file, which specifies the cache locations for Singularity and Conda.

  • What are the parameters you are setting?

  • Lines 3-6 set the directory where remote Singularity images are stored and direct Nextflow to automatically mount host paths in the executed container.

  • Lines 7-9 set the directory where Conda environments are stored.

  • Lines 10-14 set default directives for processes in your pipeline. Note that the executor is set to pbspro on line 11.

  • Line 15 includes a local configuration file that points to the genome files required for pipelines such as nf-core/rnaseq.

More in-depth information on Nextflow configuration is available here: https://www.nextflow.io/docs/latest/config.html.
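
After writing the config file, it can be worth confirming that the settings landed as expected. The sketch below prints the cacheDir and executor lines from the file. Because the file was written with an unquoted heredoc, the cacheDir values should show your absolute home path rather than a literal $HOME. The helper name show_nf_settings is hypothetical.

```shell
# Hypothetical helper: print the cache and executor settings from a config file.
show_nf_settings() {
    if [ -f "$1" ]; then
        grep -E "cacheDir|executor" "$1" \
            || echo "No cacheDir or executor settings in $1"
    else
        echo "No config found at $1"
    fi
}

show_nf_settings "$HOME/.nextflow/config"
```

Running nextflow config should also print the configuration that Nextflow resolves, although the exact output may differ between Nextflow versions.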

...

At QUT, we use Singularity, so we specify: -profile singularity.

...

Install the pipeline and test that it installed successfully

Pipelines generally include test code that can be run to make sure installation was successful.

From the command line

When the Nextflow pipeline is run on the command line, the progress of the analysis is displayed in real time.

Run the following command from your home directory:

...

It will first display the version of the pipeline which was downloaded: version 2.1.0.

It will then list all the parameters that differ from the pipeline default.

...

Before running a process, Nextflow will download the required Singularity image.

In the screenshot below, all the jobs which will be run are listed.

...

At the bottom, the message ‘Pipeline completed successfully’ will be printed along with the duration, the CPU hours and the number of jobs that ran to completion.

...

Launching Nextflow using a PBS script

Launching the Nextflow pipeline from the command line enabled us to follow what the pipeline does in real time. However, you must keep the terminal window from which you launched the analysis open until the analysis is done.

Now that you have learnt how to run Nextflow from the command line, we will use a PBS script to launch the analysis.

Move back into your home directory.

Create a smrnaseq_test.sh script by running the following command:

Code Block
cat <<'EOF' > $HOME/smrnaseq_test.sh
#!/bin/bash -l
#PBS -N smrnaseq_test
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=1:00:00

cd $PBS_O_WORKDIR
module load java
export NXF_OPTS='-Xms1g -Xmx4g'
nextflow run nf-core/smrnaseq -profile test,singularity --outdir results -r 2.1.0
EOF

Make the script executable and then submit your job to the PBS queue by running the following commands:

Code Block
chmod +x smrnaseq_test.sh
qsub smrnaseq_test.sh
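
Once the job is submitted, the progress that was previously shown in the terminal is no longer visible. A minimal sketch for checking on the job is shown below, assuming the PBS client tools are on your PATH and that you run it from the directory in which Nextflow was launched. The helper name nf_log_tail is hypothetical.

```shell
# Hypothetical helper: print the last lines of the Nextflow log in a directory.
nf_log_tail() {
    if [ -f "$1/.nextflow.log" ]; then
        tail -n "${2:-20}" "$1/.nextflow.log"
    else
        echo "No .nextflow.log in $1 yet"
    fi
}

# List your queued and running jobs, if qstat is available on this machine.
if command -v qstat >/dev/null 2>&1; then
    qstat -u "${USER:-$(id -un)}"
fi

# Nextflow writes .nextflow.log in the directory it was launched from.
nf_log_tail "$PWD"
```

The log file only appears once the job has started running, so an empty result shortly after submission is expected.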

Input specifications

Samplesheet input

...

Column names have to be specified in a header row as shown in the samplesheet example below:

...

sample,fastq_1
Clone1_N1,s3://ngi-igenomes/test-data/smrnaseq/C1-N1-R1_S4_L001_R1_001.fastq.gz
Clone1_N3,s3://ngi-igenomes/test-data/smrnaseq/C1-N3-R1_S6_L001_R1_001.fastq.gz
Clone9_N1,s3://ngi-igenomes/test-data/smrnaseq/C9-N1-R1_S7_L001_R1_001.fastq.gz
Clone9_N2,s3://ngi-igenomes/test-data/smrnaseq/C9-N2-R1_S8_L001_R1_001.fastq.gz
Clone9_N3,s3://ngi-igenomes/test-data/smrnaseq/C9-N3-R1_S9_L001_R1_001.fastq.gz
Control_N1,s3://ngi-igenomes/test-data/smrnaseq/Ctl-N1-R1_S1_L001_R1_001.fastq.gz
Control_N2,s3://ngi-igenomes/test-data/smrnaseq/Ctl-N2-R1_S2_L001_R1_001.fastq.gz
Control_N3,s3://ngi-igenomes/test-data/smrnaseq/Ctl-N3-R1_S3_L001_R1_001.fastq.gz
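
For your own data, a samplesheet with the same two columns can be generated rather than typed by hand. The sketch below is one way to do this, assuming single-end reads stored as *.fastq.gz files in one folder. The helper name make_samplesheet, the $HOME/fastq folder and the rule of using the file name as the sample name are all assumptions for illustration.

```shell
# Hypothetical helper: print a two-column samplesheet for every *.fastq.gz in a folder.
make_samplesheet() {
    echo "sample,fastq_1"
    for fq in "$1"/*.fastq.gz; do
        [ -e "$fq" ] || continue            # skip if the glob matched nothing
        name=$(basename "$fq" .fastq.gz)    # assumed rule: file name minus extension
        echo "$name,$fq"
    done
}

# Example call on a hypothetical folder; prints only the header if it is empty or missing.
make_samplesheet "$HOME/fastq"
```

Redirect the output to a file (for example, > samplesheet.csv) and check the sample names before passing it to the pipeline.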

...

For the nf-core/rnaseq pipeline, the samplesheet has to be a comma-separated file with the following 4 columns:

...

Column names have to be specified in a header row as shown in the samplesheet example below:

...

sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto

...

Please note that in this example, the same sample (CONTROL_REP1) was sequenced across 3 lanes. The nf-core/rnaseq pipeline will concatenate the raw reads before performing any downstream analysis.
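
Because rows that share a sample name are merged, a typo in the first column silently splits or merges samples. The sketch below counts how many rows each sample name has, so the grouping can be checked before launching the pipeline. The helper name rows_per_sample is hypothetical, and the sketch assumes the sample name is in the first column with no quoted commas.

```shell
# Hypothetical helper: count samplesheet rows (e.g., lanes) per sample name.
rows_per_sample() {
    awk -F, 'NR > 1 && $1 != "" { n[$1]++ }
             END { for (s in n) print s "," n[s] }' "$1" | sort
}

# Example call; samplesheet.csv is an assumed file name.
if [ -f samplesheet.csv ]; then
    rows_per_sample samplesheet.csv
fi
```

For the samplesheet shown above, this reports a single sample with three rows.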

...

We can use the tree command to list the contents of the work directory. Note: By default, tree does not print hidden files (those beginning with a dot). Use the -a option to view all files.

Code Block
tree -a work
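
The hidden files matter because Nextflow stores each task's script, logs and exit status in them (for example, .command.sh, .command.log and .exitcode). As a minimal sketch, the helper below lists task folders whose recorded exit code is not 0. The name failed_tasks is hypothetical.

```shell
# Hypothetical helper: list task folders whose recorded exit code is not 0.
failed_tasks() {
    [ -d "$1" ] || return 0                 # nothing to do if the folder is missing
    find "$1" -name .exitcode -type f | while read -r f; do
        code=$(cat "$f")
        if [ "$code" != "0" ]; then
            echo "$(dirname "$f") exit=$code"
        fi
    done
}

failed_tasks work
```

For any folder it reports, reading .command.log and .command.sh in that folder is usually the quickest way to see what the task ran and why it failed.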

Example of work directory:

...