Fetching pipeline code
The pull command allows you to download the latest version of a project from a GitHub repository, or to update it if that repository has previously been downloaded to your home directory.
nextflow pull nf-core/<pipeline>
Note that Nextflow will also fetch the pipeline code automatically the first time you run the command below:
nextflow run nf-core/<pipeline>
For reproducibility, it is good practice to explicitly reference the pipeline version that you wish to use with the -revision/-r flag.
In the example below, we pull version 3.12.0 of the rnaseq pipeline:
nextflow pull nf-core/rnaseq -revision 3.12.0
Downloaded pipeline projects are stored in the folder $HOME/.nextflow/assets.
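To see which pipelines have already been fetched, you can look inside that folder. The sketch below assumes the assets folder is organised as one sub-folder per organisation containing one sub-folder per pipeline (for example nf-core/rnaseq); the function name list_pulled_pipelines is hypothetical and not part of Nextflow.

```shell
# Hypothetical helper: print cached pipelines as org/name, one per line.
# Assumes the layout <assets_dir>/<org>/<pipeline>/.
list_pulled_pipelines() {
  local assets_dir="${1:-$HOME/.nextflow/assets}"
  # Nothing has been pulled yet if the folder does not exist.
  [ -d "$assets_dir" ] || return 0
  local d
  for d in "$assets_dir"/*/*/; do
    # Skip the unexpanded glob when there are no matching folders.
    [ -d "$d" ] || continue
    d="${d%/}"
    printf '%s\n' "${d#"$assets_dir"/}"
  done
}
```

Calling list_pulled_pipelines with no argument inspects $HOME/.nextflow/assets. Nextflow also provides a built-in nextflow list command that reports the downloaded projects.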
Software requirements for pipelines
Nextflow pipeline software dependencies are specified using either Docker, Singularity or Conda. Nextflow itself handles downloading the containers and creating the Conda environments. The choice is set using the -profile {docker,singularity,conda} parameter when you run Nextflow.
At QUT, we use Singularity, so we would specify: -profile singularity.
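As an illustration of how a pinned revision and a profile fit together on one command line, here is a minimal sketch. The helper nf_run_cmd is hypothetical: it only prints the command so you can review it, and it accepts only the three engines named above. A real run would also need the pipeline's own input parameters.

```shell
# Hypothetical helper: print (do not execute) a nextflow run command
# with an explicit revision and software profile.
# Usage: nf_run_cmd <pipeline> <revision> <engine> [outdir]
nf_run_cmd() {
  local pipeline="$1" revision="$2" engine="$3" outdir="${4:-results}"
  case "$engine" in
    docker|singularity|conda) ;;
    *)
      echo "unknown engine: $engine (expected docker, singularity or conda)" >&2
      return 1
      ;;
  esac
  printf 'nextflow run %s -r %s -profile %s --outdir %s\n' \
    "$pipeline" "$revision" "$engine" "$outdir"
}
```

For example, nf_run_cmd nf-core/rnaseq 3.12.0 singularity prints a command that pins version 3.12.0 and uses Singularity containers.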
Install and test that the pipeline installed successfully
Pipelines generally include test code that can be run to make sure installation was successful.
From the command line
When you run a Nextflow pipeline on the command line, the progress of the analysis is displayed in real time.
As a first exercise we will download and run nf-core/smrnaseq, a bioinformatics best-practice analysis pipeline for small RNA-Seq. We will use the test data provided by the developers to check that the pipeline installed successfully. This control dataset contains 8 samples.
Run the following commands:
cd $HOME/workshop/2024-2/session3
mkdir smrnaseq_cl
cd smrnaseq_cl
export NXF_OPTS='-Xms1g -Xmx4g'
nextflow pull file:///work/training/smrnaseq
nextflow run file:///work/training/smrnaseq -profile test,singularity --outdir results -r 2.3.1
Line 1: Move to the directory created for this workshop.
Line 2: Make a temporary folder called smrnaseq_cl for Nextflow to test the smrnaseq pipeline.
Line 3: Change directory to the newly created folder smrnaseq_cl.
Line 4: In some cases, the Nextflow Java virtual machine can request a large amount of memory. This line limits it to an initial heap of 1 GB and a maximum heap of 4 GB.
Line 5: Download the smrnaseq pipeline code.
Line 6: Run version 2.3.1 of the pipeline with its test profile and Singularity, writing the outputs to the results folder.
This will download the smrnaseq pipeline and then run the test code. It should take ~20-30 minutes to run to completion.
Nextflow will first download the pipeline:
It will then display the version of the pipeline which was downloaded: version 2.3.1.
It will also list all the parameters that differ from the pipeline default.
Before running a process, it will download the required singularity images and required reference and input files for testing.
In the screenshot below, all the jobs which will be run are listed.
We can see that 7 jobs have started:
FASTQ_FASTQC_UMITOOLS_FASTP:FASTQC_RAW: 3 jobs are running, the latest job that started is for sample Clone9_N1
FASTQ_FASTQC_UMITOOLS_FASTP:FASTP: 3 jobs are running, the latest job that started is for sample Clone9_N1
INDEX_GENOME (genome.fa): 1 job has started
At the bottom you can see that 6 files (including test fastq.gz input files and reference files) have also been downloaded.
The jobs have been submitted to the PBS queue.
You can check the full list of jobs that have been submitted at any point in time by opening a separate terminal and using the command:
qstat -u $USER
It will display the jobs that have been submitted to the PBS queue.
For example, in the first screenshot, 3 jobs are queued.
And in the second screenshot, 6 jobs are running and 2 are queued.
Not familiar with checking PBS job status? Please refer to Checking on the Job Status section in the Intro to HPC.
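If you would rather get a quick count than read the full table, the qstat output can be summarised with awk. This is a sketch only: it assumes the PBS Professional layout of qstat -u, where a dashed separator line precedes the job rows and the job state (R for running, Q for queued) is the tenth column. The function name count_jobs_by_state is hypothetical.

```shell
# Hypothetical helper: read `qstat -u <user>` output on stdin and
# report how many jobs are running (R) and queued (Q).
# Assumes the job state is the 10th whitespace-separated column.
count_jobs_by_state() {
  awk '
    /^-+ / { in_table = 1; next }          # dashed line ends the header
    in_table && NF >= 10 { n[$10]++ }      # tally each job row by state
    END { printf "running=%d queued=%d\n", n["R"], n["Q"] }
  '
}
```

On the HPC this could be used as qstat -u $USER | count_jobs_by_state. If your site's qstat prints a different column layout, the column number would need adjusting.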
Going back to the terminal from which you launched the Nextflow analysis, you can check the nextflow log to see how the analysis is progressing.
For example, in the screenshot below, taken halfway through the Nextflow analysis, several processes have run to completion for all 8 samples tested.
The process FASTQ_FASTQC_UMITOOLS_FASTP:FASTQC_RAW, for instance, appears as 100% completed (i.e. 8 of 8 samples).
The process NFCORE_SMRNASEQ:MIRNA_QUANT:BOWTIE_MAP_MATURE is running for sample Control_N2_mature. This is the first sample of the batch to go through this process.
The process NFCORE_SMRNASEQ:MIRNA_QUANT:BOWTIE_MAP_SEQCLUSTER has already completed for 2 samples and it is running for a third sample Clone9_N3_seqcluster.
This is the output you should get when your Nextflow job has run to completion.
At the bottom, the message ‘Pipeline completed successfully’ will be printed along with the duration, the CPU hours and the number of jobs that ran to completion.
To see the 2 folders that Nextflow created (results and work), run the command:
ls
You can inspect the results which have been output by typing:
ls results
You will see that the pipeline has placed results under different folders matching the steps/processes that were run:
bbsplit edger fastp fastqc genome index mirtop mirtrace multiqc pipeline_info salmon samtools star_salmon trimgalore unmapped
You can browse a couple of results folders to check what sort of outputs were generated by the pipeline.
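One way to get an overview before opening individual folders is to count the files each step produced. The sketch below assumes only that the outputs sit in sub-folders of the results directory; the function name summarise_results is hypothetical.

```shell
# Hypothetical helper: for each sub-folder of a results directory,
# print its name and the number of files it contains (tab-separated).
summarise_results() {
  local results_dir="${1:-results}"
  local d
  for d in "$results_dir"/*/; do
    # Skip the unexpanded glob when there are no sub-folders.
    [ -d "$d" ] || continue
    printf '%s\t%s\n' "$(basename "$d")" \
      "$(find "$d" -type f | wc -l | tr -d ' ')"
  done
}
```

Run from the smrnaseq_cl folder, summarise_results results lists one line per output folder, which makes empty or unexpectedly small folders easy to spot.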
Tip: If you are having trouble running the nf-core/smrnaseq pipeline, some pre-computed results are provided under the folder /work/training/nextflow_intro/smrnaseq_cl.
You will learn more about how to run the nf-core/smrnaseq pipeline in session 6.
Launching Nextflow using a PBS script
Launching the Nextflow pipeline from the command line let us follow what the pipeline does in real time. However, you have to keep the terminal from which you launched the analysis open until the analysis is done.
Now that you have learnt how to run Nextflow from the command line, we will use a PBS script to download the nf-core/rnaseq pipeline and test it using its test data. This is the method we recommend for running Nextflow pipelines on the HPC.
This time we will download and run nf-core/rnaseq, a bioinformatics pipeline that can be used to analyse RNA sequencing data obtained from organisms with a reference genome and annotation. We will again test it using the test data provided by the developers.
Create a separate rnaseq_pbs folder in your workshop directory and move into it:
mkdir -p $HOME/workshop/2024-2/session3/rnaseq_pbs
cd $HOME/workshop/2024-2/session3/rnaseq_pbs
Create the script file rnaseq_test.sh by running the following command:
cat <<EOF > rnaseq_test.sh
#!/bin/bash -l
#PBS -N nfrnaseq_test
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=6:00:00

cd \$PBS_O_WORKDIR
module load java
export NXF_OPTS='-Xms1g -Xmx4g'
nextflow run nf-core/rnaseq -r 3.14.0 -profile test,singularity --outdir results
EOF
The line numbers below count the cat <<EOF line as Line 1.
Line 3: Set your PBS job name to nfrnaseq_test.
Line 4: Specify the memory and CPU resources that you want to allocate to your job.
Line 5: Allocate 6 hours for your job to run to completion.
Line 7: Change directory to $PBS_O_WORKDIR, a special environment variable created by PBS that points to the folder from which you ran the qsub command.
Line 8: Load Java.
Line 9: In some cases, the Nextflow Java virtual machine can request a large amount of memory. This line limits it to an initial heap of 1 GB and a maximum heap of 4 GB.
Line 10: Run the nf-core/rnaseq pipeline using the test data provided.
You can check the content of the PBS script you just created using the command:
cat rnaseq_test.sh
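Beyond reading the script, you may want to check it for syntax errors before it waits in the queue. The following is a minimal sketch: bash -n parses a script without running it, and grep shows the #PBS header lines that PBS will read. The function name check_pbs_script is hypothetical.

```shell
# Hypothetical helper: syntax-check a PBS script and list its directives.
# Returns non-zero if the script has a syntax error or no #PBS lines.
check_pbs_script() {
  local script="$1"
  # bash -n parses the file without executing any command in it.
  bash -n "$script" || return 1
  # Print the resource requests declared in the header comments.
  grep '^#PBS' "$script"
}
```

For this exercise it would be called as check_pbs_script rnaseq_test.sh, which should print the three #PBS lines.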
Make the script executable and then submit your job to the PBS queue by running the following commands:
chmod +x rnaseq_test.sh
qsub rnaseq_test.sh
Once again, you can monitor your jobs using the qstat -u $USER command.
The test should take ~30 minutes to run.
Once completed, you can check the content of the folder using the command ls.
You will see a results folder, along with 2 PBS log files: nfrnaseq_test.e[pbs_job_id] and nfrnaseq_test.o[pbs_job_id].
Not familiar with checking the output of a PBS job? Please review the Checking the Output section of Submitting PBS Jobs part 2 from Intro to HPC.
The nfrnaseq_test.o* file provides a log of the Nextflow processes. If you scroll to the bottom, you can check whether the pipeline ran successfully. You should see something similar to this:
-[nf-core/rnaseq] Pipeline completed successfully -
Completed at: 23-Sep-2024 17:29:26
Duration : 32m 11s
CPU hours : 0.7
Succeeded : 194
PBS Job 10374429.pbs
CPU time : 00:02:15
Wall time : 00:32:35
Mem usage : 1032552kb
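Instead of scrolling through the log, you could search it for the success message. The sketch below assumes the log contains the text 'Pipeline completed successfully' as in the example above; the function names latest_pbs_log and pipeline_succeeded are hypothetical.

```shell
# Hypothetical helper: print the most recently modified PBS stdout log
# for a given job name (files named <job_name>.o<job_id>), if any.
latest_pbs_log() {
  local job_name="$1"
  ls -t "$job_name".o* 2>/dev/null | head -n 1
}

# Hypothetical helper: exit 0 if the log reports a successful run.
pipeline_succeeded() {
  local log="$1"
  grep -q 'Pipeline completed successfully' "$log"
}
```

From the rnaseq_pbs folder, pipeline_succeeded "$(latest_pbs_log nfrnaseq_test)" && echo OK would print OK only when the most recent log contains the success message.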
If you are having trouble running the nf-core/rnaseq pipeline, some pre-computed results are provided under the folder /work/training/nextflow_intro/rnaseq_pbs.
You will learn more about how to run the nf-core/rnaseq pipeline in session 4.
More details about options which can be used with the nextflow command can be found here.