Overview

Nextflow is a pipeline engine that can take advantage of the batch nature of the HPC environment to run bioinformatics workflows quickly and efficiently.

The ampliseq pipeline is a bioinformatics analysis pipeline for 16S rRNA amplicon sequencing data. It combines multiple 16S analysis tools in a single pipeline to produce a variety of statistical and quantitative outputs that can be used in further analysis or in publications.

https://nf-co.re/ampliseq

https://github.com/nf-core/ampliseq


Installing NextFlow on your HPC account

NextFlow needs to be set up locally for each user account on the HPC. Instructions for installing and setting up NextFlow for your account are here:

Nextflow

Follow the instructions in the above link. Once you have successfully run the NextFlow test (i.e. nextflow run hello prints Hello world!), run a test of the ampliseq pipeline (next section).

Running your NextFlow pipelines

NextFlow should never be run on the head node on the HPC (i.e. the node you are automatically logged on to) but should instead be run on a different node using the PBS job scheduler. For instructions on submitting PBS jobs, see here:

Running PBS jobs on the HPC Confluence page
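As a sketch of what such a submission looks like, the following creates a minimal PBS script that runs the Nextflow ‘hello’ test. The script name and resource values here are assumptions; adjust them to your own requirements.

```shell
# Sketch of a PBS submission script for running Nextflow.
# Script name and resource values are placeholders - adjust as needed.
cat > run_nextflow.pbs <<'EOF'
#!/bin/bash
#PBS -N nextflow_hello
#PBS -l walltime=01:00:00
#PBS -l select=1:ncpus=2:mem=4gb

# Move to the directory the job was submitted from
cd $PBS_O_WORKDIR
module load java
nextflow run hello
EOF
echo "Created run_nextflow.pbs - submit it with: qsub run_nextflow.pbs"
```

The `#PBS -l` lines request walltime, CPUs and memory from the scheduler; `cd $PBS_O_WORKDIR` moves into the directory the job was submitted from before running the pipeline.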

Directory structure

When a NextFlow pipeline is run, it generates multiple directories and output files. We therefore recommend creating a single directory in which to run all your NextFlow pipelines, so that output directories and files are not scattered across your home directory.

Code Block
cd ~
mkdir nextflow

This creates a ‘nextflow’ subdirectory in your home directory. You can then create an individual subdirectory for each pipeline you run (e.g. cd nextflow, then mkdir ampliseq_test).

Alternative to submitting a PBS job: an interactive session.

Run tmux first, so the job keeps running when you log off.

Code Block
tmux

Interactive PBS session:

Code Block
qsub -I -S /bin/bash -l walltime=168:00:00 -l select=1:ncpus=4:mem=8gb

Troubleshooting notes (initial test attempts)

In the interactive session, start tmux and load Java:

Code Block
tmux
module load java

First test of NextFlow: nextflow run nf-core/ampliseq -profile test,singularity --metadata "Metadata.tsv"

This failed: the data files listed in Metadata.tsv were not found. Local copies of the files are needed when using this parameter.

Tried again without metadata: nextflow run nf-core/ampliseq -profile test,singularity

This initially ran, then failed during trimming: 'cutadapt: error: pigz: abort: read error on 1_S103_L001_R2_001.fastq.gz (No such file or directory) (exit code 2)'. It appears the pipeline cannot pull down the remote files, so they need to be downloaded locally.

Subsequent runs kept failing with a different error each time, which looks like an HPC connectivity issue. One intermittent error: the Metadata.tsv file could not be downloaded.

Solution: download all the test files locally, into the root test directory (wget all of them). With the files in place, the code in the following sections runs successfully.

Run an ampliseq test

This test checks whether NextFlow is installed correctly on your account and whether you can run ampliseq. It runs ampliseq on a small built-in dataset, performing a full analysis and producing all the output directories and files.

If this test run fails, contact us at eResearch - eresearch@qut.edu.au - and we can examine the issue.

Note that running the test on the HPC requires some additional steps beyond those listed on the ampliseq website. Primarily, there are issues with ampliseq automatically downloading external files, so we need to download these locally and then change the ampliseq config file to point to the downloaded files.

Downloading test files

There are 3 files, or sets of files, that need to be downloaded first. The command to download each of these is included below.

First, go to your nextflow directory (we assume you called it ‘nextflow’; see the ‘Directory structure’ section above) and create a subdirectory called ‘ampliseq_test’:

Code Block
cd ~/nextflow
mkdir ampliseq_test
cd ampliseq_test

Download the 3 required datafile sets:

1. The metadata file.

Code Block
wget https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/Metadata.tsv

2. The taxonomic classifier file.

Code Block
wget https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/GTGYCAGCMGCCGCGGTAA-GGACTACNVGGGTWTCTAAT-gg_13_8-85-qiime2_2019.7-classifier.qza

3. The test datafiles (fastq files). Note that the first command below is a single, very long line: it writes all the datafile download URLs to a text file, which is then used by the wget command that follows it.

Code Block
printf 'https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1_S103_L001_R1_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1_S103_L001_R2_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1a_S103_L001_R1_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1a_S103_L001_R2_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2_S115_L001_R1_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2_S115_L001_R2_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2a_S115_L001_R1_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2a_S115_L001_R2_001.fastq.gz' > datafiles.txt
wget -i datafiles.txt

NOTE: we should have a standard place on the HPC for these files, that we can point (or symlink) to in the local nextflow.config script.

Ampliseq test config file

NextFlow pipelines have a series of default settings, which can be overridden by modifying a config file. The default config file for ampliseq points to downloadable datafile locations. As we’ve downloaded the datafiles locally, we need to modify the config file to point to these local files instead.

Make sure you are in the directory you created for this test (containing the downloaded test datafiles): cd ~/nextflow/ampliseq_test

Load a text editor; nano will do.

Code Block
module load nano

Create and edit a ‘nextflow.config’ file

Code Block
nano nextflow.config

This will create and open an empty text file in nano. Into this file copy and paste the following:

Code Block
params {
  classifier = "GTGYCAGCMGCCGCGGTAA-GGACTACNVGGGTWTCTAAT-gg_13_8-85-qiime2_2019.7-classifier.qza"
  metadata = "Metadata.tsv"
  readPaths = [
    ['1_S103', ['1_S103_L001_R1_001.fastq.gz', '1_S103_L001_R2_001.fastq.gz']],
    ['1a_S103', ['1a_S103_L001_R1_001.fastq.gz', '1a_S103_L001_R2_001.fastq.gz']],
    ['2_S115', ['2_S115_L001_R1_001.fastq.gz', '2_S115_L001_R2_001.fastq.gz']],
    ['2a_S115', ['2a_S115_L001_R1_001.fastq.gz', '2a_S115_L001_R2_001.fastq.gz']]
  ]
}

In nano you can then save the file with Ctrl+O and exit with Ctrl+X.

(Note: with these local files in place the test run succeeded; the earlier failures appear to have been a NextFlow connectivity issue, at least on the HPC.)
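Before launching, it can help to confirm that the local files referenced in nextflow.config actually exist. The helper below is a hypothetical sketch (it is not part of ampliseq), written for a typical POSIX shell:

```shell
# Hypothetical helper: report any files missing from the current directory
# before starting the pipeline.
check_files() {
  missing=0
  for f in "$@"; do
    if [ ! -e "$f" ]; then
      echo "MISSING: $f"
      missing=1
    fi
  done
  if [ "$missing" -eq 0 ]; then
    echo "All files present"
  fi
  return "$missing"
}

# Example usage (run from ~/nextflow/ampliseq_test):
# check_files Metadata.tsv *.fastq.gz *-classifier.qza
```

If anything is reported as MISSING, re-run the corresponding wget command from the download section above.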

Ampliseq test command

Run the following command to test the ampliseq pipeline.

Code Block
nextflow run nf-core/ampliseq -profile test,singularity --metadata "Metadata.tsv"

This will submit multiple PBS jobs, one for each test file and analysis step. How fast they run depends on the size of the HPC queue; if there is no queue (rare), the test run will finish in approximately 30 minutes. Regardless, the run does not need to be actively monitored; you can check later whether it succeeded or failed.
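One way to check back on a run is to inspect the .nextflow.log file that Nextflow writes in the launch directory. The sketch below assumes the directory layout used earlier in this guide:

```shell
# Sketch: check back on a Nextflow run by inspecting the .nextflow.log file
# in the launch directory (default path is an assumption from this guide).
check_run() {
  dir="${1:-$HOME/nextflow/ampliseq_test}"
  if [ -f "$dir/.nextflow.log" ]; then
    # The last lines of the log show whether the run completed or failed.
    tail -n 20 "$dir/.nextflow.log"
  else
    echo "No .nextflow.log in $dir - the run may not have started yet"
  fi
}

# Example usage:
# check_run ~/nextflow/ampliseq_test
```

You can also list your queued and running PBS jobs with qstat -u $USER.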

Ampliseq output

As can be seen in the test results (see above section), ampliseq produces several directories containing various analysis outputs. These are outlined here:

https://nf-co.re/ampliseq/1.1.3/output

Briefly, these include quality control and a variety of taxonomic diversity and abundance analyses, using QIIME2 and DADA2 as core analysis tools.

These directories contain various tables and figures that can be used in either downstream analysis or directly in publications.

Running NextFlow’s ampliseq pipeline

Mahsa’s data: the Silva database was downloaded from https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip

Code Block
nextflow run nf-core/ampliseq -profile singularity --manifest manifest.txt

Question for Craig: do we need to add any of the following to the .nextflow/config file? Perhaps just for Tower?

Code Block
process {
  executor = 'pbspro'
  scratch = 'true'
  beforeScript = {
    """
    mkdir -p /data1/whatmorp/singularity/mnt/session
    source $HOME/.bashrc
    source $HOME/.profile
    """
  }
}

If you haven’t been set up on the HPC, or haven’t used it previously, click on this link for information on how to get access to and use the HPC:

Need a link here for HPC access and usage 

Creating a shared workspace on the HPC

...

To request a node using PBS, submit a shell script containing your RAM/CPU/analysis time requirements and the code needed to run your analysis. For an overview of submitting a PBS job, see here:

Need a link here for creating PBS jobs
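As a minimal sketch, such a submission script combines #PBS resource directives with the analysis commands. All names and values below are placeholders:

```shell
# Generic PBS job script skeleton (all names and resource values are
# placeholders - set your own RAM/CPU/walltime requirements).
cat > my_job.pbs <<'EOF'
#!/bin/bash
#PBS -N my_analysis
#PBS -l walltime=12:00:00
#PBS -l select=1:ncpus=2:mem=4gb

cd $PBS_O_WORKDIR
# ...analysis commands go here...
EOF
echo "Submit the job with: qsub my_job.pbs"
```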

Alternatively, you can start up an ‘interactive’ node using a qsub -I command, as shown in the ‘Interactive PBS session’ example above.