...
...
...
...
...
...
...
...
Overview
Nextflow is a pipeline engine that takes advantage of the batch nature of the HPC environment to run bioinformatics workflows quickly and efficiently.
...
The ampliseq pipeline is a bioinformatics analysis pipeline for 16S rRNA amplicon sequencing data. It combines multiple 16S analysis tools in a single pipeline to produce a variety of statistical and quantitative outputs that can be used in further analysis or in publications.
https://github.com/nf-core/ampliseq
...
Install NextFlow locally
...
Installing NextFlow on your HPC account
NextFlow needs to be set up locally for each user account on the HPC. Instructions for installing and setting up NextFlow for your account are here:
Follow the instructions in the above link. Once you have successfully run the Nextflow test (nextflow run hello generates ‘Hello world!’, etc.), run a test of the ampliseq pipeline (next section).
Running your NextFlow pipelines
NextFlow should never be run on the head node on the HPC (i.e. the node you are automatically logged on to) but should instead be run on a different node using the PBS job scheduler. For instructions on submitting PBS jobs, see here:
Running PBS jobs on the HPC Confluence page
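For instance, a minimal PBS submission script for a Nextflow run might look like the following. This is a sketch only: the job name, resource requests, and the ‘java’ module name are assumptions that should be adjusted for your own analysis (check available modules with module avail).

```shell
# Write an example PBS submission script (illustrative sketch; resource
# requests and the 'java' module name are assumptions for this HPC).
cat > launch_nextflow.pbs <<'EOF'
#!/bin/bash
#PBS -N nextflow_ampliseq
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=4:mem=8gb

cd $PBS_O_WORKDIR          # start in the directory qsub was called from
module load java           # Nextflow requires Java
nextflow run nf-core/ampliseq -profile test,singularity
EOF

echo "Submit with: qsub launch_nextflow.pbs"
```

Nextflow itself then submits each pipeline step as its own job, so the resources requested here only need to cover the Nextflow launcher process.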
Directory structure
When a NextFlow pipeline is run, it generates multiple directories and output files. We therefore recommend you create a directory where you run all your NextFlow pipelines, so that you don’t have output directories and files scattered across your home directory.
Code Block
cd ~
mkdir nextflow

This creates a ‘nextflow’ subdirectory in your home directory. You can then create individual subdirectories for each pipeline you run (e.g. cd nextflow, then mkdir ampliseq_test).
Alternative to submitting a PBS job: an interactive session.
Run tmux first, so the job keeps running when you log off:
Code Block
tmux
Then start an interactive PBS session:
Code Block
qsub -I -S /bin/bash -l walltime=168:00:00 -l select=1:ncpus=4:mem=8gb
...
...
Load the Java module first (Nextflow requires Java): module load java
...
Test NextFlow: nextflow run nf-core/ampliseq -profile test,singularity --metadata "Metadata.tsv"
...
This failed: the data files listed in Metadata.tsv were not found. Local copies of the files are needed when using this parameter.
...
Tried without the metadata parameter: nextflow run nf-core/ampliseq -profile test,singularity
...
This initially ran, then failed during trimming: 'cutadapt: error: pigz: abort: read error on 1_S103_L001_R2_001.fastq.gz (No such file or directory) (exit code 2)'. It appears the pipeline cannot pull down the remote files, so they need to be downloaded locally.
...
The run kept failing with a different error each time, which suggests a connectivity issue on the HPC. One intermittent error: the Metadata.tsv file could not be downloaded.
...
Solution: download all the test files locally, into the root test directory (wget all of them):
...
Then you can run all the following code.
Run an ampliseq test
This test is to see if NextFlow is installed correctly on your account and if you can run ampliseq. It uses a small built-in dataset to run ampliseq, running a full analysis and producing all the output directories and files.
If this test run fails, contact us at eResearch - eresearch@qut.edu.au - and we can examine the issue.
Note that running the test run on the HPC requires some additional steps beyond those listed on the ampliseq website. Primarily, ampliseq has issues automatically downloading external files, so we need to download these files locally and then change the ampliseq config file to point to the downloaded files.
Downloading test files
There are 3 files, or sets of files, that need to be downloaded first. The command to download each of these is included below.
First, go to your nextflow directory (we are assuming you called it ‘nextflow’; see the ‘Running your NextFlow pipelines’ section above for directory structure suggestions) and create a subdirectory called ‘ampliseq_test’:
Code Block
cd ~/nextflow
mkdir ampliseq_test
cd ampliseq_test
Download the 3 required datafile sets:
1. The metadata file.
Code Block
wget https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/Metadata.tsv
2. The taxonomic classifier file.
Code Block
wget https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/GTGYCAGCMGCCGCGGTAA-GGACTACNVGGGTWTCTAAT-gg_13_8-85-qiime2_2019.7-classifier.qza
3. The test datafiles (fastq files). Note that the first command below is a very long single line: it writes all the datafile download URLs to a text file, which is then used by the wget command below it.
Code Block
printf 'https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1_S103_L001_R1_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1_S103_L001_R2_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1a_S103_L001_R1_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1a_S103_L001_R2_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2_S115_L001_R1_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2_S115_L001_R2_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2a_S115_L001_R1_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2a_S115_L001_R2_001.fastq.gz' > datafiles.txt
wget -i datafiles.txt
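As a quick sanity check after the downloads (the expected count of 8 comes from the file list above):

```shell
# Count the downloaded test fastq files; with all downloads successful
# there should be 8 of them, plus Metadata.tsv and the classifier .qza.
count=$(ls -1 ./*.fastq.gz 2>/dev/null | wc -l)
echo "fastq.gz files found: $count (expected 8)"
```

If any files are missing, re-run wget -i datafiles.txt; wget will re-fetch them.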
Ampliseq test config file
NextFlow pipelines have a series of default settings, which can be overridden by modifying a config file. The default config file for ampliseq points to downloadable datafile locations. As we’ve downloaded the datafiles locally, we need to modify the config file to point to these local files instead.
Make sure you are in the directory you created for this test (containing the downloaded test datafiles) - cd ~/nextflow/ampliseq_test
Load a text editor first. Nano will do.
Code Block
module load nano
Create and edit a ‘nextflow.config’ file
Code Block
nano nextflow.config
This will create and open an empty text file in nano. Into this file copy and paste the following:
Code Block
params {
    classifier = "GTGYCAGCMGCCGCGGTAA-GGACTACNVGGGTWTCTAAT-gg_13_8-85-qiime2_2019.7-classifier.qza"
    metadata = "Metadata.tsv"
    readPaths = [
        ['1_S103', ['1_S103_L001_R1_001.fastq.gz', '1_S103_L001_R2_001.fastq.gz']],
        ['1a_S103', ['1a_S103_L001_R1_001.fastq.gz', '1a_S103_L001_R2_001.fastq.gz']],
        ['2_S115', ['2_S115_L001_R1_001.fastq.gz', '2_S115_L001_R2_001.fastq.gz']],
        ['2a_S115', ['2a_S115_L001_R1_001.fastq.gz', '2a_S115_L001_R2_001.fastq.gz']]
    ]
}
In nano you can then save the file with Ctrl+O and exit nano with Ctrl+X.
Ampliseq test command
Run the following command to test the ampliseq pipeline.
Code Block
nextflow run nf-core/ampliseq -profile test,singularity --metadata "Metadata.tsv"
This will submit multiple PBS jobs, one for each test file and analysis step. How fast it runs depends on how large the queue for the HPC is. If there is no queue (rare), the test run will finish in approximately 30 minutes. Regardless, this does not need to be actively monitored; you can check later to see whether the test run succeeded or failed.
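One way to check on a run later is to look at the .nextflow.log file that Nextflow writes in the directory the run was launched from. The snippet below is a sketch; the exact wording of the final summary lines can vary between Nextflow versions.

```shell
# Run this from the launch directory, e.g. ~/nextflow/ampliseq_test.
# The last lines of .nextflow.log report whether the run succeeded.
if [ -f .nextflow.log ]; then
    status=$(tail -n 20 .nextflow.log)
else
    status="No .nextflow.log found in $(pwd) - has a run been started here?"
fi
echo "$status"
```

Alternatively, the nextflow log command lists previous runs in the current directory along with their status.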
Ampliseq output
As can be seen in the test results (see the above section), ampliseq produces several directories containing various analysis outputs. These are outlined here:
https://nf-co.re/ampliseq/1.1.3/output
Briefly, these include quality control and a variety of taxonomic diversity and abundance analyses, using QIIME2 and DADA2 as core analysis tools.
These directories contain various tables and figures that can be used in either downstream analysis or directly in publications.
Running NextFlow’s ampliseq pipeline
Example run on a real 16S dataset (Mahsa’s data). The Silva reference database was downloaded from: https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip
...
Code Block
nextflow run nf-core/ampliseq -profile singularity --manifest manifest.txt
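The manifest file lists each sample and the paths to its fastq files. The column names and layout shown below are assumptions and should be checked against the nf-core/ampliseq parameter documentation; the sample IDs and paths are made up for illustration.

```shell
# Create a hypothetical tab-separated manifest (illustrative only: the
# header column names, sample IDs, and file paths are assumptions;
# check the nf-core/ampliseq docs for the required format).
printf 'sampleID\tforwardReads\treverseReads\n' > manifest.txt
printf 'sample1\t/path/to/sample1_R1.fastq.gz\t/path/to/sample1_R2.fastq.gz\n' >> manifest.txt
printf 'sample2\t/path/to/sample2_R1.fastq.gz\t/path/to/sample2_R2.fastq.gz\n' >> manifest.txt

cat manifest.txt
```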
It may also be necessary to add something like the following to your .nextflow/config file, so that pipeline processes are submitted through the PBS Pro scheduler (possibly this is only needed for Nextflow Tower; replace the /data1/whatmorp path with a directory of your own):
Code Block
process {
    executor = 'pbspro'
    scratch = 'true'
    beforeScript = {
        """
        mkdir -p /data1/whatmorp/singularity/mnt/session
        source $HOME/.bashrc
        source $HOME/.profile
        """
    }
}
...
If you haven’t been set up on the HPC, or haven’t used it previously, click on this link for information on how to get access to and use the HPC:
Need a link here for HPC access and usage
Creating a shared workspace on the HPC
...
To request a node using PBS, submit a shell script containing your RAM/CPU/analysis time requirements and the code needed to run your analysis. For an overview of submitting a PBS job, see here:
Need a link here for creating PBS jobs
Alternatively, you can start up an ‘interactive’ node, using the following:
...