Overview
Nextflow is a pipeline engine that can take advantage of the batch nature of the HPC environment to run bioinformatics workflows quickly and efficiently.
For more information about Nextflow, see Nextflow - A DSL for parallel and scalable computational pipelines (https://www.nextflow.io/).
The ampliseq pipeline is a bioinformatics analysis pipeline used for 16S rRNA amplicon sequencing data.
https://github.com/nf-core/ampliseq
Install NextFlow locally
PBS session: qsub -I -S /bin/bash -l walltime=168:00:00 -l select=1:ncpus=4:mem=8gb
tmux
module load java
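If Nextflow is not already installed, the standard installer one-liner can be used at this point (a minimal sketch; installing into ~/bin is an assumption, put the launcher wherever suits your setup):
# download the Nextflow launcher (Java is needed, hence the module load above)
curl -s https://get.nextflow.io | bash
# make it executable and move it onto your PATH
chmod +x nextflow
mkdir -p ~/bin && mv nextflow ~/bin/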
Test NextFlow: nextflow run nf-core/ampliseq -profile test,singularity --metadata "Metadata.tsv"
Failed: the data files listed in Metadata.tsv were not found. Local copies are needed for this parameter.
Tried without metadata: nextflow run nf-core/ampliseq -profile test,singularity
Initially ran, then failed during trimming with:
cutadapt: error: pigz: abort: read error on 1_S103_L001_R2_001.fastq.gz (No such file or directory) (exit code 2)
It looks like the pipeline cannot pull down the remote test files, so they need to be pulled locally. It keeps failing with a different error every time; this looks like an HPC connectivity issue.
One intermittent error: the Metadata.tsv file cannot be downloaded.
Solution: download all of the test files locally into the root test directory (wget each of them; a combined download sketch follows the list):
1. Metadata file: https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/Metadata.tsv
2. Classifier: https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/GTGYCAGCMGCCGCGGTAA-GGACTACNVGGGTWTCTAAT-gg_13_8-85-qiime2_2019.7-classifier.qza
3. Fastq files (forward and reverse reads for each of the four test samples):
1_S103: https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1_S103_L001_R1_001.fastq.gz and https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1_S103_L001_R2_001.fastq.gz
1a_S103: https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1a_S103_L001_R1_001.fastq.gz and https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1a_S103_L001_R2_001.fastq.gz
2_S115: https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2_S115_L001_R1_001.fastq.gz and https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2_S115_L001_R2_001.fastq.gz
2a_S115: https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2a_S115_L001_R1_001.fastq.gz and https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2a_S115_L001_R2_001.fastq.gz
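The same downloads as a single sketch (plain bash, run from the root test directory; BASE is just shorthand for the URL prefix listed above):
# download the ampliseq test data into the current (root test) directory
BASE=https://github.com/nf-core/test-datasets/raw/ampliseq/testdata
wget "$BASE/Metadata.tsv"
wget "$BASE/GTGYCAGCMGCCGCGGTAA-GGACTACNVGGGTWTCTAAT-gg_13_8-85-qiime2_2019.7-classifier.qza"
# each test sample has a forward (R1) and reverse (R2) read file
for sample in 1_S103 1a_S103 2_S115 2a_S115; do
    wget "$BASE/${sample}_L001_R1_001.fastq.gz" "$BASE/${sample}_L001_R2_001.fastq.gz"
done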
Create a custom test.config file that points to these local files.
NOTE: we should have a standard place on the HPC for these files, which we can point (or symlink) to from the local nextflow.config.
params {
    classifier = "GTGYCAGCMGCCGCGGTAA-GGACTACNVGGGTWTCTAAT-gg_13_8-85-qiime2_2019.7-classifier.qza"
    metadata = "Metadata.tsv"
    readPaths = [
        ['1_S103', ['1_S103_L001_R1_001.fastq.gz', '1_S103_L001_R2_001.fastq.gz']],
        ['1a_S103', ['1a_S103_L001_R1_001.fastq.gz', '1a_S103_L001_R2_001.fastq.gz']],
        ['2_S115', ['2_S115_L001_R1_001.fastq.gz', '2_S115_L001_R2_001.fastq.gz']],
        ['2a_S115', ['2a_S115_L001_R1_001.fastq.gz', '2a_S115_L001_R2_001.fastq.gz']]
    ]
}
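The custom config can then be passed to the run with Nextflow's -c option (assuming the file is saved as test.config in the run directory; that filename is our choice, not something the pipeline requires):
nextflow run nf-core/ampliseq -profile test,singularity -c test.config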
It worked! So the underlying problem looks like a Nextflow connectivity issue when fetching remote files, at least on the HPC.
Run Nextflow Tower.
On to testing Mahsa's data:
Downloaded the SILVA 132 database from: https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip
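For the record, the download can be scripted the same way as the test data (run from the directory you want the zip to live in; the zip is later passed to reference_database as-is):
# fetch the SILVA 132 QIIME-compatible release
wget https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip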
Nextflow.config params:
params {
    max_cpus = 32
    max_memory = 512.GB
    max_time = 48.h
    input = "/work/rumen_16S/nextflow/16S/fastq"
    extension = "/*.fastq.gz"
    FW_primer = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG"
    RV_primer = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC"
    metadata = "metadata.txt"
    reference_database = "/home/whatmorp/Annette/16S/Silva_132_release.zip"
    retain_untrimmed = true
}
reference_database points to the location where the SILVA database was originally downloaded, and input points to the directory holding Mahsa's fastq files.
The metadata.txt and manifest.txt files have already been created (need to add details here; a placeholder sketch of the manifest layout is shown below).
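As a placeholder until the real details are added, a hypothetical manifest.txt layout (tab-separated; the sampleID/forwardReads/reverseReads column names are what we understand the ampliseq 1.x --manifest option to expect, so double-check against the pipeline docs; the sample names and file names below are made up, not Mahsa's real files, with only the directory taken from the input param above):
sampleID	forwardReads	reverseReads
sampleA	/work/rumen_16S/nextflow/16S/fastq/sampleA_R1.fastq.gz	/work/rumen_16S/nextflow/16S/fastq/sampleA_R2.fastq.gz
sampleB	/work/rumen_16S/nextflow/16S/fastq/sampleB_R1.fastq.gz	/work/rumen_16S/nextflow/16S/fastq/sampleB_R2.fastq.gz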
Running with
nextflow run nf-core/ampliseq -profile singularity --manifest manifest.txt
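Given the intermittent connectivity failures seen earlier, it is probably worth launching this inside tmux and, after any failure, re-running the same command with Nextflow's -resume flag so completed steps are not repeated:
nextflow run nf-core/ampliseq -profile singularity --manifest manifest.txt -resume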
Q for Craig:
Do we need to add any of this to the ~/.nextflow/config file? Perhaps just for Tower?
process {
    executor = 'pbspro'
    scratch = 'true'
    beforeScript = {
        """
        mkdir -p /data1/whatmorp/singularity/mnt/session
        source $HOME/.bashrc
        source $HOME/.profile
        """
    }
}
singularity {
    cacheDir = '/home/whatmorp/NXF_SINGULARITY_CACHEDIR'
    autoMounts = true
}
conda {
    cacheDir = '/home/whatmorp/NXF_CONDA_CACHEDIR'
}
tower {
    accessToken = 'c7f8cc62b24155c0150a6ce4b6db15946bfc19ef'
    endpoint = 'https://nftower.qut.edu.au/api'
    enabled = true
}
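As an alternative to hard-coding the cache directory and access token, Nextflow also picks these up from environment variables (NXF_SINGULARITY_CACHEDIR and TOWER_ACCESS_TOKEN are the standard variable names; the path below is just the one from the config above), which keeps the token out of a shared config file:
# e.g. in ~/.bashrc
export NXF_SINGULARITY_CACHEDIR=/home/whatmorp/NXF_SINGULARITY_CACHEDIR
export TOWER_ACCESS_TOKEN=<your Tower token>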
Installing NextFlow
Follow the eResearch wiki entry for installing NextFlow:
Analysis overview
Pipeline overview
This section provides a skeleton overview of the pipeline commands. If you’ve run this pipeline before and understand the following sections, you can simply run these commands on your data. If this is your first time running this analysis, it’s strongly recommended that you read the following sections of this article, as they provide important background information.
Running the analysis on QUT’s HPC
Nextflow is designed to run on a Linux server. It makes use of server clusters and batch systems such as PBS to split analysis into multiple jobs, vastly streamlining and speeding up the analysis. To get the most out of Nextflow you should run it on QUT’s high performance cluster (HPC).
Accessing the HPC
QUT staff and HDR students can run Nextflow on QUT’s high performance computing cluster.
If you haven’t been set up on, or haven’t used, the HPC previously, click on this link for information on how to get access to and use the HPC:
Need a link here for HPC access and usage
Creating a shared workspace on the HPC
If you already have access to the HPC, we strongly recommend that you request a shared directory to store and analyse your data. This is so that you, relevant members of your research team (e.g. your HPC supervisor), and eResearch bioinformaticians can access your data and assist with the analysis.
To request a shared directory, submit a request to HPC support, telling them what you want the directory to be called (e.g. your_name_16S) and who you want to access it.
https://eresearchqut.atlassian.net/servicedesk/customer/portals
Running your analysis using the Portable Batch System (PBS)
The HPC has multiple ‘nodes’ available, each of which is allocated a certain amount of RAM and number of CPU cores.
You should run your analysis on one of these nodes, with a suitable amount of RAM and number of cores for your needs. When you log on to the HPC you are automatically placed on the ‘head node’.
Do not run any of your analysis on the head node. Always request a node through the PBS.
To request a node using PBS, submit a shell script containing your RAM/CPU/analysis time requirements and the code needed to run your analysis. For an overview of submitting a PBS job, see here:
Need a link here for creating PBS jobs
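In the meantime, a minimal PBS submission script sketch (the job name, resource values, and the nextflow command are placeholders; adjust them for your analysis):
#!/bin/bash
#PBS -N ampliseq
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=4:mem=8gb

# run from the directory the job was submitted from
cd $PBS_O_WORKDIR
module load java
nextflow run nf-core/ampliseq -profile singularity --manifest manifest.txt
Submit it with qsub (e.g. qsub run_ampliseq.pbs, where run_ampliseq.pbs is whatever you named the script).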
Alternatively, you can start up an ‘interactive’ node, using the following:
qsub -I -S /bin/bash -l walltime=72:00:00 -l select=1:ncpus=4:mem=8gb
This asks for a node with 4 CPUs and 8 GB of memory that will run for up to 72 hours. This may seem inadequate for a full analysis, but Nextflow will automatically submit its own PBS jobs and allocate them sufficient resources for each analysis step.
Note that when you request a node, you are put into a queue until a node with those specifications becomes available. Depending on how many people are using the HPC and how many resources they have requested, this may take a while, in which case you just need to wait until the job starts (“waiting for job xxxx.pbs to start”).
Once the interactive PBS job begins (you get a command prompt) you can run the following analysis: