nf-core/bactmap: A mapping-based pipeline for creating a phylogeny from bacterial whole genome sequences

This page provides a guide for QUT users on how to install and run the Nextflow nf-core/bactmap workflow on the HPC.

Pre-requisites

Basic Unix command-line knowledge (examples: https://researchcomputing.princeton.edu/education/external-online-resources/linux ; https://swcarpentry.github.io/shell-novice/ )
Familiarity with at least one Unix text editor (for example, Vi/Vim or Nano):
- VIM ( https://bioinformatics.uconn.edu/vim-guide/ ; https://missing.csail.mit.edu/2020/editors/)
- Nano (https://engineering.purdue.edu/ECN/Support/KB/Docs/BasictutorialforNanou ; https://www.howtogeek.com/howto/42980/the-beginners-guide-to-nano-the-linux-command-line-text-editor/ )
An HPC account on QUT’s Lyra. Apply for a new HPC account here.

Install Nextflow

The nf-core/bactmap workflow requires Nextflow to be installed in your account on the HPC. Find details here on how to install and test Nextflow, prepare a nextflow.config file, and run a PBS Pro submission script for Nextflow pipelines.

Additional information available here: https://nf-co.re/usage/installation

Additional details on the workflow can be found at:

Overview: https://nf-co.re/bactmap/1.0.0

Usage: https://nf-co.re/bactmap/1.0.0/usage

Interactive session on the HPC

qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=2:mem=4gb

SRA Toolkit

Use a Singularity container to fetch public data on the HPC:

1. One file at a time:

singularity run docker://ncbi/sra-tools:latest prefetch SRR1198667

singularity run docker://ncbi/sra-tools:latest fastq-dump -X 1000000 -I --split-files SRR1198667

2. Use a list:

singularity run docker://ncbi/sra-tools:latest prefetch --option-file SraAccList.txt

for acc in $(cat SraAccList.txt); do singularity run docker://ncbi/sra-tools:latest fastq-dump -X 1000000 -I --split-files $acc; done

Compress the FASTQ files:

gzip -c filename.fastq > filename.fastq.gz

Batch:

for file in *.fastq; do echo "$file"; gzip -c "$file" > "${file}.gz"; done
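The batch compression step can also be wrapped in a small function that verifies each archive after writing it. This is only a sketch: the function name compress_fastqs is hypothetical, and it assumes the uncompressed files end in .fastq and sit in a single directory.

```shell
#!/bin/bash
# Sketch: compress every .fastq file in a directory and verify each archive.
# compress_fastqs is an illustrative name, not part of any toolkit.
compress_fastqs() {
    local dir="${1:-.}"
    local file
    for file in "$dir"/*.fastq; do
        # If nothing matches, the glob stays literal, so skip it.
        [ -e "$file" ] || continue
        gzip -c "$file" > "${file}.gz"
        # gzip -t tests the integrity of the compressed file.
        if gzip -t "${file}.gz"; then
            echo "OK ${file}.gz"
        else
            echo "FAILED ${file}.gz" >&2
            return 1
        fi
    done
}
```

After sourcing the function, calling compress_fastqs with a directory name (or with no argument for the current directory) leaves the original .fastq files in place next to the new .fastq.gz files.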

Getting Started

Download and run the workflow using the minimal test data set provided by nf-core/bactmap. We recommend using singularity as the profile on QUT’s HPC. Note: the ‘docker’ profile is not available on the HPC.

nextflow run nf-core/bactmap -profile test,singularity

Note: at this time, the test profile will fail to run

Running the test - create a 'launch.pbs' script:

#!/bin/bash -l
#PBS -N nf-bactmap
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=1:mem=5gb
cd $PBS_O_WORKDIR
export NXF_OPTS='-Xms1g -Xmx4g'

module load java

#run test for bactmap
nextflow run nf-core/bactmap -profile test,singularity

Submit the job:

qsub launch.pbs

Check the job:

qjobs

Running the pipeline using custom data

Example of a typical command to run a Bactmap analysis:

  nextflow run nf-core/bactmap \
    --input samplesheet.csv \
    --reference chromosome.fasta \
    -profile singularity

Note: if a run was interrupted, did not complete a particular step, or you want to modify a parameter for a particular step, you do not need to re-run every process. Add the “-resume” option and Nextflow will reuse the cached results of the steps that already completed:

nextflow run nf-core/bactmap \
    --input samplesheet.csv \
    --reference chromosome.fasta \
    -profile singularity \
    -resume

Preparing a ‘samplesheet.csv’ file

Prepare a samplesheet.csv file containing the information for the samples to be processed. See the example samplesheet.csv file below.

Example samplesheet.csv:

sample,fastq_1,fastq_2
G18582004,fastqs/G18582004_1.fastq.gz,fastqs/G18582004_2.fastq.gz
G18756254,fastqs/G18756254_1.fastq.gz,fastqs/G18756254_2.fastq.gz
G18582006,fastqs/G18582006_1.fastq.gz,fastqs/G18582006_2.fastq.gz

When specifying the path to the data files, it is safer to use absolute paths rather than relative paths, because absolute paths resolve correctly regardless of the directory the job is launched from.
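A samplesheet with absolute paths can be generated from a folder of paired reads rather than typed by hand. The sketch below assumes the files follow the <sample>_1.fastq.gz / <sample>_2.fastq.gz naming used in the example above; the function name make_samplesheet is hypothetical, and samples without a mate file are skipped with a warning.

```shell
#!/bin/bash
# Sketch: print a bactmap-style samplesheet for all read pairs in a directory.
# make_samplesheet is an illustrative name; the _1/_2 suffixes are an assumption.
make_samplesheet() {
    local fastq_dir r1 r2 sample
    # Resolve the directory to an absolute path.
    fastq_dir="$(cd "$1" && pwd)" || return 1
    echo "sample,fastq_1,fastq_2"
    for r1 in "$fastq_dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue
        r2="${r1%_1.fastq.gz}_2.fastq.gz"
        sample="$(basename "$r1" _1.fastq.gz)"
        if [ -e "$r2" ]; then
            echo "${sample},${r1},${r2}"
        else
            echo "warning: no mate file for ${r1}, skipping" >&2
        fi
    done
}
```

For example, running make_samplesheet fastqs > samplesheet.csv writes one row per read pair found in the fastqs folder.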

Creating the samplesheet.csv file using Excel can add Windows line endings (carriage returns) and other hidden characters. Run the following command to remove them:

dos2unix samplesheet.csv
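Before submitting, it can help to check the samplesheet for the problems described above. The following is a minimal sketch, assuming the three-column header shown in the example; the function name check_samplesheet is hypothetical and the checks are not the pipeline's own validation.

```shell
#!/bin/bash
# Sketch: basic sanity checks for a samplesheet (illustrative, not the pipeline's validator).
check_samplesheet() {
    local sheet="$1" status=0 cr
    cr=$(printf '\r')
    # 1. Windows line endings left behind by Excel.
    if grep -q "$cr" "$sheet"; then
        echo "error: carriage returns found; run dos2unix $sheet" >&2
        status=1
    fi
    # 2. Header must match the expected column names.
    if [ "$(head -n 1 "$sheet" | tr -d '\r')" != "sample,fastq_1,fastq_2" ]; then
        echo "error: unexpected header in $sheet" >&2
        status=1
    fi
    # 3. Every listed FASTQ file must exist.
    if ! tr -d '\r' < "$sheet" | tail -n +2 | {
        ok=0
        while IFS=, read -r sample fq1 fq2 || [ -n "$sample" ]; do
            for f in "$fq1" "$fq2"; do
                if [ -n "$f" ] && [ ! -e "$f" ]; then
                    echo "error: missing file $f (sample $sample)" >&2
                    ok=1
                fi
            done
        done
        [ "$ok" -eq 0 ]
    }; then
        status=1
    fi
    return "$status"
}
```

The function returns a non-zero status if any check fails, so it can be chained in front of the submission command.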

Preparing to run on the HPC

To run the pipeline on the HPC, create a PBS submission script using a text editor. For example, create a file called launch.pbs using a text editor of your choice (e.g., vi or nano), then copy and paste the code below:

#!/bin/bash -l
#PBS -N bactmap01
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR
module load java
export NXF_OPTS='-Xms1g -Xmx4g'

nextflow run nf-core/bactmap \
    --input samplesheet.csv \
    --reference chromosome.fasta \
    -profile singularity

We recommend running the nf-core/bactmap pipeline once and then assessing the results folder. Then, we can use the PBS script below to ...

#!/bin/bash -l
#PBS -N bactmap02
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR
module load java
export NXF_OPTS='-Xms1g -Xmx4g'

# Optional steps enabled below (comments cannot follow a line-continuation backslash):
#   --trim                  trim reads
#   --remove_recombination  remove recombination using Gubbins
#   --rapidnj               build a RapidNJ tree
#   --fasttree              build a FastTree tree
#   --iqtree                build an IQ-TREE tree
#   --raxmlng               build a RAxML-NG tree
nextflow run nf-core/bactmap \
    --input samplesheet.csv \
    --reference chromosome.fasta \
    -profile singularity \
    --trim \
    --remove_recombination \
    --rapidnj \
    --fasttree \
    --iqtree \
    --raxmlng

Note: The options to the bactmap pipeline can be placed in a nextflow.config file instead.
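As an illustration of the config-file approach, the sketch below writes the optional flags from the script above into a separate parameters file. The file name bactmap_params.config and the function name write_bactmap_config are hypothetical; a separate file is used here so that an existing nextflow.config is not overwritten.

```shell
#!/bin/bash
# Sketch: write the optional bactmap flags into a Nextflow config file.
# write_bactmap_config and the default file name are illustrative choices.
write_bactmap_config() {
    local out="${1:-bactmap_params.config}"
    cat > "$out" <<'EOF'
// Each entry corresponds to the command-line flag of the same name
params {
    trim                 = true
    remove_recombination = true
    rapidnj              = true
    fasttree             = true
    iqtree               = true
    raxmlng              = true
}
EOF
}
```

The resulting file can then be passed to the run with the -c option, for example by adding -c bactmap_params.config to the nextflow run command in place of the individual flags.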

Submitting the job

Once you have created the folder for the run, the samplesheet.csv file, nextflow.config (optional) and launch.pbs you are ready to submit.
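A quick check that the expected files are present can catch a missing input before the job enters the queue. This is a minimal sketch; the function name preflight is hypothetical and it only tests that each file exists and is not empty.

```shell
#!/bin/bash
# Sketch: confirm that each named file exists and is non-empty.
# preflight is an illustrative name.
preflight() {
    local missing=0 f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, preflight samplesheet.csv chromosome.fasta launch.pbs returns a non-zero status and names the offending file if any of the three is absent.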

Submit the run with this command (on Lyra):

qsub launch.pbs

Monitoring the Run

To check on the jobs you are running, use the command:

qstat -u $USER

Alternatively, use the following command:

qjobs

Nextflow will launch additional jobs during the run.

You can also check the .nextflow.log file for details on what is going on.
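To pull only the warnings and errors out of the log, a small filter such as the one below may be useful. It is a sketch that assumes the log level appears as a separate word (WARN or ERROR) on each line; the function name log_problems is hypothetical.

```shell
#!/bin/bash
# Sketch: show the most recent WARN/ERROR lines from a Nextflow log.
# log_problems is an illustrative name; the level-as-word format is an assumption.
log_problems() {
    local log="${1:-.nextflow.log}"
    local count="${2:-20}"
    grep -E ' (ERROR|WARN) ' "$log" | tail -n "$count"
}
```

Called without arguments in the run folder, it reads .nextflow.log and prints up to the last 20 matching lines.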

Finally, if you have configured the connection to Nextflow Tower, you can log on and check your run.