
Overview

Nextflow is a pipeline engine that can take advantage of the batch nature of the HPC environment to run bioinformatics workflows quickly and efficiently.

For more information about Nextflow, please visit Nextflow - A DSL for parallel and scalable computational pipelines

The ampliseq pipeline is a bioinformatics analysis pipeline for 16S rRNA amplicon sequencing data. It combines multiple 16S analysis tools in a single pipeline to produce a variety of statistical and quantitative outputs that can be used in further analysis or in publications.

https://nf-co.re/ampliseq

https://github.com/nf-core/ampliseq

Installing NextFlow on your HPC account

NextFlow needs to be set up locally for each user account on the HPC. Instructions for installing and setting up NextFlow for your account are here:

Nextflow

Follow the instructions in the above link. Once you have successfully run the Nextflow test (nextflow run hello prints Hello world! and similar greetings), run a test of the ampliseq pipeline (next section).
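If you want to repeat that quick check at any point, the test pipeline can simply be run again (ideally from an interactive PBS session - see the next section):

Code Block
nextflow run hello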

Running your NextFlow pipelines

NextFlow should never be run on the head node on the HPC (i.e. the node you are automatically logged on to) but should instead be run on a different node using the PBS job scheduler. For instructions on submitting PBS jobs, see here:

Running PBS jobs on the HPC Confluence page
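For reference, a PBS job script that launches a Nextflow pipeline can be as simple as the sketch below (the job name, resource requests and the script name launch_ampliseq.pbs are placeholders; the pipeline command shown is the ampliseq test command used later in this guide):

Code Block
#!/bin/bash
#PBS -N ampliseq_test
#PBS -l select=1:ncpus=4:mem=8gb
#PBS -l walltime=24:00:00

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# Nextflow needs Java, then it launches and schedules the pipeline
module load java
nextflow run nf-core/ampliseq -profile test,singularity --metadata "Metadata.tsv"

Submit it with qsub launch_ampliseq.pbs.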

Directory structure

When a NextFlow pipeline is run, it generates multiple directories and output files. We therefore recommend you create a directory where you run all your NextFlow pipelines, so that you don’t have output directories and files scattered across your home directory.

The commands below create a ‘nextflow’ subdirectory in your home directory. You can then create individual subdirectories for each pipeline you run (e.g. cd nextflow, then mkdir ampliseq_test).

Code Block
cd ~
mkdir nextflow

As an alternative to submitting a PBS job script, you can run Nextflow in an interactive PBS session.

Run tmux first, so that your session keeps running if you log off or lose your connection:

Code Block
tmux
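If your connection drops or you log out, you can reconnect to the same tmux session later (standard tmux commands; session numbers may differ):

Code Block
# List existing tmux sessions
tmux ls
# Re-attach to your most recent session
tmux attach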

Interactive PBS session:

Code Block
qsub -I -S /bin/bash -l walltime=168:00:00 -l select=1:ncpus=4:mem=8gb

Then you can run all the following code.

Run an ampliseq test

This test is to see if NextFlow is installed correctly on your account and if you can run ampliseq. It uses a small built-in dataset to run ampliseq, running a full analysis and producing all the output directories and files.

If this test run fails, contact us at eResearch - eresearch@qut.edu.au - and we can examine the issue.

Note that running the test on the HPC requires some additional steps beyond those listed on the ampliseq website. Primarily, there are issues with ampliseq automatically downloading external files, so we need to download these files locally, then change the ampliseq config file to point to the downloaded copies.

Downloading test files

There are 3 files, or sets of files, that need to be downloaded first. The command to download each of these is included below.

First, go to your nextflow directory (we are assuming you called it ‘nextflow’; see the ‘Running your NextFlow pipelines’ section above for directory structure suggestions) and create a subdirectory called ‘ampliseq_test’:

Code Block
cd ~/nextflow
mkdir ampliseq_test
cd ampliseq_test

Download the 3 required datafile sets:

1. The metadata file.

Code Block
wget https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/Metadata.tsv

2. The taxonomic classifier file.

Code Block
wget https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/GTGYCAGCMGCCGCGGTAA-GGACTACNVGGGTWTCTAAT-gg_13_8-85-qiime2_2019.7-classifier.qza

3. The test datafiles (fastq files). Note that the first command below is a very long single line - it simply writes all the datafile download URLs to a text file, which is then used by the wget command below it.

Code Block
printf 'https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1_S103_L001_R1_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1_S103_L001_R2_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1a_S103_L001_R1_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1a_S103_L001_R2_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2_S115_L001_R1_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2a_S115_L001_R2_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2_S115_L001_R2_001.fastq.gz\nhttps://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2a_S115_L001_R1_001.fastq.gz' > datafiles.txt
wget -i datafiles.txt
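After the downloads finish, it is worth checking that all eight fastq files listed in datafiles.txt have arrived:

Code Block
ls -lh *.fastq.gz
ls *.fastq.gz | wc -l    # should report 8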

Ampliseq test config file

NextFlow pipelines have a series of default settings, which can be overridden by modifying a config file. The default config file for ampliseq points to downloadable datafile locations. As we’ve downloaded the datafiles locally, we need to modify the config file to point to these local files instead.

Make sure you are in the directory you created for this test (containing the downloaded test datafiles) - cd ~/nextflow/ampliseq_test

Load a text editor first. Nano will do.

Code Block
module load nano

Create and edit a ‘nextflow.config’ file

Code Block
nano nextflow.config

This will create and open an empty text file in nano. Into this file copy and paste the following:

Code Block
params {
  classifier = "GTGYCAGCMGCCGCGGTAA-GGACTACNVGGGTWTCTAAT-gg_13_8-85-qiime2_2019.7-classifier.qza"
  metadata = "Metadata.tsv"
  readPaths = [
    ['1_S103', ['1_S103_L001_R1_001.fastq.gz', '1_S103_L001_R2_001.fastq.gz']],
    ['1a_S103', ['1a_S103_L001_R1_001.fastq.gz', '1a_S103_L001_R2_001.fastq.gz']],
    ['2_S115', ['2_S115_L001_R1_001.fastq.gz', '2_S115_L001_R2_001.fastq.gz']],
    ['2a_S115', ['2a_S115_L001_R1_001.fastq.gz', '2a_S115_L001_R2_001.fastq.gz']]
  ]
}

In nano you can then save the file with Ctrl+O (then Enter to confirm) and exit nano with Ctrl+X.

Ampliseq test command

Run the following command to test the ampliseq pipeline.

Code Block
nextflow run nf-core/ampliseq -profile test,singularity --metadata "Metadata.tsv"

This will submit multiple PBS jobs, one for each test file and analysis step. How fast it runs depends on how busy the HPC queue is. If there is no queue (rare), the test run will finish in approximately 30 minutes. Regardless, it does not need to be actively monitored - you can check later to see whether it succeeded or failed. If it succeeded (you will see a ‘Pipeline completed successfully’ message), continue with the analysis of your dataset. If it failed (there are multiple potential error messages), contact us at eResearch and we will work through the issue: eresearch@qut.edu.au.
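To see the PBS jobs that NextFlow has submitted on your behalf, you can query the queue with the standard PBS command below (job names will vary):

Code Block
qstat -u $USER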

You should see several output directories and files have been created in your ‘ampliseq_test’ directory. These contain the test analysis results. Have a look through these, as they are similar to the output from a full ampliseq run (i.e. on your dataset).


Ampliseq output

As can be seen in the test results (see the section above), ampliseq produces a ‘results’ directory with several subdirectories, which contain the various analysis outputs. These are outlined here:

https://nf-co.re/ampliseq/1.1.3/output

Briefly, these include quality control and a variety of taxonomic diversity and abundance analyses, using QIIME2 and DADA2 as core analysis tools.

These directories contain various tables and figures that can be used in either downstream analysis or directly in publications.

Running ampliseq on your dataset

In this section we will focus primarily on the commands and files you need to run the pipeline on your data. A complete description of the ampliseq pipeline is available on the ampliseq websites below; to properly understand the ampliseq processes and analysis outputs, it is advisable to read through these thoroughly.

https://nf-co.re/ampliseq

https://github.com/nf-core/ampliseq

Ampliseq requires the creation of some additional files and the modification of a parameters (config) file in order to run on your dataset. Instructions are below.

Directory structure and files

Make sure you have created a subdirectory in your ‘nextflow’ directory. Give it a meaningful name (e.g. mkdir <yourprojectname>_nextflow) and make sure you are in that directory (cd ~/nextflow/<yourprojectname>_nextflow).
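For example, assuming a hypothetical project called ‘myproject’:

Code Block
cd ~/nextflow
mkdir myproject_nextflow
cd myproject_nextflow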

As with the test run, you will need to download some datafiles and create some new files (manifest file, metadata file, nextflow.config file) to get ampliseq running on the HPC.

NOTE: be very careful about the naming and structure of these files. Sample IDs in the manifest and metadata files must match exactly, and the file paths need to be correct. Columns must be named exactly as in the examples below (including case). A spelling error, a stray comma or another unexpected character in these files is one of the more common reasons for ampliseq to fail.

Taxonomic database

Download the SILVA database. This is the main database ampliseq uses for taxonomic classification.

Code Block
wget https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip

Manifest file

In the test run, a list of your filenames and the associated sample ID was included in the nextflow.config file. With many sample files it’s much easier to include this information in a separate manifest file.

This is a tab delimited file that contains 3 columns:

  1. ‘sampleID’, with the sample IDs. You can call these whatever you like, but they should be meaningful (e.g. groupA_1, groupA_2, groupB_1, groupB_2, etc.)

  2. ‘forwardReads’. The full path for the forward reads. e.g. /home/myproject/fastq/sample1_S22_L001_R1.fastq.gz

  3. ‘reverseReads’. The full path for the reverse reads. e.g. /home/myproject/fastq/sample1_S22_L001_R2.fastq.gz

This file can be created with Excel and then saved as a tab-delimited file (File → Save as → ‘manifest.txt’ → Text (Tab delimited)), then copied across to your NextFlow project directory (using WinSCP, Cyberduck, etc). Example:

sampleID    forwardReads                                          reverseReads
groupA_1    /home/myproject/fastq/sample1_S22_L001_R1.fastq.gz    /home/myproject/fastq/sample1_S22_L001_R2.fastq.gz
groupA_2    /home/myproject/fastq/sample2_S23_L001_R1.fastq.gz    /home/myproject/fastq/sample2_S23_L001_R2.fastq.gz
etc…

Creating a manifest file at the command line

As mentioned above, spelling mistakes or extra characters in the file paths will cause ampliseq to fail. One way to avoid this is to generate the manifest file on the command line using the tools awk and sed.

Below is an example of how to generate the manifest file. You may need to modify this, depending on how your files are named.

To create the manifest using awk, paste and sed:

  1. List all the fastq files in the directory (both read pairs)

Code Block
ls *_R1*.fastq.gz  -lh | awk '{print $9}' > read1

ls *_R2*.fastq.gz  -lh | awk '{print $9}' > read2

2. List the sample IDs. If the sample names are part of the fastq file names, they can be extracted using sed. For example:

Code Block
cat read1 | sed 's/_S.*//' > ID

The sample file names in this case look like this: ‘Raw8h_S10_L001_R1_001.fastq.gz’

The sample ID is ‘Raw8h’. The above sed command removes ‘_S’ and everything after it, leaving just the ID. Depending on how your sample files are named, you can create a list of your sample IDs by modifying the above sed command.
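As another illustration, if your files were instead named like ‘sample1-groupA_L001_R1.fastq.gz’ (a hypothetical naming scheme), removing ‘_L001’ and everything after it would leave the sample ID:

Code Block
cat read1 | sed 's/_L001.*//' > ID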

3. Paste these together, prepending the full path to your fastq directory and separating the columns with tabs. Output as ‘manifest.txt’.

Code Block
paste ID read1 read2 | awk '{print $1 "\t" "/path/to/your/nextflow/myproject/fastq/" $2 "\t" "/path/to/your/nextflow/myproject/fastq/" $3}' > manifest.txt

Make sure you then manually add the 3 column names as a header row: ‘sampleID’, ‘forwardReads’ and ‘reverseReads’ (e.g. use a text editor like nano, or download the file, modify it in Excel and re-upload it).
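If you would rather add the header row at the command line, a one-liner like the following should work (it prepends a tab-separated header line to the existing file):

Code Block
printf 'sampleID\tforwardReads\treverseReads\n' | cat - manifest.txt > manifest.tmp && mv manifest.tmp manifest.txt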

Finally, copy the created manifest.txt to the directory where you will be running ampliseq from.

Metadata file

This is a tab-separated values file (.tsv) that is required by QIIME2 to compare taxonomic diversity with phenotype (e.g. how diversity varies per experimental treatment). It contains the same sample IDs found in the manifest file and a column for each category of metadata you have for the samples. This may include sequence barcodes, experimental treatment group (e.g. high fat vs low fat) and any other measurements taken, such as age, date collected, tissue type, sex, collection location, weight and length. QIIME2 will compare every metadata column with the taxonomic results, then calculate and plot correlations and diversity indices. See here for more details:

https://docs.qiime2.org/2019.10/tutorials/metadata/

To get a better idea of the structure and format of the metadata file, you can download an example file from here:

https://data.qiime2.org/2019.10/tutorials/moving-pictures/sample_metadata.tsv

This file can also be created in Excel and saved as a tab-delimited file. The file can be given any name, but the .tsv extension is the convention (File → Save as → ‘metadata.tsv’ → Text (Tab delimited)).
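Purely as an illustration of the layout (the sample IDs match the manifest example above, while the ‘treatment’ and ‘collection_date’ columns and their values are hypothetical placeholders; in the real file the columns must be separated by tabs, and the exact name required for the identifier column is described in the QIIME2 documentation linked above):

Code Block
ID          treatment    collection_date
groupA_1    highFat      2021-01-15
groupA_2    highFat      2021-01-15
groupB_1    lowFat       2021-01-16
groupB_2    lowFat       2021-01-16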

nextflow.config file

Create a custom nextflow.config file

Code Block
module load nano
nano nextflow.config

Edit the newly created ‘nextflow.config’ file to contain something like the following:

Code Block
params {
    max_cpus=32
    max_memory=512.GB
    max_time = 48.h
    FW_primer = "CCTACGGGNGGCWGCAG"
    RV_primer = "GACTACHVGGGTATCTAATCC"
    metadata = "metadata.txt"
    manifest = "manifest.txt"
    reference_database = "Silva_132_release.zip"
    retain_untrimmed = true
}

NOTE: This is an example nextflow.config file. Don’t simply copy and paste the above. You’ll need to modify it to reflect the primers you used to generate your sequences (the FW_primer and RV_primer values).

The remaining lines can stay the same, presuming that your metadata and manifest files are named as above (adjust the metadata line if you saved your metadata file as metadata.tsv) and that all the files are in the directory you will be running ampliseq from.
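Before launching the pipeline, a quick check that everything it needs is in the current directory does no harm (the file names here assume the examples above):

Code Block
ls -lh nextflow.config manifest.txt metadata.txt Silva_132_release.zip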

Running NextFlow’s ampliseq pipeline

Make sure Java is loaded and that you are working in an interactive PBS session (or a submitted PBS job), then change into your project directory:

Code Block
qsub -I -S /bin/bash -l walltime=72:00:00 -l select=1:ncpus=4:mem=8gb
module load java
cd ~/nextflow/<yourprojectname>_nextflow

As an example, a nextflow.config for a real 16S run might look like the following. Here the FW_primer and RV_primer sequences include the Illumina adapter overhangs, and reference_database points to wherever you downloaded Silva_132_release.zip (adjust this path for your own account):

Code Block
params {
    max_cpus = 32
    max_memory = 512.GB
    max_time = 48.h
    FW_primer = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG"
    RV_primer = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC"
    metadata = "metadata.txt"
    reference_database = "/path/to/your/Silva_132_release.zip"
    retain_untrimmed = true
}

This assumes your metadata.txt and manifest.txt files have already been created (see the sections above) and that the paths in the manifest point to your fastq files directory.

Then run the pipeline, pointing it at your manifest file:

Code Block
nextflow run nf-core/ampliseq -profile singularity --manifest manifest.txt

Running the full pipeline may take a day or so, depending on how busy the HPC is. If ampliseq fails (you will see the error message in red in your console), it is worth resuming the run a couple of times by adding -resume to the end of the command, as transitory errors do sometimes occur. If ampliseq continues to fail after a couple of resumes, contact us at eResearch support: eresearch@qut.edu.au

Known issue: the taxonomic classification step takes a very long time and often fails.

Solution: train the classifier manually with QIIME2, then point the classifier parameter in your nextflow.config at the resulting file. See the QIIME2 feature-classifier tutorial for background:

https://docs.qiime2.org/2020.11/tutorials/feature-classifier/

This is the script that NextFlow runs for the classifier-training step (reformatted here for readability):

Code Block
export HOME="${PWD}/HOME"

unzip -qq Silva_132_release.zip

fasta="SILVA_132_QIIME_release/rep_set/rep_set_16S_only/99/silva_132_99_16S.fna"
taxonomy="SILVA_132_QIIME_release/taxonomy/16S_only/99/consensus_taxonomy_7_levels.txt"

if [ "false" = "true" ]; then
    sed 's/#//g' $taxonomy > taxonomy-99_removeHash.txt
    taxonomy="taxonomy-99_removeHash.txt"
    echo "######## WARNING! The taxonomy file was altered by removing all hash signs!"
fi

### Import
qiime tools import --type 'FeatureData[Sequence]' \
    --input-path $fasta \
    --output-path ref-seq-99.qza
qiime tools import --type 'FeatureData[Taxonomy]' \
    --input-format HeaderlessTSVTaxonomyFormat \
    --input-path $taxonomy \
    --output-path ref-taxonomy-99.qza

#Extract sequences based on primers
qiime feature-classifier extract-reads \
    --i-sequences ref-seq-99.qza \
    --p-f-primer TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG \
    --p-r-primer GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC \
    --o-reads TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC-99-ref-seq.qza \
    --quiet

#Train classifier
qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC-99-ref-seq.qza \
    --i-reference-taxonomy ref-taxonomy-99.qza \
    --o-classifier TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC-99-classifier.qza \
    --quiet

The input fasta and taxonomy files that the above script points to are contained in the reference database archive, Silva_132_release.zip.

To generate the classifier file manually, install QIIME2 in a conda environment, start an interactive PBS session with plenty of memory (classifier training is very memory hungry), unzip the SILVA archive, and then run the import, extract-reads and fit-classifier-naive-bayes commands from the script above.
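The setup steps below are a sketch based on the QIIME2 2019.10 native conda installation (https://docs.qiime2.org/2019.10/install/native/); the interactive session resources shown are only an example, and a different QIIME2 version may better suit your data:

Code Block
# Install QIIME2 2019.10 into a conda environment
wget https://data.qiime2.org/distro/core/qiime2-2019.10-py36-linux-conda.yml
conda env create -n qiime2-2019.10 --file qiime2-2019.10-py36-linux-conda.yml
conda activate qiime2-2019.10

# Test the QIIME2 installation
qiime --help

# Make sure you are in an interactive PBS session with sufficient RAM
qsub -I -S /bin/bash -l walltime=72:00:00 -l select=1:ncpus=16:mem=256gb

# Unzip Silva_132_release.zip to access the required database files
unzip Silva_132_release.zip

Once the classifier .qza file has been created, point the classifier parameter in your nextflow.config at it (e.g. classifier = "/path/to/your-classifier.qza", with the path adjusted to the file you generated), so that ampliseq uses it rather than training its own.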

Running analysis on QUT’s HPC

NextFlow is designed to run on a Linux server. It makes use of server clusters and batch systems such as PBS to split an analysis into multiple jobs, vastly streamlining and speeding up analysis time. To get the most out of NextFlow you should run it on QUT’s high performance computing cluster (HPC).

Accessing the HPC

QUT staff and HDR students can run NextFlow on QUT’s high performance computing cluster.

If you have not used the HPC before, or do not yet have access, click on this link for information on how to get access to and use the HPC:

Need a link here for HPC access and usage 

Creating a shared workspace on the HPC

If you already have access to the HPC, we strongly recommend that you request a shared directory in which to store and analyse your data. This allows you, relevant members of your research team (e.g. your HPC supervisor) and eResearch bioinformaticians to access the data and assist with the analysis.

To request a shared directory, submit a request to HPC support, telling them what you want the directory to be called (e.g. your_name_16S) and who should have access to it.

https://eresearchqut.atlassian.net/servicedesk/customer/portals 

Running your analysis using the Portable Batch System (PBS)

The HPC has multiple ‘nodes’ available, each of which is allocated a certain amount of RAM and a number of CPU cores.

You should run your data on one of these nodes, with a suitable amount of RAM and cores for your analysis needs. When you log on to the HPC you are automatically placed on the ‘head node’.

Do not run any of your analysis on the head node. Always request a node through PBS.

To request a node using PBS, submit a shell script containing your RAM/CPU/analysis time requirements and the code needed to run your analysis. For an overview of submitting a PBS job, see here:

Need a link here for creating PBS jobs

Alternatively, you can start up an ‘interactive’ node, using the following:

Code Block
qsub -I -S /bin/bash -l walltime=72:00:00 -l select=1:ncpus=4:mem=8gb

This asks for a node with 4 CPUs and 8 GB of memory that will run for up to 72 hours. This may seem inadequate for a full analysis, but NextFlow will automatically submit its own PBS jobs, with sufficient memory, for each analysis step.
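That behaviour relies on Nextflow being configured to use the PBS Pro executor. If this has not already been set up for your account, a minimal sketch of the relevant setting in your ~/.nextflow/config is shown below (‘pbspro’ is Nextflow’s built-in PBS Pro executor; check with eResearch before changing your configuration):

Code Block
process {
    executor = 'pbspro'
}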

Note that when you request a node, you are put into a queue until a node with those specifications becomes available. Depending on how many people are using the HPC and how many resources they have requested, this may take a while, in which case you just need to wait until the job starts (“waiting for job xxxx.pbs to start”).

Once the interactive PBS job begins (i.e. you have a command prompt), you can run your analysis.



Results

Ampliseq generates multiple output directories in a main ‘results’ directory, including tables, figures and analysis results. See here for details:

https://nf-co.re/ampliseq/1.1.3/output

These results can be used for further downstream analysis - for example plotting the table of taxon abundance - in Excel, R or other programs.

An example of how to analyse the results in R is here:

Downstream analysis of NextFlow ampliseq output (16S amplicon analysis)