Overview

Nextflow is a pipeline engine that takes advantage of the batch nature of the HPC environment to run bioinformatics workflows quickly and efficiently.

...

https://github.com/nf-core/ampliseq

Installing NextFlow on your HPC account

NextFlow needs to be set up locally for each user account on the HPC. Instructions for installing and setting up NextFlow for your account are here:

Nextflow quick start

Follow the instructions at the above link. Once the NextFlow test runs successfully (‘nextflow run hello’ prints ‘Hello world!’, etc.), run a test of the ampliseq pipeline (next section).
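
Before running the test, it can help to confirm that the tools are actually on your PATH. The following is a minimal sketch, not part of the official quick start: require_cmd is a hypothetical helper name, and checking for java is an assumption based on the ‘module load java’ step mentioned later on this page.

```bash
# Report whether a command is available on PATH.
# Returns non-zero if the command is missing.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Usage (inside a PBS session, not on the head node):
#   require_cmd java && require_cmd nextflow && nextflow run hello
```

If either check reports a missing command, revisit the installation instructions before continuing.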

Running your NextFlow pipelines

NextFlow should never be run on the head node on the HPC (i.e. the node you are automatically logged on to) but should instead be run on a different node using the PBS job scheduler. For instructions on submitting PBS jobs, see here:

...

Note: the wiki page for running PBS jobs is in development. In the meantime, run an interactive PBS session, as described below in the ‘Alternative to submitting PBS job: interactive session’ section.

Directory structure

When a NextFlow pipeline is run, it generates multiple directories and output files. We therefore recommend you create a directory where you run all your NextFlow pipelines, so that you don’t have output directories and files scattered across your home directory.

...

Then you can run all the following code.

Run an ampliseq test

This test is to see if NextFlow is installed correctly on your account and if you can run ampliseq. It uses a small built-in dataset to run ampliseq, running a full analysis and producing all the output directories and files.

...

Note that the test run on the HPC requires some steps in addition to those listed on the ampliseq website. The main issue is that ampliseq has trouble downloading external files automatically, so we download them first, then change the ampliseq config file to point to the local copies.

Downloading test files

There are 3 files, or sets of files, that need to be downloaded first. The command to download each of these is included below.

...

Code Block
cat > datafiles.txt << 'EOF'
https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1_S103_L001_R1_001.fastq.gz
https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1_S103_L001_R2_001.fastq.gz
https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1a_S103_L001_R1_001.fastq.gz
https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/1a_S103_L001_R2_001.fastq.gz
https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2_S115_L001_R1_001.fastq.gz
https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2_S115_L001_R2_001.fastq.gz
https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2a_S115_L001_R1_001.fastq.gz
https://github.com/nf-core/test-datasets/raw/ampliseq/testdata/2a_S115_L001_R2_001.fastq.gz
EOF
wget -i datafiles.txt
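
Since a failed or partial download is easy to miss, a quick check that every file in datafiles.txt arrived may be worthwhile. This is a sketch only: check_downloads is a hypothetical helper, and it assumes wget saved each file under the last component of its URL in the current directory.

```bash
# Check that every URL listed in a file has a matching, non-empty local file.
# Prints each missing or empty file name; returns non-zero if any are missing.
check_downloads() {
    list=$1          # file with one URL per line, e.g. datafiles.txt
    dir=${2:-.}      # directory the files were downloaded into
    missing=0
    # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
    while IFS= read -r url || [ -n "$url" ]; do
        [ -n "$url" ] || continue
        name=$(basename "$url")
        if [ ! -s "$dir/$name" ]; then
            echo "missing or empty: $name"
            missing=1
        fi
    done < "$list"
    return "$missing"
}

# Usage:
#   check_downloads datafiles.txt && echo "all test files present"
```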

Ampliseq test config file

NextFlow pipelines have a series of default settings, which can be overridden by modifying a config file. The default config file for ampliseq points to downloadable datafile locations. As we’ve downloaded the datafiles locally, we need to modify the config file to point to these local files instead.

...

In nano, save the file with ‘ctrl o’ and then exit with ‘ctrl x’.

Ampliseq test command

Run the following command to test the ampliseq pipeline.

...

You should see several output directories and files have been created in your ‘ampliseq_test’ directory. These contain the test analysis results. Have a look through these, as they are similar to the output from a full ampliseq run (i.e. on your dataset).

Ampliseq output

As can be seen in the test results (see above section), ampliseq produces a ‘results’ directory with several subdirectories, which contain various analysis outputs. These are outlined here:

...

These directories contain various tables and figures that can be used in either downstream analysis or directly in publications.

Running ampliseq on your dataset

In this section we will focus primarily on the commands and files you need to run the pipeline on your data. A complete description of the ampliseq pipeline is on the ampliseq websites. To properly understand the ampliseq processes and analysis outputs, it is advisable that you thoroughly read through these.

...

To run on your dataset, ampliseq requires some additional files and changes to its parameter files. Instructions are below.

Directory structure and files

Make sure you have created a subdirectory in your ‘nextflow’ directory. Give it a meaningful name (e.g. mkdir <yourprojectname>_nextflow). Make sure you are in that directory (cd ~/nextflow/<yourprojectname>_nextflow).

...

NOTE: be very careful about the naming and structure of these files. Sample IDs in the manifest and metadata files must match exactly and the file paths need to be correct. Column names must be exactly as in the examples below (including case). Spelling errors, stray commas and other stray characters in these files are among the more common reasons for ampliseq to fail.

Taxonomic database

Download the SILVA database. This is the main database ampliseq uses for taxonomic classification.

Code Block
wget https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip

Manifest file

In the test run, the list of filenames and their associated sample IDs was included in the nextflow.config file. With many sample files, it is much easier to put this information in a separate manifest file.

This is a tab delimited file that contains 3 columns:

  1. ‘sampleID’, with the sample IDs. You can call these whatever you like, but they should be meaningful (e.g. groupA_1, groupA_2, groupB_1, groupB_2, etc.)

  2. ‘forwardReads’. The full path for the forward reads. e.g. /home/myproject/fastq/sample1_S22_L001_R1.fastq.gz

  3. ‘reverseReads’. The full path for the reverse reads. e.g. /home/myproject/fastq/sample1_S22_L001_R2.fastq.gz

This file can be created with Excel and then saved as a tab-delimited file (File → Save as → ‘manifest.txt’ → Text (Tab delimited)), then copied across to your NextFlow project directory (using WinSCP, Cyberduck, etc). Example:

sampleID    forwardReads                                          reverseReads
groupA_1    /home/myproject/fastq/sample1_S22_L001_R1.fastq.gz    /home/myproject/fastq/sample1_S22_L001_R2.fastq.gz
groupA_2    /home/myproject/fastq/sample2_S23_L001_R1.fastq.gz    /home/myproject/fastq/sample2_S23_L001_R2.fastq.gz

etc…
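
Because a wrong header or path is a common cause of failure, a manifest can be checked before it is handed to ampliseq. The following is a rough sketch under stated assumptions: check_manifest is a hypothetical helper, not part of ampliseq, and it only tests the three column names described above and whether each listed file exists.

```bash
# Validate a paired-end manifest: exact header, three tab-separated columns
# per row, and read files that exist on disk.
# Prints one line per problem; returns non-zero if any problem was found.
check_manifest() {
    manifest=$1
    tab=$(printf '\t')
    problems=0
    line_no=0
    while IFS=$tab read -r id fwd rev extra || [ -n "$id" ]; do
        line_no=$((line_no + 1))
        if [ "$line_no" -eq 1 ]; then
            # Column names are case-sensitive.
            if [ "$id" != "sampleID" ] || [ "$fwd" != "forwardReads" ] ||
               [ "$rev" != "reverseReads" ] || [ -n "$extra" ]; then
                echo "line 1: header must be sampleID, forwardReads, reverseReads"
                problems=1
            fi
            continue
        fi
        if [ -z "$id" ] || [ -z "$fwd" ] || [ -z "$rev" ] || [ -n "$extra" ]; then
            echo "line $line_no: expected exactly 3 tab-separated columns"
            problems=1
            continue
        fi
        for f in "$fwd" "$rev"; do
            if [ ! -f "$f" ]; then
                echo "line $line_no: file not found: $f"
                problems=1
            fi
        done
    done < "$manifest"
    return "$problems"
}

# Usage:
#   check_manifest manifest.txt && echo "manifest looks OK"
```

Files saved from Excel on Windows typically end each line with a carriage return, which would make the last path on every row appear missing; passing the file through tr -d '\r' first is one way to rule that out.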

Creating a manifest file at the command line

As mentioned above, spelling mistakes or extra characters in the file paths will cause ampliseq to fail. One way to avoid this is to generate the manifest file on the command line using the Linux tools awk and sed.

...

To create the manifest using awk, paste and sed:

  1. List all the fastq files in the directory (both read pairs)

Code Block
ls *_R1*.fastq.gz > read1

ls *_R2*.fastq.gz > read2

  2. List the sample IDs. If the sample names are part of the file names, they can be extracted using sed. For example:

...

Finally, copy the created manifest.txt to the directory where you will be running ampliseq from.
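
As an alternative to the intermediate read1, read2 and ID files, the whole manifest can be produced in one pass. This is only a sketch under loud assumptions: make_manifest is a hypothetical helper, the reverse file is assumed to have the same name as the forward file with _R1 replaced by _R2, and the sample ID is assumed to be the part of the file name before the first _R1. Adjust these rules to your own naming scheme.

```bash
# Build a paired-end manifest from every *_R1*.fastq.gz file in a directory.
# Writes the manifest (with header) to standard output.
make_manifest() {
    # Resolve to an absolute path, since the manifest needs full paths.
    fastq_dir=$(cd "$1" && pwd) || return 1
    printf 'sampleID\tforwardReads\treverseReads\n'
    for fwd in "$fastq_dir"/*_R1*.fastq.gz; do
        [ -e "$fwd" ] || continue          # no matches: the glob stays literal
        name=$(basename "$fwd")
        id=${name%%_R1*}                   # assumed sample ID rule
        rev_name=$(printf '%s\n' "$name" | sed 's/_R1/_R2/')
        rev="$fastq_dir/$rev_name"
        if [ ! -e "$rev" ]; then
            echo "no reverse read for $name, skipping" >&2
            continue
        fi
        printf '%s\t%s\t%s\n' "$id" "$fwd" "$rev"
    done
}

# Usage:
#   make_manifest /home/myproject/fastq > manifest.txt
```

The sample IDs produced this way follow the file names, so they may still need renaming to match the IDs used in your metadata file.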

Metadata file

This is a tab-separated values file (.tsv) that is required by QIIME2 to compare taxonomic diversity with phenotype (e.g. how diversity varies per experimental treatment). It contains the same sample IDs found in the manifest file and a column for each category of metadata you have for the samples. This may include sequence barcodes, experimental treatment group (e.g. high fat vs low fat) and any other measurements taken, such as age, date collected, tissue type, sex, collection location, weight, length, etc. QIIME2 will compare every metadata column with taxonomic results, then calculate and plot correlations and diversity indices. See here for more details:

...

This file can also be created in Excel and saved as a tab-delimited file. It can be any format, but .tsv is the default (File → Save as → ‘metadata.tsv’ → Text (Tab delimited))
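
Since sample IDs must match exactly between the manifest and the metadata file, a quick comparison can catch typos and case differences early. A minimal sketch, assuming that both files are tab-delimited, have one header row, and keep the sample IDs in their first column; compare_ids is a hypothetical helper name.

```bash
# Print sample IDs that appear in only one of two tab-delimited files.
# Returns non-zero if the two sets of IDs differ.
compare_ids() {
    awk -F '\t' '
        FNR == 1 { file++; next }          # header row: count the file, skip it
        $1 == "" { next }                  # ignore blank lines
        file == 1 { in_a[$1] = 1; next }
        { in_b[$1] = 1 }
        END {
            for (id in in_a) if (!(id in in_b)) { print "only in first file: " id; bad = 1 }
            for (id in in_b) if (!(id in in_a)) { print "only in second file: " id; bad = 1 }
            exit bad + 0
        }
    ' "$1" "$2"
}

# Usage:
#   compare_ids manifest.txt metadata.tsv && echo "sample IDs match"
```

The comparison is case-sensitive on purpose, so groupA_2 and groupa_2 are reported as different IDs.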

nextflow.config file

Create a custom nextflow.config file

...

The remaining lines can stay the same, presuming that you called your metadata file 'metadata.txt' and you have all the files in the directory where you will be running ampliseq from.

Notes on amplicon primers

There are multiple sets of amplicon primers, designed to amplify different regions of the 16S gene. Your sequencing provider should tell you which primers were used.

...

Again, this is only for the Illumina 16S V3 and V4 region amplicons. If you’ve amplified a different region, you’ll need to provide different primers. If you’re using Illumina, look out for overhang sequences!
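
If the primer sequences you were given include an overhang (adapter) prefix, one way to recover the locus-specific part is to strip the known prefix. This is an illustrative sketch only: strip_overhang is a hypothetical helper, and the sequences in the usage example are the forward overhang and V3-V4 forward primer as published in Illumina's 16S library preparation protocol, which you should verify against what your provider actually supplied.

```bash
# Remove a known overhang (adapter) prefix from a primer sequence.
# Prints the primer unchanged if it does not start with the overhang.
strip_overhang() {
    primer=$1
    overhang=$2
    printf '%s\n' "${primer#"$overhang"}"
}

# Usage (sequences are an assumption, see the note above):
#   strip_overhang TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG \
#                  TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
```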

Running ampliseq on PacBio data

ampliseq is designed for paired-end Illumina data, but can be run on single-end PacBio data with a few modifications:

Manifest file

Code Block
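# List the read files and derive the sample IDs from the file names
ls *.fastq.gz > read1
sed 's/\.fastq\.gz$//' read1 > ID
# Single-end data has no read2 file, so paste only ID and read1.
# Replace the path below with the full path to your own fastq directory.
paste ID read1 | awk '{print $1 "\t" "/home/whatmorp/nextflow/pacbio_test/fastq/" $2}' > manifest.txt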

A paired-end manifest requires exactly ‘sampleID forwardReads reverseReads’ as column names.

For single-end data, use ‘sampleID Reads’ instead.

See line 349 of the main.nf file for how single_end samples are read:

.map { row -> [ row.sampleID, file(row.Reads, checkIfExists: true) ] }

compared to the default paired-end line:

.map { row -> [ row.sampleID, [ file(row.forwardReads, checkIfExists: true), file(row.reverseReads, checkIfExists: true) ] ] }

https://github.com/nf-core/ampliseq/blob/master/main.nf
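
The manifest commands in this section write data rows only, so the ‘sampleID Reads’ column names still have to be added as the first line. One possible way to do that is sketched below; with_header is a hypothetical helper, and it assumes the header columns are tab-separated like the rest of the manifest.

```bash
# Print a manifest with the given header line, adding it only if it is missing.
with_header() {
    manifest=$1
    header=$2
    first=$(head -n 1 "$manifest")
    if [ "$first" != "$header" ]; then
        printf '%s\n' "$header"
    fi
    cat "$manifest"
}

# Usage:
#   with_header manifest.txt "$(printf 'sampleID\tReads')" > manifest_single_end.txt
```

Writing to a new file rather than back to manifest.txt avoids truncating the input while it is still being read.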

nextflow.config

Add a line ‘pacbio = true’

This tells ampliseq to run using single_end parameters and also changes some of the DADA2 parameters.

Running NextFlow’s ampliseq pipeline

Make sure Java is loaded (it should be if you are continuing from the above steps; otherwise run ‘module load java’) and that you have started an interactive PBS session (again, you should already be in one if continuing from above).

...

If ampliseq continues to fail after you have rerun it with -resume a couple of times, contact us at eResearch support: eresearch@qut.edu.au
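
Rerunning after a failure can be wrapped in a small helper. This is a sketch rather than a recommended workflow: run_with_resume is a hypothetical function, it relies on Nextflow's resume option being written with a single dash (-resume), and the pipeline arguments in the usage line are placeholders for the command given on this page.

```bash
# Run a command; if it fails, rerun it with -resume appended,
# up to max_retries further attempts. Returns non-zero if every attempt fails.
run_with_resume() {
    max_retries=$1
    shift
    if "$@"; then
        return 0
    fi
    attempt=1
    while [ "$attempt" -le "$max_retries" ]; do
        echo "attempt $attempt of $max_retries with -resume" >&2
        if "$@" -resume; then
            return 0
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Usage (arguments after "nextflow run" are placeholders):
#   run_with_resume 2 nextflow run nf-core/ampliseq <your options>
```

If the wrapper still returns a failure after the retries, that is the point at which to contact support.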

Results

Ampliseq generates multiple output directories in a main ‘results’ directory, including tables, figures, analysis results, etc. See here for details:

...