Overview

PacBio 16S amplicons can be analysed with standard 16S analysis pipelines (QIIME2, Mothur, etc.), but the sequences need some initial preparation that Illumina data (which can be run natively in QIIME2, Mothur, etc.) does not.

The reason for this is twofold:

  1. PacBio reads are generally much longer than Illumina reads: typically 10,000 - 25,000 bases, rather than Illumina’s fixed length of ~70 - 350 bases; and

  2. The raw (per-base) error rate of PacBio data is higher than Illumina’s.

Fortunately, PacBio has developed a technology that exploits the first of these differences to overcome the second: Single Molecule, Real-Time (SMRT) sequencing with Circular Consensus Sequencing (CCS).

The long PacBio reads span the target sequence (e.g. the full-length 16S gene) many times. This allows a cross-validated consensus sequence to be generated that corrects for sequencing errors.


Installing SMRTLink command-line tools
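The ccs tool used below ships with PacBio’s SMRT Link suite (the command-line subset is often called SMRT Tools). On QUT’s HPC the tools may already be installed or available as a module, so check with eResearch first. Otherwise, a minimal sketch of a tools-only install, assuming you have downloaded the installer from PacBio’s website (the version number here is illustrative):

Code Block
languagebash
# Illustrative only - substitute the installer file you actually downloaded from PacBio
# The --smrttools-only flag installs just the command-line tools (no web GUI services)
./smrtlink_11.0.0.146107.run --rootdir $HOME/smrtlink --smrttools-only

# Add the tools to your PATH so 'ccs', 'lima' etc. can be run directly
export PATH=$HOME/smrtlink/smrtcmds/bin:$PATH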

PacBio SMRT data file structure

When you first get access to your PacBio data, you’ll notice there are multiple files per sample or sample pool (unlike Illumina, where you will likely see only one fastq file per sample).
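As an illustration only, the files for one SMRT cell (Sequel subread format) might look something like the following; the ‘movie’ name prefix will differ for your run:

Code Block
languagebash
# Hypothetical listing of one SMRT cell's output
m54321_220101_123456.subreads.bam       # raw subreads
m54321_220101_123456.subreads.bam.pbi   # PacBio index for the bam
m54321_220101_123456.scraps.bam         # adapter/low-quality sequence removed from the subreads
m54321_220101_123456.subreadset.xml     # dataset XML listing all files for this sample/pool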

Analysis overview

As this analysis consists of preparatory steps for running the data through a full 16S analysis pipeline, it involves only two main steps:

  1. Circular Consensus Sequence (CCS) generation

  2. Demultiplexing

See the below sections for details.

Note that the end result of this analysis is a set of bam files, one per sample, containing error-corrected 16S sequences. These can then be used as input to a standard downstream analysis pipeline, such as QIIME2.
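If your chosen pipeline expects fastq rather than bam input, the bam2fastq utility included in SMRT Tools can convert the error-corrected bam. A minimal sketch, with placeholder file names:

Code Block
languagebash
# Convert an error-corrected CCS bam to gzipped fastq (file names are placeholders)
bam2fastq -o sample1_ccs sample1.ccs.bam   # writes sample1_ccs.fastq.gz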

In addition, eResearch have developed a full 16S amplicon analysis pipeline using Nextflow. We recommend you use the bam files generated from this PacBio analysis as input for this pipeline:

Insert link here for ampliseq pipeline

Running the analysis on QUT’s HPC

The PacBio data files are large (multiple gigabytes) and the SMRT Tools are designed to run on Linux, so a high-memory, multi-core Linux server is recommended for this analysis.

Accessing the HPC

QUT staff and HDR students can analyse their PacBio SMRT data on QUT’s high-performance computing (HPC) cluster.

If you haven’t been set up on, or haven’t used, the HPC previously, click on this link for information on how to get access to and use the HPC:

Need a link here for HPC access and usage

Creating a shared workspace on the HPC

If you already have access to the HPC, we strongly recommend that you request a shared directory in which to store and analyse your data. This allows you, relevant members of your research team (e.g. your HPC supervisor), and eResearch bioinformaticians to access the data and assist with the analysis.

To request a shared directory, submit a request to HPC support, telling them what you want the directory to be called (e.g. your_name_16S) and who should have access to it.

https://eresearchqut.atlassian.net/servicedesk/customer/portals

Running your analysis using the Portable Batch System (PBS)

The HPC has multiple ‘nodes’ available, each of which is allocated a certain amount of RAM and a number of CPU cores.

You should run your analysis on one of these nodes, with an amount of RAM and number of cores suited to your analysis needs. When you log on to the HPC you are automatically placed on the ‘head node’.

Do not run any of your analysis on the head node. Always request a node through the PBS.

To request a node using PBS, submit a shell script containing your RAM/CPU/walltime requirements and the code needed to run your analysis (an example script is sketched below the link). For an overview of submitting a PBS job, see here:

Need a link here for creating PBS jobs
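As an illustration, a minimal PBS submission script for the consensus step described below might look like this (the job name, resource requests and file names are placeholders to adapt):

Code Block
languagebash
#!/bin/bash
#PBS -N ccs_16S                        # job name (placeholder)
#PBS -l walltime=72:00:00              # maximum run time
#PBS -l select=1:ncpus=32:mem=64gb     # one node, 32 cores, 64GB RAM

# PBS starts jobs in your home directory; move to where the job was submitted from
cd $PBS_O_WORKDIR

# Example analysis command (see 'Generating consensus sequences' below)
ccs -j 32 --min-rq 0.999 sample1.subreadset.xml sample1.ccs.bam --report-file sample1_ccs_report.txt

Submit the script with qsub <script_name>.sh.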

Alternatively, you can start up an ‘interactive’ node, using the following:

Code Block
languagebash
qsub -I -S /bin/bash -l walltime=72:00:00 -l select=1:ncpus=32:mem=64gb

This requests a node with 32 CPUs and 64GB of memory that will run for 72 hours. From our testing this should be adequate to run the following analysis on a PacBio SMRT dataset.

Note that when you request a node, you are put into a queue until a node with those specifications becomes available. Depending on how many people are using the HPC and how many resources they have requested, this may take a while; in that case you just need to wait until the job starts (“waiting for job xxxx.pbs to start”).
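You can check the status of your queued or running jobs at any time with the standard PBS qstat command:

Code Block
languagebash
# Show the status of your own jobs (Q = queued, R = running)
qstat -u $USER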

Once the interactive PBS job begins (you get a command prompt) you can run the following analysis:

Generating consensus sequences

Use the SMRT Tools command-line tool ‘ccs’.

Code Block
languagebash
ccs -j 16 --min-rq 0.999 <sample_name>.subreadset.xml <output_file>.bam --report-file <sample_name>_ccs_report.txt

Note the --min-rq 0.999 option. This tells ccs to filter for Q30, i.e. 1 error in 1,000 bases; by default, ccs filters for Q20, i.e. 1 error in 100. With 16S we are examining genetic differences between species, and these differences can be just a few base pairs, so Q30 is a better choice than Q20.
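For reference, --min-rq is the minimum predicted read accuracy, which relates to the Phred quality score as accuracy = 1 - 10^(-Q/10). Q30 gives 1 - 10^(-3) = 0.999, while the default Q20 corresponds to --min-rq 0.99.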

In the above code, enter the xml filename associated with your raw data (bam) file. As mentioned in the ‘PacBio SMRT data file structure’ section above, PacBio generates multiple files per sample (or pooled samples); the xml file lists all the files associated with one sample.

Give the output file a meaningful name, e.g. ‘<sample_name>.ccs.bam’

ccs generates a report file, called ‘ccs_report.txt’ by default. You should give it a unique name by adding --report-file <sample_name>_ccs_report.txt to the ccs command (otherwise, when you run ccs on another sample, the original ccs_report.txt will be overwritten).
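If you have multiple samples to process, one way to avoid overwriting anything is to loop over the subreadset xml files and derive each output and report name from the sample name. A sketch (adjust -j to the number of cores on your node):

Code Block
languagebash
# Run ccs on every subreadset xml in the current directory,
# naming the output bam and report file after each sample
for xml in *.subreadset.xml; do
    name=${xml%.subreadset.xml}
    ccs -j 16 --min-rq 0.999 "$xml" "${name}.ccs.bam" --report-file "${name}_ccs_report.txt"
done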

Demultiplexing