PacBio SMRT full length 16S analysis

Overview

PacBio 16S amplicons can be analysed with standard 16S analysis pipelines (QIIME2, Mothar, etc) but the sequences need to have some initial preparation that Illumina data (which can natively be run in QIIME2, Mothar, etc) does not.

The reason for this is twofold:

PacBio reads are generally much longer than Illumina - a range between 10,000 - 25,000 bases rather than Illumina’s fixed-length ~70 - 350 bases, and:
Error rates of PacBio data are higher than Illumina.

Fortunately, PacBio has developed a technology that uses both of these differences to overcome the high error-rate problem: Single Molecule Real Time Sequencing.

The long PacBio reads span the target sequence (e.g. full length 16S) many times. This allows a cross-validated consensus sequence to be generated that corrects for sequencing errors.

Basic pipeline

This section provides a skeleton overview of the pipeline commands. if you’ve installed the tools and run this pipeline before, you can simply run these commands on your data. if this is your first time running this analysis, it’s strongly recommended you read the following sections of this article, as they provide important background information.

Start a PBS interactive session

qsub -I -S /bin/bash -l walltime=72:00:00 -l select=1:ncpus=32:mem=64gb

2. Start tmux, so job will continue to run after you close your terminal window

tmux

3. Generate circular consensus sequences

ccs -j 16 --min-rq 0.999 <path-to-sample-xml-file> <output_file>.bam --report-file <sample_name>_ccs_report.txt

Installing SMRTLink command-line tools

Most SMRTLink tutorials use the web interface, but in this pipeline we use the SMRTLink command-line tools. An overview of SMRTLink tools is here:

https://www.pacb.com/wp-content/uploads/SMRT_Tools_Reference_Guide_v600.pdf

Download the latest SMRTLink version from from here, to your HPC home directory in a directory of your choice:

https://www.pacb.com/support/software-downloads/

E.g. using wget to download version 9.0.0 (this may not be the latest version, make sure you check the above link):

wget https://downloads.pacbcloud.com/public/software/installers/smrtlink_9.0.0.92188.zip

Unzip the file (with correct version filename)

unzip smrtlink_*.zip

This will create an executable *.run file, used for installing the tools. To install only the command-line tools:

smrtlink-*.run --rootdir smrtlink --smrttools-only

Use the version of the .run file you downloaded and tell it where to install the tools using the ‘--rootdir’ parameter. For example, if you downloaded version 9.0.0 and you wanted to install it to your /bin/smrtlink directory (make sure you create that directory first - mkdir ~/bin/smrtlink - your command would be:

smrtlink_9.0.0.92188.run --rootdir ~/bin/smrtlink --smrttools-only

Then command-line tools are now in ~/bin/smrtlink/smrtcmds/bin . You can add this path to your bashrc file, like so https://www.tecmint.com/set-path-variable-linux-permanently/. You don’t have to do this, but this means you’ll have to put the full path for the SMRTLink commands listed in the sections below. For example, where I have used the ccs command, you’ll need to instead run it as ~/bin/smrtlink/smrtcmds/bin/ccs.

PacBio SMRT data file structure

When you first get access to your PacBio data, you’ll notice there are multiple files per sample or sample pool (unlike Illumina where you will likely only see one fastq file per sample).

In the <filename>.transferdone file you’ll see a list of the files for each sample, transferred from the PacBio sequencing machine. The files are:

<filename>.run.metadata.xml (hidden file)
<filename>.metadata.xml (hidden file)
<filename>.subreadset.xml
<filename>.subreads.bam
<filename>.subreads.bam.pbi
<filename>.scraps.bam
<filename>.scraps.bam.pbi
<filename>.adapters.fasta
<filename>.sts.xml
<filename>.baz2bam_1.log

The key file here is the <filename>.subreadset.xml file, which contains various information on the sequencing run primers used, etc) as well as the data filenames. This xml file can be used as input into any of the SMRTLink command line tools, and will subsequently link to all the relevant data files and information needed to run the analysis.

Note that the <filename>.subreads.bam is the main analysis-ready datafile and the <filename>.scraps.bam contains the removed adapters, barcodes and rejected reads (the related .bam.pbi files are the index file for these).

More details can be found here:

https://pacbiofileformats.readthedocs.io/en/3.0/BAM.html

or here:

https://www.pacb.com/wp-content/uploads/SMRT-Link-Getting-Started-Guide-v4.0.0.pdf

Analysis overview

As this analysis represents some preparatory steps to run the data in a full 16S analysis pipeline, it only involves two main analysis steps:

Circular Consensus Sequence (CCS) generation
Demultiplexing

See the below sections for details.

Note that the end result of this analysis is the generation of bam files containing one error-corrected 16S sequence per sample. These can be then run in a standard downstream analysis pipeline, such as QIIME2.

In addition, eResearch have developed a full 16S amplicon analysis pipeline using NextFlow. We recommend you use the bam files generated from this PacBio analysis as input for this pipeline:

16S amplicon analysis pipeline (NextFlow Ampliseq)

Running analysis on QUTs HPC

The PacBio data files are large (multiple gigabytes) and the SMRT tools are designed to run on a Linux server, thus a high memory, multi core Linux server is recommended to run this analysis.

Accessing the HPC

QUT staff and HDR students can analyse their PacBio SMRT data on QUTs high performance computing cluster.

If you haven’t been set up or have used the HPC previously, click on this link for information on how to get access to and use the HPC:

Need a link here for HPC access and usage

Creating a shared workspace on the HPC

If you already have access to the HPC, we strongly recommend that you request a shared directory to store and analyse your data. This is so yourself, and relevant members of your research team (e.g. your HPC supervisor) and eResearch bioinformaticians can access your data and assist with the analysis.

To request a shared directory, submit a request to HPC support, telling them what you want the directory to be called (e.g. your_name_16S) and who you want to access it.

https://eresearchqut.atlassian.net/servicedesk/customer/portals

Running your analysis using the Portable Batch System (PBS)

The HPC has multiple ‘nodes’ available, each of which are allocated certain amounts of RAM and CPU cores.

You should run your data on one of these nodes, with a suitable amount of RAM and cores for your analysis needs. When you log on to the HPC you automatically are in the ‘head node’.

Do not run any of your analysis on the head node. Always request a node through the PBS.

To request a node using PBS, submit a shell script containing your RAM/CPU/analysis time requirements and the code needed to run your analysis. For an overview of submitting a PBS job, see here:

Need a link here for creating PBS jobs

Alternatively, you can start up an ‘interactive’ node, using the following:

qsub -I -S /bin/bash -l walltime=72:00:00 -l select=1:ncpus=32:mem=64gb

This asks for a node with 32 CPUs and 64GB of memory, that will run for 72hrs. From our testing this should be adequate to run the following analysis on a PacBio SMRT dataset.

Note that when you request a node, you are put into a queue until a node of those specifications becomes available. Depending on how many people are using the HPC and how much resources they are using, this may take a while, in which case you just need to wait until the job starts (“waiting for job xxxx.pbs to start”).

Once the interactive PBS job begins (you get a command prompt) you can run the following analysis:

Generating consensus sequences

Use the SMRT Tools command-line tool ‘ccs’.

ccs -j 16 --min-rq 0.999 <path-to-sample-xml-file> <output_file>.bam --report-file <sample_name>_ccs_report.txt

Note the --min-rq 0.999 . This tells ccs to filter for Q30, i.e. 1 error in 1000. By default ccs filters for Q20, i.e. 1 error in 100. Using 16S we are examining genetic differences between species, and these differences can be just a few base pairs, thus a Q30 error rate is a better option than Q20.

In the above code enter the path to the xml filename (<path-to-sample-xml-file>) associated with your raw data (bam) file. This will be a file called <sample_name>.subreadset.xml. As mentioned in the ‘PacBio SMRT data file structure’ section above, PacBio generates multiple files per sample (or pooled samples). The xml file lists all the files associated with one sample.

Give the output file a meaningful name, e.g. ‘samples_pool_1.ccs.bam’

ccs generates a report file, called ‘ccs_report.txt’ by default. You should give it a unique name by adding --report-file <sample_name>_ccs_report.txt (e.g. ‘samples_pool_1_ccs_report.txt') to the ccs command, or else when you run ccs on another sample the previous ccs_report.txt will be overwritten.

ccs with Q20 (default)

	m54105_201123_035001	m54105_201125_211823	m54105_201125_005856	m54105_201126_173825
ZMWs input (A)	330737	263116	217071
ZMWs generating CCS (B)	251126 (75.93%)	208122 (79.10%)	168434 (77.59%)	138080 (80.85%)
ZMWs filtered (C)	79611 (24.07%)	54994 (20.90%)	48637 (22.41%)	32713 (19.15%)
Exclusive ZMW counts for (C)
Median length filter	0 (0.00%)	0 (0.00%)	0 (0.00%)	0 (0.00%)
Below SNR threshold	0 (0.00%)	0 (0.00%)	0 (0.00%)	0 (0.00%)
Lacking full passes	54483 (68.44%)	39690 (72.17%)	32964 (67.78%)	22249 (68.01%)
Heteroduplex insertions	470 (0.59%)	306 (0.56%)	280 (0.58%)	191 (0.58%)
Coverage drops	82 (0.10%)	34 (0.06%)	53 (0.11%)	32 (0.10%)
Insufficient draft cov	1045 (1.31%)	889 (1.62%)	635 (1.31%)	610 (1.86%)
Draft too different	809 (1.02%)	337 (0.61%)	430 (0.88%)	160 (0.49%)
Draft generation error	3212 (4.03%)	1570 (2.85%)	1767 (3.63%)	1354 (4.14%)
Draft above --max-length	2 (0.00%)	1 (0.00%)	0 (0.00%)	0 (0.00%)
Draft below --min-length	0 (0.00%)	0 (0.00%)	0 (0.00%)	0 (0.00%)
Reads failed polishing	0 (0.00%)	0 (0.00%)	0 (0.00%)	0 (0.00%)
Empty coverage windows	14 (0.02%)	13 (0.02%)	11 (0.02%)	10 (0.03%)
CCS did not converge	0 (0.00%)	0 (0.00%)	0 (0.00%)	1 (0.00%)
CCS below minimum RQ	19496 (24.49%)	12155 (22.10%)	12497 (25.69%)	8106 (24.78%)
Unknown error	0 (0.00%)	0 (0.00%)	0 (0.00%)	0 (0.00%)

Q30

	m54105_201123_035001	m54105_201125_211823	m54105_201125_005856	m54105_201126_173825
ZMWs input (A)	330737	263116	217071	170793
ZMWs generating CCS (B)	209855 (63.45%)	177981 (67.64%)	138732 (63.91%)	116886 (68.44%)
ZMWs filtered (C)	120882 (36.55%)	85135 (32.36%)	78339 (36.09%)	53907 (31.56%)

Exclusive ZMW counts for (C)
Median length filter	0 (0.00%)	0 (0.00%)	0 (0.00%)	0 (0.00%)
Below SNR threshold	0 (0.00%)	0 (0.00%)	0 (0.00%)	0 (0.00%)
Lacking full passes	54483 (45.07%)	39690 (46.62%)	32964 (42.08%)	22249 (41.27%)
Heteroduplex insertions	470 (0.39%)	306 (0.36%)	280 (0.36%)	191 (0.35%)
Coverage drops	82 (0.07%)	34 (0.04%)	53 (0.07%)	32 (0.06%)
Insufficient draft cov	1045 (0.86%)	889 (1.04%)	635 (0.81%)	610 (1.13%)
Draft too different	809 (0.67%)	337 (0.40%)	430 (0.55%)	160 (0.30%)
Draft generation error	3212 (2.66%)	1570 (1.84%)	1767 (2.26%)	1354 (2.51%)
Draft above --max-length	2 (0.00%)	1 (0.00%)	0 (0.00%)	0 (0.00%)
Draft below --min-length	0 (0.00%)	0 (0.00%)	0 (0.00%)	0 (0.00%)
Reads failed polishing	0 (0.00%)	0 (0.00%)	0 (0.00%)	0 (0.00%)
Empty coverage windows	14 (0.01%)	13 (0.02%)	11 (0.01%)	10 (0.02%)
CCS did not converge	0 (0.00%)	0 (0.00%)	0 (0.00%)	1 (0.00%)
CCS below minimum RQ	60767 (50.27%)	42296 (49.68%)	42199 (53.87%)	29300 (54.35%)
Unknown error	0 (0.00%)	0 (0.00%)	0 (0.00%)	0 (0.00%)