PacBio SMRT full length 16S analysis
Overview
PacBio 16S amplicons can be analysed with standard 16S analysis pipelines (QIIME2, Mothar, etc) but the sequences need to have some initial preparation that Illumina data (which can natively be run in QIIME2, Mothar, etc) does not.
The reason for this is twofold:
PacBio reads are generally much longer than Illumina - a range between 10,000 - 25,000 bases rather than Illumina’s fixed-length ~70 - 350 bases, and:
Error rates of PacBio data are higher than Illumina.
Fortunately, PacBio has developed a technology that uses both of these differences to overcome the high error-rate problem: Single Molecule Real Time Sequencing.
The long PacBio reads span the target sequence (e.g. full length 16S) many times. This allows a cross-validated consensus sequence to be generated that corrects for sequencing errors.
Basic pipeline
This section provides a skeleton overview of the pipeline commands. if you’ve installed the tools and run this pipeline before, you can simply run these commands on your data. if this is your first time running this analysis, it’s strongly recommended you read the following sections of this article, as they provide important background information.
Start a PBS interactive session
qsub -I -S /bin/bash -l walltime=72:00:00 -l select=1:ncpus=32:mem=64gb
2. Start tmux, so job will continue to run after you close your terminal window
tmux
3. Generate circular consensus sequences
ccs -j 16 --min-rq 0.999 <path-to-sample-xml-file> <output_file>.bam --report-file <sample_name>_ccs_report.txt
Installing SMRTLink command-line tools
Most SMRTLink tutorials use the web interface, but in this pipeline we use the SMRTLink command-line tools. An overview of SMRTLink tools is here:
https://www.pacb.com/wp-content/uploads/SMRT_Tools_Reference_Guide_v600.pdf
Download the latest SMRTLink version from from here, to your HPC home directory in a directory of your choice:
https://www.pacb.com/support/software-downloads/
E.g. using wget to download version 9.0.0 (this may not be the latest version, make sure you check the above link):
wget https://downloads.pacbcloud.com/public/software/installers/smrtlink_9.0.0.92188.zip
Unzip the file (with correct version filename)
unzip smrtlink_*.zip
This will create an executable *.run file, used for installing the tools. To install only the command-line tools:
smrtlink-*.run --rootdir smrtlink --smrttools-only
Use the version of the .run file you downloaded and tell it where to install the tools using the ‘--rootdir’ parameter. For example, if you downloaded version 9.0.0 and you wanted to install it to your /bin/smrtlink directory (make sure you create that directory first - mkdir ~/bin/smrtlink
- your command would be:
Then command-line tools are now in ~/bin/smrtlink/smrtcmds/bin
. You can add this path to your bashrc file, like so https://www.tecmint.com/set-path-variable-linux-permanently/. You don’t have to do this, but this means you’ll have to put the full path for the SMRTLink commands listed in the sections below. For example, where I have used the ccs
command, you’ll need to instead run it as ~/bin/smrtlink/smrtcmds/bin/ccs
.
PacBio SMRT data file structure
When you first get access to your PacBio data, you’ll notice there are multiple files per sample or sample pool (unlike Illumina where you will likely only see one fastq file per sample).
In the <filename>.transferdone
file you’ll see a list of the files for each sample, transferred from the PacBio sequencing machine. The files are:
<filename>.run.metadata.xml (hidden file)
<filename>.metadata.xml (hidden file)
<filename>.subreadset.xml
<filename>.subreads.bam
<filename>.subreads.bam.pbi
<filename>.scraps.bam
<filename>.scraps.bam.pbi
<filename>.adapters.fasta
<filename>.sts.xml
<filename>.baz2bam_1.log
The key file here is the <filename>.subreadset.xml
file, which contains various information on the sequencing run primers used, etc) as well as the data filenames. This xml file can be used as input into any of the SMRTLink command line tools, and will subsequently link to all the relevant data files and information needed to run the analysis.
Note that the <filename>.subreads.bam
is the main analysis-ready datafile and the <filename>.scraps.bam
contains the removed adapters, barcodes and rejected reads (the related .bam.pbi files are the index file for these).
More details can be found here:
https://pacbiofileformats.readthedocs.io/en/3.0/BAM.html
or here:
https://www.pacb.com/wp-content/uploads/SMRT-Link-Getting-Started-Guide-v4.0.0.pdf
Analysis overview
As this analysis represents some preparatory steps to run the data in a full 16S analysis pipeline, it only involves two main analysis steps:
Circular Consensus Sequence (CCS) generation
Demultiplexing
See the below sections for details.
Note that the end result of this analysis is the generation of bam files containing one error-corrected 16S sequence per sample. These can be then run in a standard downstream analysis pipeline, such as QIIME2.
In addition, eResearch have developed a full 16S amplicon analysis pipeline using NextFlow. We recommend you use the bam files generated from this PacBio analysis as input for this pipeline:
nf-core/ampliseq 16S amplicon analysis pipeline (NextFlow Ampliseq)
Running analysis on QUTs HPC
The PacBio data files are large (multiple gigabytes) and the SMRT tools are designed to run on a Linux server, thus a high memory, multi core Linux server is recommended to run this analysis.
Accessing the HPC
QUT staff and HDR students can analyse their PacBio SMRT data on QUTs high performance computing cluster.
If you haven’t been set up or have used the HPC previously, click on this link for information on how to get access to and use the HPC:
Need a link here for HPC access and usage
Creating a shared workspace on the HPC
If you already have access to the HPC, we strongly recommend that you request a shared directory to store and analyse your data. This is so yourself, and relevant members of your research team (e.g. your HPC supervisor) and eResearch bioinformaticians can access your data and assist with the analysis.
To request a shared directory, submit a request to HPC support, telling them what you want the directory to be called (e.g. your_name_16S) and who you want to access it.
https://eresearchqut.atlassian.net/servicedesk/customer/portals
Running your analysis using the Portable Batch System (PBS)
The HPC has multiple ‘nodes’ available, each of which are allocated certain amounts of RAM and CPU cores.
You should run your data on one of these nodes, with a suitable amount of RAM and cores for your analysis needs. When you log on to the HPC you automatically are in the ‘head node’.
Do not run any of your analysis on the head node. Always request a node through the PBS.
To request a node using PBS, submit a shell script containing your RAM/CPU/analysis time requirements and the code needed to run your analysis. For an overview of submitting a PBS job, see here:
Need a link here for creating PBS jobs
Alternatively, you can start up an ‘interactive’ node, using the following:
This asks for a node with 32 CPUs and 64GB of memory, that will run for 72hrs. From our testing this should be adequate to run the following analysis on a PacBio SMRT dataset.
Note that when you request a node, you are put into a queue until a node of those specifications becomes available. Depending on how many people are using the HPC and how much resources they are using, this may take a while, in which case you just need to wait until the job starts (“waiting for job xxxx.pbs to start”).
Once the interactive PBS job begins (you get a command prompt) you can run the following analysis:
Generating consensus sequences
Use the SMRT Tools command-line tool ‘ccs’.
Note the --min-rq 0.999
. This tells ccs to filter for Q30, i.e. 1 error in 1000. By default ccs filters for Q20, i.e. 1 error in 100. Using 16S we are examining genetic differences between species, and these differences can be just a few base pairs, thus a Q30 error rate is a better option than Q20.
In the above code enter the path to the xml filename (<path-to-sample-xml-file>
) associated with your raw data (bam) file. This will be a file called <sample_name>.subreadset.xml
. As mentioned in the ‘PacBio SMRT data file structure’ section above, PacBio generates multiple files per sample (or pooled samples). The xml file lists all the files associated with one sample.
Give the output file a meaningful name, e.g. ‘samples_pool_1.ccs.bam’
ccs generates a report file, called ‘ccs_report.txt’ by default. You should give it a unique name by adding --report-file <sample_name>_ccs_report.txt
(e.g. ‘samples_pool_1_ccs_report.txt') to the ccs command, or else when you run ccs on another sample the previous ccs_report.txt will be overwritten.
Demultiplexing
A complete guide for demultiplexing is found here:
GitHub - PacificBiosciences/barcoding: Lima - Demultiplex Barcoded PacBio Samples
The ccs output is a bam file containing your multiplexed (i.e. pooled) samples. Each sample will have been barcoded. To extract each sample from the pool, run lima
You will need to find out from your sequencing provider which barcode set was used. This is provided as a fasta file containing the full list of barcodes. The barcodes fasta file used as input to run lima, where each barcode in the file is matched against each read. Note that you can provide a fasta file with just the barcodes you know were used in the current pool you are demultiplexing, or you can provide the full
A few of the main barcode sets used by PacBio are:
Barcoded F/R Universal Primers Plate - 96 v1
https://www.pacb.com/wp-content/uploads/Sequel_RSII_96_barcodes_v1.fasta_.zip
Barcoded F/R Universal Primers Plate - 96 v2
https://www.pacb.com/wp-content/uploads/Sequel_96_barcodes_v2.zip
Barcoded Overhang Adapter Kit – 8A/8B (16 pooled samples)
https://www.pacb.com/wp-content/uploads/Sequel_16_Barcodes_v3.zip
It’s likely that your samples were barcoded with one of these sets (and you can run your data against each of these to see if your reads match any of the barcodes), but again, check with the person/company that sequenced your data to be sure.
Run lima on each of your samples pools (bam files output from the ccs step) like so:
In the above example symmetrical barcodes were used - i.e. the same barcode on the 3' and 5' read ends, thus the --same
parameter was used.
If your reads used asymmetrical barcodes, use the --different
parameter.
--split-bam-named
tells lima to output separate bam files, one for each sample. Note that bam files are output for every barcode match, which, if using the --different
parameter can produce 1000’s of bam files, one for each read that match a barcode pair combination. Most of these files will only have a few reads. The true sample files will be the largest ones (which should match the samples and barcodes table provided to you by your sequencing company).
<ccs_output>.bam
is the circular consensus sequence error corrected bam file generated by ccs (see the ccs section in this guide).
Sequel_96_barcodes_v1.fasta
is the fasta of the barcodes used for your samples. Again, check with your sequencing company to get the correct barcodes set.
/path/to/output/pool1/pool1_demux.bam
is a user-provided path and suffix for your samples. In this case I called it ‘pool 1’ and suffixed all the output bam files with ‘pool1_demux’. Create a directory path and suffix that is suitable and meaningful for your data.
Demultiplexing output
Every barcode in the provided fasta file will be matched against the reads in your provided bam file containing your pooled samples.
This will often generate several output bam files where a barcode matched to just a few reads. These are ‘false positives’. The largest files by far, with the majority of reads, should be the correct samples. Check that these match the samples table provided to you from your sequencing company (which should have barcodes used per sample). These are the files you want to use in downstream analysis.
The output bam files are named according to the barcode pair that matched the reads contained within, with the suffix you provided in your lima command.
Additional files are a table of read counts per primer pair, e.g.
IdxFirst | IdxCombined | IdxFirstNamed | IdxCombinedNamed | Counts | MeanScore |
3 | 3 | bc1004 | bc1004 | 2 | 56 |
5 | 5 | bc1006 | bc1006 | 2 | 56 |
7 | 7 | bc1008 | bc1008 | 1 | 25 |
13 | 13 | bc1014 | bc1014 | 1416 | 96 |
16 | 16 | bc1017 | bc1017 | 8254 | 96 |
17 | 17 | bc1018 | bc1018 | 7647 | 95 |
18 | 18 | bc1019 | bc1019 | 8615 | 95 |
You can see that this is the output of a symmetrically barcoded dataset, as the barcodes are the same.
You can also see that barcode ‘bc1004’ matched just 2 reads (thus doesn’t represent a true sample) and barcode bc1017 is a true sample with a large amount of matched reads.
In addition to the table of reads counts is a summary of matched reads, e.g.
ZMWs input (A) : 209855
ZMWs above all thresholds (B) : 200943 (96%)
ZMWs below any threshold (C) : 8912 (4%)
ZMW marginals for (C):
Below min length : 0 (0%)
Below min score : 0 (0%)
Below min end score : 0 (0%)
Below min passes : 0 (0%)
Below min score lead : 455 (5%)
Below min ref span : 361 (4%)
Without SMRTbell adapter : 0 (0%)
Undesired diff pairs : 8407 (94%)
ZMWs for (B):
With same pair : 200943 (100%)
Coefficient of correlation : 52.75%
ZMWs for (A):
Allow diff pair : 209855 (100%)
Allow same pair : 209855 (100%)
Reads for (B):
Above length : 209322 (100%)
Below length : 0 (0%)
Post-demultiplexing file processing
The output files from lima
are bam files, one for each sample. They are by default named as the barcode pair used for that sample. A samples table with barcode pairs and sample IDs should be provided to you by your sequencing company. Rename each barcode-named file to the sample ID in this table.
For other downstream analysis, such as using these samples in the ampliseq pipeline (recommended), these bam files need to be converted to fastq files.
First the bam files need to be sorted using samtools.
Run the following in each directory where the output bam files reside. Note this will loop though each bam file, so ensure only the output bam files are in that directory.
This will sort each bam file and rename it to *sorted.bam
Then convert these sorted bam files to fastq files using bedtools bamToFastq tool.
The fastq files can then be used in downstream analysis. Follow our ampliseq (16S analysis pipeline) guide here:
nf-core/ampliseq 16S amplicon analysis pipeline (NextFlow Ampliseq)