Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Table of Contents

...

If you haven’t been set up or have used the HPC previously, click on this link for information on how to get access to and use the HPC:

Need a link here for HPC access and usage

 

Creating a shared workspace on the HPC

...

To request a node using PBS, submit a shell script containing your RAM/CPU/analysis time requirements and the code needed to run your analysis. For an overview of submitting a PBS job, see here:

Need a link here for creating PBS jobs

Alternatively, you can start up an ‘interactive’ node, using the following:

...

ccs generates a report file, called ‘ccs_report.txt’ by default. You should give it a unique name by adding --report-file <sample_name>_ccs_report.txt (e.g. ‘samples_pool_1_ccs_report.txt') to the ccs command, or else when you run ccs on another sample the previous ccs_report.txt will be overwritten.

 

ccs with Q20 (default)

...

m54105_201123_035001

...

m54105_201125_211823

...

m54105_201125_005856

...

m54105_201126_173825

...

ZMWs input (A) 

...

330737

...

263116

...

217071

...

ZMWs generating CCS (B) 

...

 251126 (75.93%)

...

 208122 (79.10%)

...

 168434 (77.59%)

...

 138080 (80.85%)

...

ZMWs filtered (C) 

...

 79611 (24.07%)

...

 54994 (20.90%)

...

 48637 (22.41%)

...

 32713 (19.15%)

...

Exclusive ZMW counts for (C)

...

Median length filter    

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

Below SNR threshold     

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

Lacking full passes     

...

 54483 (68.44%)

...

 39690 (72.17%)

...

 32964 (67.78%)

...

 22249 (68.01%)

...

Heteroduplex insertions 

...

 470 (0.59%)

...

 306 (0.56%)

...

 280 (0.58%)

...

 191 (0.58%)

...

Coverage drops           

...

 82 (0.10%)

...

 34 (0.06%)

...

 53 (0.11%)

...

 32 (0.10%)

...

Insufficient draft cov  

...

 1045 (1.31%)

...

 889 (1.62%)

...

 635 (1.31%)

...

 610 (1.86%)

...

Draft too different     

...

 809 (1.02%)

...

 337 (0.61%)

...

 430 (0.88%)

...

 160 (0.49%)

...

Draft generation error  

...

 3212 (4.03%)

...

 1570 (2.85%)

...

 1767 (3.63%)

...

 1354 (4.14%)

...

Draft above --max-length

...

 2 (0.00%)

...

 1 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

Draft below --min-length

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

Reads failed polishing  

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

Empty coverage windows  

...

 14 (0.02%)

...

 13 (0.02%)

...

 11 (0.02%)

...

 10 (0.03%)

...

CCS did not converge    

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

 1 (0.00%)

...

CCS below minimum RQ    

...

 19496 (24.49%)

...

 12155 (22.10%)

...

 12497 (25.69%)

...

 8106 (24.78%)

...

Unknown error           

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

Q30

...

m54105_201123_035001

...

m54105_201125_211823

...

m54105_201125_005856

...

m54105_201126_173825

...

ZMWs input          (A)  

...

330737

...

263116

...

217071

...

170793

...

ZMWs generating CCS (B)  

...

 209855 (63.45%)

...

 177981 (67.64%)

...

 138732 (63.91%)

...

 116886 (68.44%)

...

ZMWs filtered       (C)  

...

 120882 (36.55%)

...

 85135 (32.36%)

...

 78339 (36.09%)

...

 53907 (31.56%)

...

Exclusive ZMW counts for (C)

...

Median length filter     

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

Below SNR threshold      

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

Lacking full passes      

...

 54483 (45.07%)

...

 39690 (46.62%)

...

 32964 (42.08%)

...

 22249 (41.27%)

...

Heteroduplex insertions  

...

 470 (0.39%)

...

 306 (0.36%)

...

 280 (0.36%)

...

 191 (0.35%)

...

Coverage drops           

...

 82 (0.07%)

...

 34 (0.04%)

...

 53 (0.07%)

...

 32 (0.06%)

...

Insufficient draft cov   

...

 1045 (0.86%)

...

 889 (1.04%)

...

 635 (0.81%)

...

 610 (1.13%)

...

Draft too different      

...

 809 (0.67%)

...

 337 (0.40%)

...

 430 (0.55%)

...

 160 (0.30%)

...

Draft generation error   

...

 3212 (2.66%)

...

 1570 (1.84%)

...

 1767 (2.26%)

...

 1354 (2.51%)

...

Draft above --max-length 

...

 2 (0.00%)

...

 1 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

Draft below --min-length 

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

Reads failed polishing   

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

Empty coverage windows   

...

 14 (0.01%)

...

 13 (0.02%)

...

 11 (0.01%)

...

 10 (0.02%)

...

CCS did not converge     

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

 1 (0.00%)

...

CCS below minimum RQ     

...

 60767 (50.27%)

...

 42296 (49.68%)

...

 42199 (53.87%)

...

 29300 (54.35%)

...

Unknown error            

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

...

 0 (0.00%)

DemultiplexingDemultiplexing

A complete guide for demultiplexing is found here:

https://github.com/PacificBiosciences/barcoding

The ccs output is a bam file containing your multiplexed (i.e. pooled) samples. Each sample will have been barcoded. To extract each sample from the pool, run lima

You will need to find out from your sequencing provider which barcode set was used. This is provided as a fasta file containing the full list of barcodes. The barcodes fasta file used as input to run lima, where each barcode in the file is matched against each read. Note that you can provide a fasta file with just the barcodes you know were used in the current pool you are demultiplexing, or you can provide the full

A few of the main barcode sets used by PacBio are:

Barcoded F/R Universal Primers Plate - 96 v1

https://www.pacb.com/wp-content/uploads/Sequel_RSII_96_barcodes_v1.fasta_.zip

Barcoded F/R Universal Primers Plate - 96 v2

https://www.pacb.com/wp-content/uploads/Sequel_96_barcodes_v2.zip

Barcoded Overhang Adapter Kit – 8A/8B (16 pooled samples)

https://www.pacb.com/wp-content/uploads/Sequel_16_Barcodes_v3.zip

It’s likely that your samples were barcoded with one of these sets (and you can run your data against each of these to see if your reads match any of the barcodes), but again, check with the person/company that sequenced your data to be sure.

Run lima on each of your samples pools (bam files output from the ccs step) like so:

Code Block
lima --same --ccs --split-bam-named <ccs_output>.bam Sequel_96_barcodes_v1.fasta /path/to/output/pool1/pool1_demux.bam

In the above example symmetrical barcodes were used - i.e. the same barcode on the 3' and 5' read ends, thus the --same parameter was used.

If your reads used asymmetrical barcodes, use the --different parameter.

--split-bam-named tells lima to output separate bam files, one for each sample. Note that bam files are output for every barcode match, which, if using the --different parameter can produce 1000’s of bam files, one for each read that match a barcode pair combination. Most of these files will only have a few reads. The true sample files will be the largest ones (which should match the samples and barcodes table provided to you by your sequencing company).

<ccs_output>.bam is the circular consensus sequence error corrected bam file generated by ccs (see the ccs section in this guide).

Sequel_96_barcodes_v1.fasta is the fasta of the barcodes used for your samples. Again, check with your sequencing company to get the correct barcodes set.

/path/to/output/pool1/pool1_demux.bam is a user-provided path and suffix for your samples. In this case I called it ‘pool 1’ and suffixed all the output bam files with ‘pool1_demux’. Create a directory path and suffix that is suitable and meaningful for your data.

Demultiplexing output

Every barcode in the provided fasta file will be matched against the reads in your provided bam file containing your pooled samples.

This will often generate several output bam files where a barcode matched to just a few reads. These are ‘false positives’. The largest files by far, with the majority of reads, should be the correct samples. Check that these match the samples table provided to you from your sequencing company (which should have barcodes used per sample). These are the files you want to use in downstream analysis.

The output bam files are named according to the barcode pair that matched the reads contained within, with the suffix you provided in your lima command.

Additional files are a table of read counts per primer pair, e.g.

IdxFirst

IdxCombined

IdxFirstNamed

IdxCombinedNamed

Counts

MeanScore

3

3

bc1004

bc1004

2

56

5

5

bc1006

bc1006

2

56

7

7

bc1008

bc1008

1

25

13

13

bc1014

bc1014

1416

96

16

16

bc1017

bc1017

8254

96

17

17

bc1018

bc1018

7647

95

18

18

bc1019

bc1019

8615

95

You can see that this is the output of a symmetrically barcoded dataset, as the barcodes are the same.

You can also see that barcode ‘bc1004’ matched just 2 reads (thus doesn’t represent a true sample) and barcode bc1017 is a true sample with a large amount of matched reads.

In addition to the table of reads counts is a summary of matched reads, e.g.

ZMWs input (A) : 209855
ZMWs above all thresholds (B) : 200943 (96%)
ZMWs below any threshold (C) : 8912 (4%)

ZMW marginals for (C):
Below min length : 0 (0%)
Below min score : 0 (0%)
Below min end score : 0 (0%)
Below min passes : 0 (0%)
Below min score lead : 455 (5%)
Below min ref span : 361 (4%)
Without SMRTbell adapter : 0 (0%)
Undesired diff pairs : 8407 (94%)

ZMWs for (B):
With same pair : 200943 (100%)
Coefficient of correlation : 52.75%

ZMWs for (A):
Allow diff pair : 209855 (100%)
Allow same pair : 209855 (100%)

Reads for (B):
Above length : 209322 (100%)
Below length : 0 (0%)