Anacapa is a toolkit for constructing reference databases and assigning taxonomy from eDNA sequences.
For more details on Anacapa, please read through the Anacapa GitHub page:
Purpose of this guide
This guide steps you through running your eDNA sequence data through the Anacapa toolkit on QUT’s HPC. The published Anacapa documentation on GitHub can be hard to follow and needs some modification to work on the HPC.
This guide was developed and written by QUT’s eResearch team. For information about this guide or other bioinformatic analyses, contact us at eresearch@qut.edu.au
Requirements
Your eDNA sample files, which should be demultiplexed Illumina sequences in fastq format. If they are not demultiplexed or not Illumina, contact us at eResearch: eresearch@qut.edu.au
A table of the barcodes and adapters used to amplify your sequences. If you don’t already have these, you can usually request them from the organisation that sequenced your samples.
A QUT HPC account.
A basic knowledge of the Linux command line and of QUT’s HPC is strongly recommended but not required. All the command line instructions are explained explicitly and can usually be copied and pasted into your HPC command line.
An overview of HPC commands and usage, as well as a link for requesting access to the HPC (if you don’t currently have an HPC account) is here:
There are plenty of online guides that teach basic Linux command line usage, for example:
https://www.youtube.com/watch?v=cBokz0LTizk&t=1s
https://www.youtube.com/watch?v=s3ii48qYBxA
How to use this guide
In this guide, commands to be entered by the user are shown in grey boxes like the one below. Most commands can be copied and pasted ‘as-is’ into your command line. Some need to be modified to suit your data (e.g. target species) or file locations.
You can hover your mouse over the code box to see a ‘copy’ button on the right. Just click this to copy all the code in the box.
Try this with the code box below (this will show the directory paths defined by your profile).
echo $PATH
Step 1: initial setup
You will be running various processes on the HPC that require a lot of processing power. Do not run these commands on the 'head node' (the node you enter when you log on). Instead, run them either through a PBS script or in an interactive PBS session, both of which run your processes on another node.
The details of creating and submitting a PBS script can be found here:
If you’re testing several tools or running multiple separate commands then an interactive PBS session may be preferable. Below is the command to create an interactive PBS session with 8 CPUs, 64GB memory and a maximum running time of 11 hours (12 hours is the absolute maximum that can be requested for an interactive session).
qsub -I -S /bin/bash -l walltime=11:00:00 -l select=1:ncpus=8:mem=64gb
This request gets put in the HPC queue until there is an available node with sufficient resources. This may take several minutes, or possibly longer.
Create your working directory
From your home directory, create a subdirectory called ‘anacapa’ and enter this subdirectory.
cd ~
mkdir anacapa
cd anacapa
Create a directory for your fastq files and move them there
The fastq directory should be created in your anacapa directory.
mkdir ~/anacapa/fastq
Move your fastq files to this directory. Your fastq files will need to be uploaded to the HPC first. To copy them from a Windows PC to the HPC, you can use a tool like WinSCP: https://winscp.net/eng/index.php
You can either copy them from your local PC directly to the fastq directory you created (using something like WinSCP) or, if they are already on the HPC but in a different directory, move to that directory ('cd ~/directory_where_fastq_files_are') and then copy them across to the anacapa/fastq directory you created:
cp *.fastq.gz ~/anacapa/fastq
*NOTE: the above command assumes your fastq files have the ‘.fastq.gz’ suffix, which is the most common. But they may be uncompressed (i.e. just ‘samplename.fastq’) or named something like samplename.fq.gz, in which case you’d change the above to 'cp *.fq.gz ~/anacapa/fastq'
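Because the suffix can vary, it can help to confirm what actually arrived in the fastq directory. The sketch below is only an illustration: `list_fastq` is a hypothetical helper name, and the four suffixes it looks for are an assumption based on the variants mentioned in the note above.

```shell
# List FASTQ files in a directory, whichever common suffix they use.
# Usage: list_fastq DIR
list_fastq() {
    # -maxdepth 1 keeps the search to the directory itself (no subdirectories).
    find "$1" -maxdepth 1 -type f \
        \( -name '*.fastq.gz' -o -name '*.fq.gz' \
           -o -name '*.fastq' -o -name '*.fq' \) \
        | sort
}

# Example: count the FASTQ files in the directory created earlier.
# list_fastq ~/anacapa/fastq | wc -l
```

If the count does not match the number of files you expected to copy, check the suffix of your files before continuing.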
Step 2: Running anacapa on Singularity
Anacapa uses many tools, and installing all of them on the HPC would be difficult and time consuming. Fortunately, the developers of Anacapa have created a Singularity image that contains all the required tools. Once the image is downloaded, all the standard tools and commands in the Anacapa guide can be run by prefixing them with ‘singularity exec anacapa-1.5.0.img’, which runs the subsequent command in the Singularity container.
Information about running Anacapa in the singularity container is found here:
Download the Anacapa Singularity container to your anacapa directory.
cd ~/anacapa
wget https://zenodo.org/record/2602180/files/anacapa-1.5.0.img
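Before moving on, it may be worth confirming that the image downloaded completely. The following is a minimal sketch, not part of the Anacapa documentation: `check_image` is a hypothetical helper that only tests that the file exists and is not empty.

```shell
# Check that the Singularity image exists and is not empty.
# Usage: check_image PATH_TO_IMAGE
check_image() {
    if [ ! -s "$1" ]; then
        echo "Missing or empty image: $1" >&2
        return 1
    fi
    echo "Image found: $1"
}

# Example (run in a PBS session, not on the head node). The second command
# assumes the container provides 'sh' and has obiconvert on its PATH:
# check_image ~/anacapa/anacapa-1.5.0.img &&
#     singularity exec ~/anacapa/anacapa-1.5.0.img sh -c 'command -v obiconvert'
```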
Step 3: Create reference libraries using CRUX
CRUX (Creating-Reference-libraries-Using-eXisting-tools) generates taxonomic reference libraries by querying your primers against an ecoPCR database. The purpose of Step 3 is to download the required databases and then use them to generate this ecoPCR database.
Anacapa contains several pre-built ecoPCR databases, based on defined primer sets, which can be seen in the ‘High level overview’ section on the anacapa page: GitHub - Anacapa.
If you are using a set of primers that aren’t on this list you’ll need to construct your own ecoPCR database, by following this guide.
For this guide we will be using eDNA sequences amplified by the 16Smam primer pair:
16S701F 5′-CGGTTGGGGTGACCTCGGA-3′
16S787R 5′-GCTGTTATCCCTAGGGTAACT-3′
These primers were developed to amplify mammal sequences (which is an important point, as you will download the EMBL databases that correspond to the taxonomic group you’re interested in).
To run CRUX you first need to download and set up 4 databases: 1) NCBI taxonomy, 2) NCBI BLAST nt library, 3) NCBI accession2taxonomy, 4) EMBL std nucleotide database (for your taxonomic group of interest).
First, create the directory to hold these databases:
mkdir ~/anacapa/crux_db
Download NCBI taxonomy database
Download and decompress the database to a subdirectory called TAXO:
mkdir ~/anacapa/crux_db/TAXO
cd ~/anacapa/crux_db/TAXO
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzvf taxdump.tar.gz
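The extracted taxonomy dump should contain, among other files, names.dmp and nodes.dmp. As a quick sanity check, something like the sketch below could be used; `check_taxdump` is a hypothetical helper name and it checks only those two files.

```shell
# Check that the two core NCBI taxonomy files were extracted.
# Usage: check_taxdump DIR
check_taxdump() {
    for f in names.dmp nodes.dmp; do
        if [ ! -s "$1/$f" ]; then
            echo "Missing or empty $f in $1" >&2
            return 1
        fi
    done
    echo "Taxonomy dump looks complete in $1"
}

# Example:
# check_taxdump ~/anacapa/crux_db/TAXO
```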
Download the NCBI nt library
Download and decompress the database to a subdirectory called NCBI_blast_nt:
mkdir ~/anacapa/crux_db/NCBI_blast_nt
cd ~/anacapa/crux_db/NCBI_blast_nt
wget 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt*'
for file in nt*.tar.gz; do tar -zxf "$file"; done
*NOTE: This is the full NCBI nucleotide database. It is VERY large (~170GB). In the future eResearch will make a centralised, frequently updated copy of this available on the HPC for all researchers, so that it doesn’t have to be downloaded multiple times. In the meantime, you can download it to ~/anacapa/crux_db/NCBI_blast_nt, but please delete the database once you have completed your anacapa analysis. We don’t want multiple copies of this same database on the HPC.
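With a download this large, a truncated archive is a real possibility. The sketch below assumes that the 'nt*' pattern also fetched the .md5 checksum files that sit alongside the archives on the NCBI server, and that GNU `md5sum` is available; `verify_md5` is a hypothetical helper name. If no .md5 files are present it checks nothing and reports success, so treat it as a convenience rather than a guarantee.

```shell
# Verify every *.md5 checksum file found in a directory.
# Usage: verify_md5 DIR
verify_md5() {
    # Run in a subshell so the caller's working directory is unchanged.
    (
        cd "$1" || exit 1
        status=0
        for sum in *.md5; do
            [ -e "$sum" ] || continue          # no .md5 files present
            # --quiet prints only the files that fail verification.
            md5sum -c --quiet "$sum" || status=1
        done
        exit "$status"
    )
}

# Example: check the archives before extracting them.
# verify_md5 ~/anacapa/crux_db/NCBI_blast_nt && echo "All checksums OK"
```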
Download the NCBI accession2taxonomy database
Download and decompress the database to a subdirectory called accession2taxonomy:
mkdir ~/anacapa/crux_db/accession2taxonomy
cd ~/anacapa/crux_db/accession2taxonomy
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
gzip -d nucl_gb.accession2taxid.gz
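The decompressed file is a tab-separated table with four columns (accession, accession.version, taxid, gi), which is how CRUX can link a sequence accession to an NCBI taxonomy ID. To illustrate the mapping, a single accession could be looked up as sketched below; `taxid_for` is a hypothetical helper, and because the real file is very large a lookup like this may take several minutes.

```shell
# Print the taxid for one accession.version from an accession2taxid file.
# Usage: taxid_for ACCESSION.VERSION FILE
taxid_for() {
    # Column 2 is accession.version, column 3 is the taxid.
    # 'exit' stops reading as soon as the first match is printed.
    awk -F '\t' -v acc="$1" '$2 == acc { print $3; exit }' "$2"
}

# Example:
# taxid_for A00001.1 ~/anacapa/crux_db/accession2taxonomy/nucl_gb.accession2taxid
```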
Download the EMBL std nucleotide database files
The FTP location of the EMBL databases given in the CRUX documentation is incorrect.
However, section 7 of the EMBL release notes lists all the database names:
ftp://ftp.ebi.ac.uk/pub/databases/embl/release/doc/relnotes.txt
The CRUX documentation says we need the standard (std) sequences, and for this example guide we are looking at mammals. Two mammal std nucleotide database files are listed:
rel_std_mam_01_r143.dat
and
rel_std_mam_02_r143.dat
These files are hosted (gzipped) here:
https://www.funet.fi/pub/sci/molbio/embl_release/std/
The EMBL std nucleotide databases for the other (non-mammalian) taxonomic groups are also hosted at this site.
Download and decompress these databases to a subdirectory called EMBL:
mkdir ~/anacapa/crux_db/EMBL
cd ~/anacapa/crux_db/EMBL
wget https://www.funet.fi/pub/sci/molbio/embl_release/std/rel_std_mam_01_r143.dat.gz
wget https://www.funet.fi/pub/sci/molbio/embl_release/std/rel_std_mam_02_r143.dat.gz
gzip -d rel_std_mam_01_r143.dat.gz
gzip -d rel_std_mam_02_r143.dat.gz
Again, in this guide we’re just looking at mammal sequences. If you’re looking at another taxonomic group, you’ll need to download the appropriate databases. Below are the codes for the available EMBL taxonomic groups.
Division               Code
---------------------  ----------------------------------------
Bacteriophage          PHG - common
Environmental Sample   ENV - common
Fungal                 FUN - map to PLN (plants + fungal)
Human                  HUM - map to PRI (primates)
Invertebrate           INV - common
Other Mammal           MAM - common
Other Vertebrate       VRT - common
Mus musculus           MUS - map to ROD (rodent)
Plant                  PLN - common
Prokaryote             PRO - map to BCT (poor name)
Other Rodent           ROD - common
Synthetic              SYN - common
Transgenic             TGN - ??? map to SYN ???
Unclassified           UNC - map to UNK
Viral                  VRL - common
So if, for example, you are looking at all vertebrates (other than human), you would download all the database files beginning with 'rel_std_vrt', or for plants you’d download all files beginning with 'rel_std_pln', etc.
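For groups with several files, the download URLs can be built in a loop rather than typed one by one. The sketch below infers the file name pattern from the two mammal files used in this guide (division code, two-digit file number, release r143); `embl_std_url` is a hypothetical helper, and the number of files per division is not stated here, so check the directory listing on the site first.

```shell
# Build the download URL for one EMBL std file.
# Usage: embl_std_url DIVISION FILE_NUMBER [RELEASE]
# Pass FILE_NUMBER without a leading zero (1, not 01).
embl_std_url() {
    printf 'https://www.funet.fi/pub/sci/molbio/embl_release/std/rel_std_%s_%02d_r%s.dat.gz\n' \
        "$1" "$2" "${3:-143}"
}

# Example: download both mammal files used in this guide.
# for i in 1 2; do wget "$(embl_std_url mam "$i")"; done
```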
Convert downloaded databases to ecoPCR format
To run CRUX, the NCBI and EMBL nucleotide databases need to first be converted to ecoPCR format, using the obiconvert command.
First create directories to output these databases:
mkdir -p ~/anacapa/crux_db/Obitools_databases/OB_dat_120322_EMBL_std_mam
The naming of these directories is important, as the CRUX script automatically looks in the /crux_db/Obitools_databases directory for any databases beginning with OB_dat_.
Run the obiconvert command from the anacapa Singularity image.
singularity exec ~/anacapa/anacapa-1.5.0.img obiconvert \
  -t ~/anacapa/crux_db/TAXO --embl \
  --ecopcrdb-output="$HOME/anacapa/crux_db/Obitools_databases/OB_dat_120322_EMBL_std_mam/OB_dat_120322_EMBL_std_mam" \
  ~/anacapa/crux_db/EMBL/*.dat --skip-on-error
This uses the NCBI taxonomy database (downloaded to ~/anacapa/crux_db/TAXO) and the EMBL database files (downloaded to ~/anacapa/crux_db/EMBL/*.dat). It writes the converted ecoPCR database to ~/anacapa/crux_db/Obitools_databases/OB_dat_120322_EMBL_std_mam/ and prefixes the generated ecoPCR database files with OB_dat_120322_EMBL_std_mam.
If you have downloaded and extracted all the databases in the correct directories you should now see obiconvert running with the following messages:
Reading taxonomy dump file...
List all taxonomy rank...
Indexing taxonomy...
Indexing parent and rank...
Adding scientific name...
Adding taxid alias...
Adding deleted taxid...
....
During initial testing on the mammal EMBL databases, this took about 8 hours to complete. Note that a PBS interactive session has a maximum time limit of 12 hours (and we requested 11 hours when we started our session). If you are working with a larger dataset (e.g. vertebrates or invertebrates), this process may take longer than an interactive session can run. In that case you will need to submit the above obiconvert command as a PBS script (again, see HPC for instructions on how to do this).
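As a rough illustration of what such a PBS script might look like, the sketch below writes a job file that wraps the same obiconvert command. The job name, the 48-hour walltime and the single CPU are assumptions chosen for illustration, as is the idea that `singularity` is available on the compute node without loading a module; adjust these to your data and to the HPC documentation. `write_obiconvert_job` is a hypothetical helper name.

```shell
# Write a PBS job script that runs the obiconvert step.
# Usage: write_obiconvert_job OUTPUT_FILE
write_obiconvert_job() {
    # The quoted 'EOF' stops $HOME being expanded now; it is expanded
    # later, when the job itself runs.
    cat > "$1" <<'EOF'
#!/bin/bash
#PBS -N obiconvert_mam
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=1:mem=64gb

singularity exec "$HOME/anacapa/anacapa-1.5.0.img" obiconvert \
    -t "$HOME/anacapa/crux_db/TAXO" --embl \
    --ecopcrdb-output="$HOME/anacapa/crux_db/Obitools_databases/OB_dat_120322_EMBL_std_mam/OB_dat_120322_EMBL_std_mam" \
    "$HOME"/anacapa/crux_db/EMBL/*.dat --skip-on-error
EOF
}

# Example: create the job file and submit it to the queue.
# write_obiconvert_job ~/anacapa/obiconvert.pbs && qsub ~/anacapa/obiconvert.pbs
```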
Step 4: Running CRUX
Once you have downloaded and converted the required databases (section above), you can run CRUX.
CRUX generates taxonomic reference libraries by querying your primers against the ecoPCR database you generated in Step 3. Anacapa then uses these libraries for taxonomic assignment of your sequences.
Cleanup
Running the anacapa workflow involves downloading and generating various large databases. Unless removed, these will just take up space on the HPC.
If you will be running more samples against these databases in the near future you can retain them. Otherwise they should be removed.
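One cautious way to do the cleanup is sketched below. `cleanup_crux_db` is a hypothetical helper: by default it only reports how much space a directory uses, and it deletes the directory only when `--delete` is given explicitly. Deletion is permanent, so check the path carefully first.

```shell
# Report the size of a database directory and optionally remove it.
# Usage: cleanup_crux_db DIR [--delete]
cleanup_crux_db() {
    if [ ! -d "$1" ]; then
        echo "No such directory: $1" >&2
        return 1
    fi
    du -sh "$1"                      # show how much space the directory uses
    if [ "$2" = "--delete" ]; then
        rm -r -- "$1"
        echo "Removed $1"
    else
        echo "Dry run: re-run with --delete to remove $1"
    fi
}

# Example: see the size of the NCBI nt download, then remove it.
# cleanup_crux_db ~/anacapa/crux_db/NCBI_blast_nt
# cleanup_crux_db ~/anacapa/crux_db/NCBI_blast_nt --delete
```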