...

Anacapa uses many tools, which would be difficult and time-consuming to install individually on the HPC. Fortunately, the developers of Anacapa have created a Singularity image that contains all the required tools. Once the image is downloaded, all the standard tools and commands in the Anacapa guide can be run by prefixing them with ‘singularity exec anacapa-1.5.0.img’, which runs the subsequent command inside the Singularity container.

...

Code Block
cd ~/anacapa
wget https://zenodo.org/record/2602180/files/anacapa-1.5.0.img
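
Once the image is in place, any tool bundled inside it can be run with the ‘singularity exec’ prefix described above. As a quick sanity check (obiconvert is used later in this guide, so it should be available in the image):

Code Block
# The command given after the image name runs inside the container rather than on the host
singularity exec ~/anacapa/anacapa-1.5.0.img obiconvert --help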

Step 3: Create reference libraries using CRUX

CRUX (Creating-Reference-libraries-Using-eXisting-tools) generates taxonomic reference libraries by querying your primers against ecoPCR-formatted databases, which you will build later in this section. Anacapa then uses these libraries for taxonomic assignment of your sequences.

Anacapa contains several pre-built databases, based on defined primer sets, which are listed in the ‘High level overview’ section of the Anacapa GitHub page: GitHub - Anacapa.

If you are using a set of primers that isn’t on this list, you’ll need to construct your own CRUX database by following this guide.

For this guide we will be using eDNA sequences amplified by the 16Smam primer pair:

16S701F 5′-CGGTTGGGGTGACCTCGGA-3′

16S787R 5′-GCTGTTATCCCTAGGGTAACT-3′

These primers were developed to amplify mammal sequences (an important point, as you will need to download the EMBL databases that correspond to the taxonomic group you’re interested in).

To run CRUX you first need to download and set up four databases: 1) the NCBI taxonomy, 2) the NCBI BLAST nt library, 3) the NCBI accession2taxonomy mapping, and 4) the EMBL std nucleotide database for your taxonomic group of interest.

First, create the directory to hold these databases:

Code Block
mkdir ~/anacapa/crux_db
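
For reference, by the end of this step the crux_db directory will contain the following subdirectories (a rough sketch only; the names are those used in this guide):

Code Block
crux_db/
├── TAXO                  # NCBI taxonomy dump
├── NCBI_blast_nt         # NCBI BLAST nt library
├── accession2taxonomy    # NCBI accession-to-taxid mapping
├── EMBL                  # EMBL std nucleotide flat files
└── Obitools_databases/
    └── OB_dat_EMBL_std   # ecoPCR-formatted database built by obiconvert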

Download NCBI taxonomy database

Download and decompress the database to a subdirectory called TAXO:

Code Block
mkdir ~/anacapa/crux_db/TAXO
cd ~/anacapa/crux_db/TAXO
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzvf taxdump.tar.gz

Download the NCBI nt library

Download and decompress the database to a subdirectory called NCBI_blast_nt:

Code Block
mkdir ~/anacapa/crux_db/NCBI_blast_nt
cd ~/anacapa/crux_db/NCBI_blast_nt
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt*
for file in nt*.tar.gz; do tar -zxf $file; done

*NOTE: This is the full NCBI nucleotide database and it is VERY large (~170 GB). In the future, eResearch will make available a centralised, frequently updated copy of it on the HPC that all researchers can access, so it doesn’t have to be downloaded multiple times. In the meantime, you can download it to ~/anacapa/crux_db/NCBI_blast_nt, but please delete it once you have completed your Anacapa analysis; we don’t want multiple copies of the same database on the HPC.
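
Because this download is so large, it is worth verifying the tarballs before extracting them. NCBI publishes an .md5 checksum file alongside each nt tarball; assuming the wildcard download above picked those up as well, they can be checked like this:

Code Block
cd ~/anacapa/crux_db/NCBI_blast_nt
# Each nt.XX.tar.gz should have a matching nt.XX.tar.gz.md5 from the NCBI FTP site
for f in nt.*.tar.gz.md5; do md5sum -c "$f"; done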

Download the NCBI accession2taxonomy database

Download and decompress the database to a subdirectory called accession2taxonomy:

Code Block
mkdir ~/anacapa/crux_db/accession2taxonomy
cd ~/anacapa/crux_db/accession2taxonomy
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
gzip -d nucl_gb.accession2taxid.gz

Download the EMBL std nucleotide database files

The FTP location of the EMBL databases, as provided in the CRUX documentation, is incorrect.

But reading the EMBL database notes at …

ftp://ftp.ebi.ac.uk/pub/databases/embl/release/doc/relnotes.txt

… section 7 lists all the database names. The CRUX documentation says we need the standard (std) sequences, and for this example guide we are looking at mammals, for which there are two std nucleotide database files listed:

rel_std_mam_01_r143.dat

and

rel_std_mam_02_r143.dat

Searching the database names, I found them hosted (gzipped) here:

https://www.funet.fi/pub/sci/molbio/embl_release/std/

The EMBL std nucleotide databases for the other (i.e. non-mammalian) taxonomic groups are also hosted at this site.

Download and decompress these databases to a subdirectory called EMBL:

Code Block
mkdir ~/anacapa/crux_db/EMBL
cd ~/anacapa/crux_db/EMBL
wget https://www.funet.fi/pub/sci/molbio/embl_release/std/rel_std_mam_01_r143.dat.gz
wget https://www.funet.fi/pub/sci/molbio/embl_release/std/rel_std_mam_02_r143.dat.gz
gzip -d rel_std_mam_01_r143.dat.gz
gzip -d rel_std_mam_02_r143.dat.gz

Again, in this guide we’re just looking at mammal sequences. If you’re looking at another taxonomic group, you’ll need to download the appropriate databases. Below are the codes for the available EMBL taxonomic groups.

Code Block
Division                 Code
----------------         ------------------
Bacteriophage            PHG - common
Environmental Sample     ENV - common
Fungal                   FUN - map to PLN (plants + fungal)
Human                    HUM - map to PRI (primates)
Invertebrate             INV - common
Other Mammal             MAM - common
Other Vertebrate         VRT - common
Mus musculus             MUS - map to ROD (rodent)
Plant                    PLN - common
Prokaryote               PRO - map to BCT (poor name)
Other Rodent             ROD - common
Synthetic                SYN - common
Transgenic               TGN - ??? map to SYN ???
Unclassified             UNC - map to UNK
Viral                    VRL - common

So, if for example you are looking at all vertebrates (other than human), you would download all the database files beginning with ‘rel_std_vrt’; for plants you’d download all files beginning with ‘rel_std_pln’, and so on.
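
As an illustration, a whole group can be fetched with a wildcard download rather than listing each file individually. This is only a sketch: it assumes the funet.fi mirror allows directory listings and that the release number (r143 above) matches what is currently hosted, so check the directory in a browser first.

Code Block
cd ~/anacapa/crux_db/EMBL
# -r recurse, -np don't ascend to parent directories, -nd don't recreate the
# directory structure locally, -A only accept files matching the pattern
wget -r -np -nd -A 'rel_std_vrt*' https://www.funet.fi/pub/sci/molbio/embl_release/std/
gzip -d rel_std_vrt*.dat.gz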

Convert downloaded databases to ecoPCR format

To run CRUX, the downloaded EMBL nucleotide database files first need to be converted to ecoPCR format using the obiconvert command (which also uses the NCBI taxonomy dump downloaded earlier).

First create directories to output these databases:

Code Block
mkdir ~/anacapa/crux_db/Obitools_databases
mkdir ~/anacapa/crux_db/Obitools_databases/OB_dat_EMBL_std

The naming of these directories is important, as the CRUX script automatically looks in the crux_db/Obitools_databases directory for any databases whose names begin with OB_dat_.

Run the obiconvert command

Code Block
# Note: $HOME is used after the '=' sign because bash does not tilde-expand ~ there
singularity exec ~/anacapa/anacapa-1.5.0.img obiconvert \
    -t ~/anacapa/crux_db/TAXO \
    --embl \
    --ecopcrdb-output=$HOME/anacapa/crux_db/Obitools_databases/OB_dat_EMBL_std/OB_dat_EMBL_std \
    --skip-on-error \
    ~/anacapa/crux_db/EMBL/*.dat
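
The conversion can take some time for large inputs. Once it finishes, it is worth checking that the ecoPCR database files were actually written to the output directory (the exact file names and extensions depend on the obiconvert version, so this is just a simple listing check):

Code Block
ls -lh ~/anacapa/crux_db/Obitools_databases/OB_dat_EMBL_std/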

Cleanup

Running the Anacapa workflow involves downloading and generating several large databases, which will simply take up space on the HPC unless removed.

If you will be running more samples against these databases in the near future you can retain them; otherwise they should be removed, as in the example below.
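
For example, the largest item by far is the NCBI nt download (~170 GB), which eResearch has asked to be deleted after the analysis. A minimal cleanup might look like this (adjust the paths to whatever you actually want to keep; the reference libraries produced by CRUX are comparatively small and usually worth retaining):

Code Block
# Remove the full NCBI nt database once the Anacapa analysis is complete
rm -r ~/anacapa/crux_db/NCBI_blast_nt
# Optionally remove the raw EMBL flat files once the ecoPCR-formatted copies
# in Obitools_databases have been built, as CRUX reads from those
rm -r ~/anacapa/crux_db/EMBL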