Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Anacapa is a toolkit designed to construct reference databases and assign taxonomy, from eDNA sequences.

For more details on anacapa, please read though the anacapa Github page:

GitHub - Anacapa

Table of contents

Table of Contents
minLevel1
maxLevel7

Purpose of this guide

This guide is designed to step you though running your eDNA sequence data through the anacapa toolkit on QUTs HPC, as the published anacapa documentation on Github can be a bit hard to follow and needs some modification to work on the HPC.

...

An overview of HPC commands and usage, as well as a link for requesting access to the HPC (if you don’t currently have a HPC account) is here:

Start using the HPC

There are plenty of online guides that teach basic Linux command line usage, for example:

...

https://www.youtube.com/watch?v=s3ii48qYBxA

How to use this guide

In this guide, commands to be entered by the user will be in grey boxes like the one below. Most commands can simply be cut and paste ‘as-is’ into your command line. Some need to be modified due to variations in your data (e.g. target species) or location.

You can hover your mouse over the code box to see a ‘copy’ button on the right. Just click this to copy all the code in the box.

Try this with the code box below (this will show the directory paths defined by your profile).

Code Block
languagebash
echo $PATH

Step 1: initial setup

You will be running various processes on the HPC that require quite a lot of processing power. Do not run these command on the 'head node' (which is the node you enter when you log on). Instead, either submit these command commands via a PBS script or an interactive PBS session, which runs your processes on another node.

The details of creating and submitting a PBS script can be found here:

Start using the HPC

If you’re testing several tools or running multiple separate commands then an interactive PBS session may be preferable. Below is the command to create an interactive PBS session with 8 CPUs and 64GB memory.*Note: In this guide, commands to be entered by the user will be in grey boxes like the one below. As with most commands, you can simply cut and paste this into your command line., 64GB memory and a maximum running time of 11 hours (12 hours is the absolute maximum that can be requested for an interactive session).

Code Block
qsub -I -S /bin/bash -l walltime=11:00:00 -l select=1:ncpus=8:mem=64gb

...

From your home directory, create a subdirectory called ‘anacapa’ and enter this subdirectory. This guide has commands that assume your

Code Block
cd ~
mkdir anacapa
cd anacapa

...

Anacapa uses many tools, which would be difficult and time consuming to install all of them on the HPC. Fortunately, the developers of Anacapa have created a Singularity image that contains all the required tools. Once the image is downloaded, all the standard tools and commands in the Anacapa guide can be run by prefixing them with ‘singularity singularity exec anacapa-1.5.0.img’ imgwhich runs the subsequent command in the singularity container.

...

Code Block
cd ~/anacapa
wget https://zenodo.org/record/2602180/files/anacapa-1.5.0.img

Step 3: Create reference libraries using CRUX

CRUX (Creating-Reference-libraries-Using-eXisting-tools) generates taxonomic reference libraries by querying your primers against an ecoPCR database. The purpose of Step 3 is to download the required databases and then use them to generate this ecoPCR database.

Anacapa contains several pre-built ecoPCR databases, based on defined primer sets, which can be seen in the ‘High level overview’ section on the anacapa page: GitHub - Anacapa.

If you are using a set of primers that aren’t on this list you’ll need to construct your own ecoPCR database, by following this guide.

For this guide we will be using eDNA sequences amplified by the 16Smam primer pair:

16S701F 5′-CGGTTGGGGTGACCTCGGA-3′

16S787R 5′-GCTGTTATCCCTAGGGTAACT-3′

These primers were developed to amplify mammal sequences (which is an important point, as you will download the EMBL databases that correspond the taxonomic group you’re interested in).

To run CRUX you need to first download and setup 4 databases: 1) NCBI taxonomy, 2) NCBI BLAST nt library, 3) NCBI accession2taxonomy, 4) EMBL std nucleotide database (for your taxonomic group of interest).

First, create the directory to hold these databases:

Code Block
mkdir ~/anacapa/crux_db

Download NCBI taxonomy database

Download and decompress the database to a subdirectory called TAXO:

Code Block
mkdir ~/anacapa/crux_db/TAXO
cd ~/anacapa/crux_db/TAXO
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzvf taxdump.tar.gz

Download the NCBI nt library

Download and decompress the database to a subdirectory called NCBI_blast_nt:

Code Block
mkdir ~/anacapa/crux_db/NCBI_blast_nt
cd ~/anacapa/crux_db/NCBI_blast_nt
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt*
for file in nt*.tar.gz; do tar -zxf $file; done

*NOTE: This is the full NCBI nucleotide database. It is VERY large (~170GB). In the future eResearch will be making available a centralised, frequently updated copy of this on the HPC that all researchers can access, so it doesn’t have to be downloaded multiple times. In the mean time, you can download it to ~/anacapa/crux_db/NCBI_blast_nt and then please delete the database once you have completed your anacapa analysis. We don’t want multiple copies of this same database on the HPC.

Download the NCBI accession2taxonomy database

Download and decompress the database to a subdirectory called accession2taxonomy:

Code Block
mkdir ~/anacapa/crux_db/accession2taxonomy
cd ~/anacapa/crux_db/accession2taxonomy
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
gzip -d nucl_gb.accession2taxid.gz

Download the EMBL std nucleotide database files

The FTP location of the EMBL databases, as provided in the CRUX documentation, is incorrect.

But reading the EMBL database notes at …

ftp://ftp.ebi.ac.uk/pub/databases/embl/release/doc/relnotes.txt

… in section 7 it lists all the database names. The CRUX documentation says we need standard sequences, and for this example guide we are looking at mammals. In which case there are two mammal std nucleotide database files listed:

rel_std_mam_01_r143.dat

and

rel_std_mam_02_r143.dat

Searching the database names, I found them hosted (gzipped) here:

https://www.funet.fi/pub/sci/molbio/embl_release/std/

The other (i.e. other than mammalian) EMBL std nucleotide taxonomic databases are also at this site.

Download and decompress these databases to a subdirectory called EMBL:

Code Block
mkdir ~/anacapa/crux_db/EMBL
cd ~/anacapa/crux_db/EMBL
wget https://www.funet.fi/pub/sci/molbio/embl_release/std/rel_std_mam_01_r143.dat.gz
wget https://www.funet.fi/pub/sci/molbio/embl_release/std/rel_std_mam_02_r143.dat.gz
gzip -d rel_std_mam_01_r143.dat.gz
gzip -d rel_std_mam_02_r143.dat.gz
Code Block
#alternatively download all EMBL files
wget https://www.funet.fi/pub/sci/molbio/embl_release/std/rel*

#uncompress all downloaded files
for i in `ls rel*`; do echo $i; gzip -d $i; done

Again, in this guide we’re just looking at mammal sequences. If you’re looking at another taxonomic group, you’ll need to download the appropriate databases. below are the codes for the available EMBL taxonomic groups.

Code Block
Division                 Code
----------------         ------------------
Bacteriophage            PHG - common
Environmental Sample     ENV - common
Fungal                   FUN - map to PLN (plants + fungal)
Human                    HUM - map to PRI (primates)
Invertebrate             INV - common
Other Mammal             MAM - common
Other Vertebrate         VRT - common
Mus musculus             MUS - map to ROD (rodent)
Plant                    PLN - common
Prokaryote               PRO - map to BCT (poor name)
Other Rodent             ROD - common
Synthetic                SYN - common
Transgenic               TGN - ??? map to SYN ???
Unclassified             UNC - map to UNK
Viral                    VRL - common

So, if for example you are looking at all vertebrates (other than human), you would download all the database files beginning with ‘rel_std_vrt' or for plants you’d download all 'rel_std_pln' etc.

Convert downloaded databases to ecoPCR format

To run CRUX, the NCBI and EMBL nucleotide databases need to first be converted to ecoPCR format, using the obiconvertcommand.

First create directories to output these databases:

Code Block
mkdir -p ~/anacapa/crux_db/Obitools_databases/OB_dat_EMBL_std

The naming of these directories is important, as the CRUX script automatically looks in the /crux_db/Obitools_databases directory for any databases beginning with OB_dat_.

Run the obiconvert command from the anacapa Singularity image.

Important: You need to change every instance of /home/your_home_directory in the below command to your actual home directory (this is because obiconvert requires absolute paths). To find your home directory path, type cd ~ and then pwd. Use the path that this displays to replace the /home/your_home_directory.

Code Block
singularity exec /home/your_home_directory/anacapa/anacapa-1.5.0.img obiconvert -t /home/your_home_directory/anacapa/crux_db/TAXO --embl --ecopcrdb-output=/home/your_home_directory/anacapa/crux_db/Obitools_databases/OB_dat_EMBL_std/OB_dat_EMBL_std /home/your_home_directory/anacapa/crux_db/EMBL/*.dat --skip-on-error

MAKE SURE THE OUTPUT DIRECTORY IS EMPTY (--ecopcrdb-output= ...). If you’ve previously run this obiconvert command (as a test, or if it failed) using this same output directory, there may be some leftover files in there, in which case obiconvert won’t overwrite them, but will sequentially add to the database.

The above obiconvert command uses the NCBI taxonomy database (downloaded to ~/anacapa/crux_db/TAXO) and the EMBL database (downloaded to ~/anacapa/crux_db/EMBL/*.dat) and it outputs the ecoPCR converted database to /Obitools_databases/OB_dat_EMBL_std/ and prepends the generated ecoPCR database files with OB_dat_EMBL_std....

If you have downloaded and extracted all the databases in the correct directories you should now see obiconvert running with the following messages:

Code Block
Reading taxonomy dump file...
List all taxonomy rank...
Indexing taxonomy...
Indexing parent and rank...
Adding scientific name...
Adding taxid alias...
Adding deleted taxid...
....

During initial testing on the mammal EMBL databases, this took about 8 hours to complete. Note that a PBS interactive session has a maximum time limit of 12 hours (and we requested 11 hours when we started our session). If you are working with a larger dataset - e.g. vertebrates or invertebrates - this process may take much longer, and in fact longer than an interactive session will run, requiring you to submit the above obiconvert command as a PBS script (again, see Start using the HPC for instructions on how to do this).

An example PBS script for running obitools can be seen below.

Code Block
#!/bin/bash -l
#PBS -N ObiRun
#PBS -l select=1:ncpus=2:mem=64gb
#PBS -l walltime=96:00:00

cd $PBS_O_WORKDIR

singularity exec /home/your_home_directory/anacapa/anacapa-1.5.0.img \
obiconvert \
-t /home/your_home_directory/anacapa/crux_db/TAXO \
--embl \
--ecopcrdb-output=/home/your_home_directory/anacapa/crux_db/Obitools_databases/OB_dat_EMBL_std/OB_dat_EMBL_std \
--skip-on-error \
/home/your_home_directory/anacapa/crux_db/EMBL/*.dat

As before, you’ll need to change the above directory locations to match where your singularity image is, your taxonomy database, your output directory and your EMBL database.

To create this script you can use a text editor like nano. In your HPC command line, type:

Code Block
module load nano

To load nano, then type:

Code Block
nano launch.pbs

This will create an empty PBS script file called ‘launch.pbs’. Copy and paste the PBS script text from the code block above into nano, then press control and o to save the file, then control and x to exit nano.

Now you can launch this as a PBS job by typing:

Code Block
qsub launch.pbs

Your job will be added to the queue, so may take some time to start if there are many jobs queued. You can check the status of your jobs by typing:

Code Block
qstat -u <username>

Change <username> to your own user (logon) name.

Step 4: Running CRUX

Once you have downloaded and converted the required databases (section above), you can run CRUX.

CRUX generates taxonomic reference libraries by querying your primers against an ecoPCR database you generated in Step 3. Anacapa then uses these libraries for taxonomic assignment of your sequences.

Example command:

Code Block
/bin/bash ~/Crux/crux_db/crux.sh -n 12S -f GTCGGTAAAACTCGTGCCAGC -r CATAGTGGGGTATCTAATCCCAGTTTG -s 80 -m 280 -o ~/Crux/crux_db/12S -d ~/Crux/crux_db -l

The -s and -m parameters indicate the shortest and longest expected amplicons respectively. -n is the name of the primer set. -f and -r are the forward and reverse primers. -o is the output directory and -d is the directory location containing subdirectories of the CRUX databases you generated previously (NCBI taxonomy, obiconvert results, NCBI accession2taxonomy) .

See here for more details on what tools and steps are run in this section:

GitHub - limey-bean/CRUX_Creating-Reference-libraries-Using-eXisting-tools

Create a subdirectory, under your main anacapa directory, to output the Anacapa results. In this example we’re running the test 16S mammal primers/databases, so we’ll call the output directory ‘16mam’. Change this to a name suitable for your dataset (and also change 'your_home_directory' to your actual home directory).

Code Block
cd /home/your_home_directory/anacapa
mkdir 16Smam

Modifying your CRUX config script

You’ll need to change some lines in your ‘crux_config.sh’ file, so that this config file points to the correct locations of tools and databases.

Your ‘crux_config.sh’ file should be in: /home/your_home_directoryAnacapa/CRUX_Creating-Reference-libraries-Using-eXisting-tools/crux_db/scripts

I have attached a copy of a working crux_config.sh file below.

View file
namecrux_config.sh

First, back up your current crux_config.sh file like so (make sure you’re in the directory where the ‘crux_config.sh’ file is):

Code Block
mv crux_config.sh crux_config.sh_bak

Now copy the above attached crux_config.sh file to that directory.

There is one line you’ll still need to manually modify. Open crux_config.sh in nano:

Code Block
nano crux_config.sh

And change the BLAST_DB="/home/whatmorp/nextflow/pia_eDNAFlow/db/nt" line to where you downloaded your NCBI nt library (see the ‘Download the NCBI nt library’ section in this guide).

Control-o to save the file and control-x to exit nano.

Running CRUX

We have added all the required tools to a singularity image, so run the CRUX command using this singularity image.

Code Block
singularity exec docker://ghcr.io/eresearchqut/anacapa-image:0.0.3 /bin/bash /home/your_home_directory/anacapa/crux_db/crux.sh -n 16Smam -f CGGTTGGGGTGACCTCGGA -r GCTGTTATCCCTAGGGTAACT -s 40 -m 240 -o /home/your_home_directory/anacapa/16Smam -d /home/your_home_directory/anacapa/crux_db/ -l

As before, change all instances of 'your_home_directory' to your actual home directory.

Change the name (-n) to your primer set name. This should be the same name as your output directory.

Change your -f and -r to your primer sequences.

Change -s and -m to Find the expected length of your amplicons (should be in the literature associated with your primer set) and make -s 100bp smaller and -m 100bp longer than this length. E.g. our 16Smam test primers amplify a product approx. 140bp in length, so we use -s 40 -m 240

Change the -o output directory to the directory location you created at the start of this section.

The -d should point to where your CRUX databases are. Check in this directory. You should see subdirectories containing NCBI taxonomy, obiconvert results, NCBI accession2taxonomy databases (see ‘Step 3: Create reference libraries using CRUX’ to see where you created these databases).

Code Block
#!/bin/bash -l
#PBS -N CRUX
#PBS -l select=1:ncpus=2:mem=64gb
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR

singularity exec docker://ghcr.io/eresearchqut/anacapa-image:0.0.3 /bin/bash \
/home/your_home_directory/anacapa/crux_db/crux.sh \
-n 16Smam -f CGGTTGGGGTGACCTCGGA -r GCTGTTATCCCTAGGGTAACT \
-s 40 -m 240 \
-o /home/your_home_directory/anacapa/16Smam \
-d /home/your_home_directory/anacapa/crux_db/ -l

Step 5: Running anacapa

Now that the CRUX databases have been constructed, we can run anacapa itself on these databases.

This constitutes 2 steps (steps 5 and 6 in this guide).

GitHub - limey-bean/Anacapa

First (this section, section 5) we run:

..sequence QC and generate amplicon sequence variants (ASV) from Illumina data using dada2 (Callahan et al. 2016). ASVs are a novel solution to identifying biologically informative unique sequences in metabarcoding samples that replaces the operational taxonomic unit (OTU) framework. Unlike OTUs, which cluster sequences using an arbitrary sequence similarity (ex 97%), ASVs are unique sequence reads determined using Bayesian probabilities of known sequencing error. These unique sequences can be as little as 2 bp different, providing improved taxonomic resolution and an increase in observed diversity. Please see (Callahan et al. 2016, Amir et al. 2017) for further discussion.

...

Example anacapa script:

Code Block
/bin/bash ~/Anacapa_db/anacapa_QC_dada2.sh -i <input_dir> -o <out_dir> -d <database_directory> -a <adapter type (nextera or truseq)> -t <illumina run type HiSeq or MiSeq> -l

Required arguments:

-i      path to .fastq.gz files, if files are already uncompressed use -g

-o      path to output directory

-d      path to the CRUX database you generated in the previous section..

-a      Illumina adapter type: nextera, truseq, or NEBnext

-t     Illumina Platform: HiSeq (2 x 150) or MiSeq (>= 2 x 250)

Code Block
singularity exec /home/whatmorp/nextflow/pia_eDNAFlow/Anacapa/anacapa-1.5.0.img /bin/bash /home/whatmorp/nextflow/pia_eDNAFlow/Anacapa/anacapa/Anacapa_db/anacapa_QC_dada2.sh -i /home/whatmorp/nextflow/pia_eDNAFlow/fastq -o /home/whatmorp/nextflow/pia_eDNAFlow/Anacapa/16Smam_anacapa_output -d /home/whatmorp/nextflow/pia_eDNAFlow/Anacapa/anacapa/Anacapa_db -a nextera -t MiSeq -g -l

Step 6: Running anacapa classifier

Example:

Code Block
/bin/bash ~/Anacapa_db/anacapa_classifier.sh -o <out_dir_for_anacapa_QC_run> -d <database_directory> -u <hoffman_account_user_name> -l

Required Arguments:

        -o      path to output directory generated in the Sequence QC and ASV Parsing script

        -d      path to Anacapa_db

Code Block
singularity exec /home/whatmorp/nextflow/pia_eDNAFlow/Anacapa/anacapa-1.5.0.img /bin/bash /home/whatmorp/nextflow/pia_eDNAFlow/Anacapa/anacapa/Anacapa_db/anacapa_classifier.sh -o /home/whatmorp/nextflow/pia_eDNAFlow/Anacapa/16Smam_anacapa_output -d /home/whatmorp/nextflow/pia_eDNAFlow/Anacapa/anacapa/Anacapa_db -l

chmod 777 /home/whatmorp/nextflow/pia_eDNAFlow/Anacapa/16Smam_anacapa_output/Run_info/run_scripts/16Smam_bowtie2_blca_job.sh

Cleanup

Running the anacapa workflow involves downloading and generating various large databases. These will just take up space on the HPC unless removed.

If you will be running more samples on these databases in the near future you can retain them, otherwise they should be removed.