Anacapa is a toolkit designed to construct reference databases and assign taxonomy from eDNA sequences.

...

An overview of HPC commands and usage, as well as a link for requesting access to the HPC (if you don’t currently have an HPC account), is here:

Start using the HPC

There are plenty of online guides that teach basic Linux command line usage, for example:

...

The details of creating and submitting a PBS script can be found here:

Start using the HPC

If you’re testing several tools or running multiple separate commands then an interactive PBS session may be preferable. Below is the command to create an interactive PBS session with 8 CPUs, 64GB memory and a maximum running time of 11 hours (12 hours is the absolute maximum that can be requested for an interactive session).

...

Code Block
mkdir -p ~/anacapa/crux_db/EMBL
cd ~/anacapa/crux_db/EMBL
wget https://www.funet.fi/pub/sci/molbio/embl_release/std/rel_std_mam_01_r143.dat.gz
wget https://www.funet.fi/pub/sci/molbio/embl_release/std/rel_std_mam_02_r143.dat.gz
gzip -d rel_std_mam_01_r143.dat.gz
gzip -d rel_std_mam_02_r143.dat.gz

Code Block
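# Alternatively, download all EMBL files. wget only expands wildcards for FTP URLs,
# so over HTTPS use a recursive fetch restricted to file names starting with "rel".
wget -r -l1 -np -nd -A 'rel*' https://www.funet.fi/pub/sci/molbio/embl_release/std/

# Uncompress all downloaded files
for i in rel*.gz; do echo "$i"; gzip -d "$i"; done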
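
Large downloads can be cut short, so it may be worth testing each archive before decompressing it. The sketch below uses gzip -t, which checks integrity without writing any output. The function name check_archives is made up for this example, and the rel*.gz pattern is assumed from the download commands above.

```shell
#!/bin/bash
# check_archives is a hypothetical helper, not part of Anacapa or CRUX.
# It tests every rel*.gz file in a directory and reports the ones that fail.
check_archives() {
    local dir="$1" file failed=0
    for file in "$dir"/rel*.gz; do
        [ -e "$file" ] || continue          # pattern matched nothing
        if gzip -t "$file" 2>/dev/null; then
            echo "OK      $file"
        else
            echo "CORRUPT $file"
            failed=1
        fi
    done
    return "$failed"
}

# Example: check the current directory before running gzip -d.
check_archives . || echo "Re-download the files marked CORRUPT" >&2
```

Any file reported as CORRUPT would need to be downloaded again before it is uncompressed.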

Again, in this guide we’re just looking at mammal sequences. If you’re looking at another taxonomic group, you’ll need to download the appropriate databases. Below are the codes for the available EMBL taxonomic groups.

Code Block
Division                 Code
----------------         ------------------
Bacteriophage            PHG - common
Environmental Sample     ENV - common
Fungal                   FUN - map to PLN (plants + fungal)
Human                    HUM - map to PRI (primates)
Invertebrate             INV - common
Other Mammal             MAM - common
Other Vertebrate         VRT - common
Mus musculus             MUS - map to ROD (rodent)
Plant                    PLN - common
Prokaryote               PRO - map to BCT (poor name)
Other Rodent             ROD - common
Synthetic                SYN - common
Transgenic               TGN - ??? map to SYN ???
Unclassified             UNC - map to UNK
Viral                    VRL - common
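
To fetch a different division, the file names in the mammal example suggest a pattern that can be generated from the division code. The sketch below is only that: embl_urls is a hypothetical helper, the rel_std_<code>_<part>_r<release>.dat.gz naming is inferred from the two mammal URLs in this guide, and the number of parts differs per division, so check the server listing before relying on it.

```shell
#!/bin/bash
# embl_urls is a hypothetical helper, not part of Anacapa or CRUX.
# Arguments: division code (e.g. MAM), number of parts, release (default 143).
# Assumption: files are named rel_std_<code>_<part>_r<release>.dat.gz.
embl_urls() {
    local division="$1" parts="$2" release="${3:-143}"
    local base="https://www.funet.fi/pub/sci/molbio/embl_release/std"
    local code part
    code=$(printf '%s' "$division" | tr '[:upper:]' '[:lower:]')
    for part in $(seq 1 "$parts"); do
        printf '%s/rel_std_%s_%02d_r%s.dat.gz\n' "$base" "$code" "$part" "$release"
    done
}

# Example: print the two mammal URLs used earlier in this guide.
embl_urls MAM 2
```

The printed list can be piped to wget with `embl_urls MAM 2 | wget -i -`, which reads URLs from standard input.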

...

Run the obiconvert command from the anacapa Singularity image.

Important: You need to change every instance of /home/your_home_directory in the below command to your actual home directory (this is because obiconvert requires absolute paths). To find your home directory path, type cd ~ and then pwd. Use the path that this displays to replace the /home/your_home_directory.

Code Block
singularity exec /home/your_home_directory/anacapa/anacapa-1.5.0.img obiconvert -t /home/your_home_directory/anacapa/crux_db/TAXO --embl --ecopcrdb-output=/home/your_home_directory/anacapa/crux_db/Obitools_databases/OB_dat_EMBL_std/OB_dat_EMBL_std /home/your_home_directory/anacapa/crux_db/EMBL/*.dat --skip-on-error
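
Rather than editing each path by hand, the placeholder can be substituted with sed. This is a minimal sketch: fill_home is a hypothetical helper, and it assumes your home directory path contains no | or & characters (both have special meaning in the sed expression used here).

```shell
#!/bin/bash
# fill_home is a hypothetical helper: it reads text on standard input and
# replaces the placeholder used in this guide with a real absolute path.
fill_home() {
    local home_dir="$1"
    sed "s|/home/your_home_directory|${home_dir}|g"
}

# Resolve the home directory the same way as "cd ~" followed by "pwd".
my_home=$(cd ~ 2>/dev/null && pwd) || my_home="$HOME"

# Example: show one fragment of the command with the real path filled in.
echo 'obiconvert -t /home/your_home_directory/anacapa/crux_db/TAXO --embl' | fill_home "$my_home"
```

The same function can be applied to a whole saved script, for example `fill_home "$my_home" < template.sh > job.sh`, where both file names are placeholders.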

MAKE SURE THE OUTPUT DIRECTORY IS EMPTY (--ecopcrdb-output= ...). If you’ve previously run this obiconvert command (as a test, or if it failed) using this same output directory, there may be some leftover files in there. obiconvert won’t overwrite them; it will instead add sequentially to the existing database.
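
The empty-directory requirement can be checked before a long run is started. The sketch below is one way to do it; dir_is_empty is a hypothetical helper, and the output path is the one used in this guide with $HOME standing in for your home directory.

```shell
#!/bin/bash
# dir_is_empty is a hypothetical helper, not part of OBITools.
# It succeeds (exit status 0) if the path does not exist yet or is a
# directory with no entries, including hidden ones.
dir_is_empty() {
    local dir="$1"
    [ ! -e "$dir" ] && return 0
    [ -d "$dir" ] && [ -z "$(ls -A "$dir")" ]
}

out_dir="$HOME/anacapa/crux_db/Obitools_databases/OB_dat_EMBL_std"
if dir_is_empty "$out_dir"; then
    echo "Output directory is empty: safe to run obiconvert"
else
    echo "Leftover files found in $out_dir: remove them first" >&2
fi
```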

The above obiconvert command uses the NCBI taxonomy database (downloaded to ~/anacapa/crux_db/TAXO) and the EMBL database (downloaded to ~/anacapa/crux_db/EMBL/*.dat). It writes the ecoPCR-converted database to /Obitools_databases/OB_dat_EMBL_std/ and prefixes the generated ecoPCR database file names with OB_dat_EMBL_std.

...

During initial testing on the mammal EMBL databases, this took about 8 hours to complete. Note that a PBS interactive session has a maximum time limit of 12 hours (and we requested 11 hours when we started our session). If you are working with a larger dataset - e.g. vertebrates or invertebrates - this process may take much longer, and in fact longer than an interactive session will run, requiring you to submit the above obiconvert command as a PBS script (again, see Start using the HPC for instructions on how to do this).

...

Code Block
#!/bin/bash -l
#PBS -N ObiRun
#PBS -l select=1:ncpus=2:mem=64gb
#PBS -l walltime=96:00:00

cd $PBS_O_WORKDIR

# Every continued line must end in a backslash with no blank line in between,
# otherwise obiconvert would be run outside the container.
singularity exec /home/your_home_directory/anacapa/anacapa-1.5.0.img \
obiconvert \
-t /home/your_home_directory/anacapa/crux_db/TAXO \
--embl \
--ecopcrdb-output=/home/your_home_directory/anacapa/crux_db/Obitools_databases/OB_dat_EMBL_std/OB_dat_EMBL_std \
--skip-on-error \
/home/your_home_directory/anacapa/crux_db/EMBL/*.dat

...

The -d should point to where your CRUX databases are. Check in this directory. You should see subdirectories containing NCBI taxonomy, obiconvert results, NCBI accession2taxonomy databases (see ‘Step 3: Create reference libraries using CRUX’ to see where you created these databases).

...

Code Block
#!/bin/bash -l
#PBS -N CRUX
#PBS -l select=1:ncpus=2:mem=64gb
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR

singularity exec docker://ghcr.io/eresearchqut/anacapa-image:0.0.3 /bin/bash \
/home/your_home_directory/anacapa/crux_db/crux.sh \
-n 16Smam -f CGGTTGGGGTGACCTCGGA -r GCTGTTATCCCTAGGGTAACT \
-s 40 -m 240 \
-o /home/your_home_directory/anacapa/16Smam \
-d /home/your_home_directory/anacapa/crux_db/ -l
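
Because this job can run for many hours, it may be worth sanity-checking the primer sequences passed to -f and -r before submitting. The sketch below only checks that each primer consists of upper-case IUPAC nucleotide codes. is_valid_primer is a hypothetical helper and says nothing about whether the primers are biologically appropriate.

```shell
#!/bin/bash
# is_valid_primer is a hypothetical helper, not part of CRUX.
# It accepts upper-case IUPAC nucleotide codes only
# (A, C, G, T plus the ambiguity codes R Y S W K M B D H V N).
is_valid_primer() {
    [[ "$1" =~ ^[ACGTRYSWKMBDHVN]+$ ]]
}

# The 16Smam primers from the script above.
forward=CGGTTGGGGTGACCTCGGA
reverse=GCTGTTATCCCTAGGGTAACT

for primer in "$forward" "$reverse"; do
    if is_valid_primer "$primer"; then
        echo "$primer: ${#primer} bases, characters look valid"
    else
        echo "$primer: contains unexpected characters" >&2
    fi
done
```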

Step 5: Running anacapa

Now that the CRUX databases have been constructed, we can run anacapa itself on these databases.

This consists of two steps (steps 5 and 6 in this guide).

GitHub - limey-bean/Anacapa

First (this section, section 5) we run:

..sequence QC and generate amplicon sequence variants (ASVs) from Illumina data using dada2 (Callahan et al. 2016). ASVs are a novel solution to identifying biologically informative unique sequences in metabarcoding samples that replaces the operational taxonomic unit (OTU) framework. Unlike OTUs, which cluster sequences using an arbitrary sequence similarity threshold (e.g. 97%), ASVs are unique sequence reads determined using Bayesian probabilities of known sequencing error. These unique sequences can differ by as little as 2 bp, providing improved taxonomic resolution and an increase in observed diversity. Please see Callahan et al. (2016) and Amir et al. (2017) for further discussion.

...

Example anacapa script:

Code Block
/bin/bash ~/Anacapa_db/anacapa_QC_dada2.sh -i <input_dir> -o <out_dir> -d <database_directory> -a <adapter type (nextera or truseq)> -t <illumina run type HiSeq or MiSeq> -l

Required arguments:

-i      path to .fastq.gz files, if files are already uncompressed use -g

-o      path to output directory

-d      path to the CRUX database you generated in the previous section

-a      Illumina adapter type: nextera, truseq, or NEBnext

-t      Illumina Platform: HiSeq (2 x 150) or MiSeq (>= 2 x 250)

Code Block
singularity exec /home/whatmorp/nextflow/pia_eDNAFlow/Anacapa/anacapa-1.5.0.img \
/bin/bash /home/whatmorp/nextflow/pia_eDNAFlow/Anacapa/anacapa/Anacapa_db/anacapa_QC_dada2.sh \
-i /home/whatmorp/nextflow/pia_eDNAFlow/fastq \
-o /home/whatmorp/nextflow/pia_eDNAFlow/Anacapa/16Smam_anacapa_output \
-d /home/whatmorp/nextflow/pia_eDNAFlow/Anacapa/anacapa/Anacapa_db \
-a nextera -t MiSeq -g -l
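
The argument list says to add -g when the FASTQ files are already uncompressed. A small check like the one sketched below can decide this from the contents of the input directory. gz_flag is a hypothetical helper, and it assumes the files end in .fastq or .fastq.gz.

```shell
#!/bin/bash
# gz_flag is a hypothetical helper, not part of Anacapa.
# It prints "-g" when the directory holds uncompressed .fastq files,
# prints nothing when it holds .fastq.gz files, and fails when it finds neither.
gz_flag() {
    local dir="$1"
    if compgen -G "$dir/*.fastq.gz" > /dev/null; then
        return 0
    elif compgen -G "$dir/*.fastq" > /dev/null; then
        echo "-g"
    else
        echo "No FASTQ files found in $dir" >&2
        return 1
    fi
}

# Example: work out the optional flag for the current directory.
if flag=$(gz_flag .); then
    echo "Extra flag for anacapa_QC_dada2.sh: '${flag}'"
fi
```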

Step 6: Running anacapa classifier

Example:

Code Block
/bin/bash ~/Anacapa_db/anacapa_classifier.sh -o <out_dir_for_anacapa_QC_run> -d <database_directory> -u <hoffman_account_user_name> -l

Required Arguments:

        -o      path to output directory generated in the Sequence QC and ASV Parsing script

        -d      path to Anacapa_db

Code Block
singularity exec /home/whatmorp/nextflow/pia_eDNAFlow/Anacapa/anacapa-1.5.0.img \
/bin/bash /home/whatmorp/nextflow/pia_eDNAFlow/Anacapa/anacapa/Anacapa_db/anacapa_classifier.sh \
-o /home/whatmorp/nextflow/pia_eDNAFlow/Anacapa/16Smam_anacapa_output \
-d /home/whatmorp/nextflow/pia_eDNAFlow/Anacapa/anacapa/Anacapa_db -l

chmod 777 /home/whatmorp/nextflow/pia_eDNAFlow/Anacapa/16Smam_anacapa_output/Run_info/run_scripts/16Smam_bowtie2_blca_job.sh

Cleanup

Running the anacapa workflow involves downloading and generating various large databases. These will just take up space on the HPC unless removed.
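
Before removing anything, it can help to see which directories are actually using the space. The sketch below lists the size of each subdirectory. list_sizes is a hypothetical helper, and the crux_db path is the one used earlier in this guide with $HOME standing in for your home directory.

```shell
#!/bin/bash
# list_sizes is a hypothetical helper: it prints the size in kilobytes of
# each subdirectory of the given directory, smallest first.
list_sizes() {
    local dir="$1"
    du -sk "$dir"/*/ 2>/dev/null | sort -n
}

# Example: inspect the CRUX database directory used in this guide.
list_sizes "$HOME/anacapa/crux_db"
```

The largest entries appear last. This listing only reports sizes and does not decide which directories are safe to delete.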

...