Anacapa is a toolkit designed to construct reference databases and assign taxonomy from eDNA sequences.
...
Run the obiconvert command from the anacapa Singularity image.
Important: You need to change every instance of /home/your_home_directory in the command below to your actual home directory (this is because obiconvert requires absolute paths). To find your home directory path, type cd ~ and then pwd. Use the path that this displays to replace /home/your_home_directory.
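The substitution described above can also be scripted rather than done by hand. The following is a minimal sketch, assuming the placeholder appears literally as /home/your_home_directory; the function name fill_home is hypothetical and is not part of anacapa or obitools.

```shell
#!/bin/bash
# Hypothetical helper (not part of anacapa): replace the placeholder
# /home/your_home_directory in a piece of text with a real home directory.
# Assumes the home path contains no '|' or '&' characters, which would
# confuse the sed expression below.
fill_home() {
  local text="$1"
  local home_dir="${2:-$HOME}"   # defaults to the current user's home
  printf '%s\n' "$text" | sed "s|/home/your_home_directory|${home_dir}|g"
}

# Example: show a path with the placeholder filled in for the current user.
fill_home "/home/your_home_directory/anacapa/crux_db/TAXO"
```

Check the printed path against the output of pwd before pasting it into a command.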
...
During initial testing on the mammal EMBL databases, this took about 8 hours to complete. Note that a PBS interactive session has a maximum time limit of 12 hours (and we requested 11 hours when we started our session). If you are working with a larger dataset (e.g. vertebrates or invertebrates), this process may take much longer, possibly longer than an interactive session will run. In that case you will need to submit the above obiconvert command as a PBS script (again, see HPC for instructions on how to do this).
An example PBS script for running obiconvert is shown below.
Code Block |
---|
#!/bin/bash -l
#PBS -N ObiRun
#PBS -l select=1:ncpus=2:mem=64gb
#PBS -l walltime=96:00:00
cd $PBS_O_WORKDIR
singularity exec /home/your_home_directory/anacapa/anacapa-1.5.0.img \
obiconvert \
-t /home/your_home_directory/anacapa/crux_db/TAXO \
--embl \
--ecopcrdb-output=/home/your_home_directory/anacapa/crux_db/Obitools_databases/OB_dat_EMBL_std/OB_dat_EMBL_std \
--skip-on-error \
/home/your_home_directory/anacapa/crux_db/EMBL/*.dat |
As before, you’ll need to change the above directory locations to match where your singularity image is, your taxonomy database, your output directory and your EMBL database.
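Because a mistyped path may only surface after the job has waited in the queue, it can help to check the locations before submitting. The following is a rough sketch of such a check; the function name check_obiconvert_inputs is hypothetical, and it only tests that the files and directories exist, not that their contents are valid.

```shell
#!/bin/bash
# Hypothetical pre-flight check (not part of anacapa or obitools): confirm
# that the inputs named in the PBS script exist before submitting the job.
check_obiconvert_inputs() {
  local image="$1" taxo_dir="$2" embl_dir="$3"
  local status=0
  [ -f "$image" ]    || { echo "Missing Singularity image: $image" >&2; status=1; }
  [ -d "$taxo_dir" ] || { echo "Missing taxonomy directory: $taxo_dir" >&2; status=1; }
  # Expand the glob; if nothing matches, $1 is the unexpanded pattern,
  # which does not exist as a file.
  set -- "$embl_dir"/*.dat
  [ -e "$1" ] || { echo "No .dat files found in: $embl_dir" >&2; status=1; }
  return "$status"
}

# Usage (paths follow the example script above; adjust to your own):
# check_obiconvert_inputs \
#   "$HOME/anacapa/anacapa-1.5.0.img" \
#   "$HOME/anacapa/crux_db/TAXO" \
#   "$HOME/anacapa/crux_db/EMBL"
```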
To create this script you can use a text editor like nano. In your HPC command line, type:
Code Block |
---|
module load nano |
This loads nano. Then type:
Code Block |
---|
nano launch.pbs |
This will create an empty PBS script file called ‘launch.pbs’. Copy and paste the PBS script text from the code block above into nano, then press control and o to save the file, then control and x to exit nano.
Now you can launch this as a PBS job by typing:
Code Block |
---|
qsub launch.pbs |
Your job will be added to the queue, so it may take some time to start if there are many jobs queued. You can check the status of your jobs by typing:
Code Block |
---|
qstat -u <username> |
Change <username> to your own user (logon) name.
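Submission and status checking can be combined. The sketch below relies on qsub printing the new job ID to standard output and on qstat accepting a job ID as its argument; the wrapper name submit_and_report is hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper: submit a PBS script and immediately show the status
# of the job that was created.
submit_and_report() {
  local script="$1"
  local job_id
  # qsub prints the ID of the newly created job; stop if submission fails.
  job_id=$(qsub "$script") || return 1
  echo "Submitted job: $job_id"
  qstat "$job_id"
}

# Usage (on the HPC): submit_and_report launch.pbs
```

Keeping the job ID is useful because it lets you query a single job instead of listing every job under your user name.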
Step 4: Running CRUX
Once you have downloaded and converted the required databases (section above), you can run CRUX.
CRUX generates taxonomic reference libraries by querying your primers against an ecoPCR database you generated in Step 3. Anacapa then uses these libraries for taxonomic assignment of your sequences.
Example command:
Code Block |
---|
/bin/bash ~/Crux/crux_db/crux.sh -n 12S -f GTCGGTAAAACTCGTGCCAGC -r CATAGTGGGGTATCTAATCCCAGTTTG -s 80 -m 280 -o ~/Crux/crux_db/12S -d ~/Crux/crux_db -l |
The -s and -m parameters indicate the shortest and longest expected amplicons, respectively. -n is the name of the primer set. -f and -r are the forward and reverse primers. -o is the output directory and -d is the directory containing the subdirectories of the CRUX databases you generated previously (NCBI taxonomy, obiconvert results, NCBI accession2taxonomy).
For more details on the tools and steps that are run in this section, see:
GitHub - limey-bean/CRUX_Creating-Reference-libraries-Using-eXisting-tools
Create a subdirectory, under your main anacapa directory, to hold the CRUX output that Anacapa will use. In this example we’re running the test 16S mammal primers/databases, so we’ll call the output directory ‘16Smam’. Change this to a name suitable for your dataset (and also change 'your_home_directory' to your actual home directory).
Code Block |
---|
cd /home/your_home_directory/anacapa
mkdir 16Smam |
Modifying your CRUX config script
You’ll need to change some lines in your ‘crux_config.sh’ file, so that this config file points to the correct locations of tools and databases.
Your ‘crux_config.sh’ file should be in: /home/your_home_directory/Anacapa/CRUX_Creating-Reference-libraries-Using-eXisting-tools/crux_db/scripts
I have attached a copy of a working crux_config.sh file below.
First, back up your current crux_config.sh file like so (make sure you’re in the directory where the ‘crux_config.sh’ file is):
Code Block |
---|
mv crux_config.sh crux_config.sh_bak |
Now copy the above attached crux_config.sh file to that directory.
There is one line you’ll still need to manually modify. Open crux_config.sh in nano:
Code Block |
---|
nano crux_config.sh |
And change the BLAST_DB="/home/whatmorp/nextflow/pia_eDNAFlow/db/nt" line to where you downloaded your NCBI nt library (see the ‘Download the NCBI nt library’ section in this guide).
Press control and o to save the file, then control and x to exit nano.
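As an alternative to editing the line in nano, the change can be made with sed. This is a minimal sketch, assuming the line begins with BLAST_DB= exactly as quoted above; the function name set_blast_db and the example path are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: point the BLAST_DB line of a crux_config.sh file at a
# new location. Assumes the line starts with BLAST_DB= and that the new path
# contains no '|' or '&' characters. The unmodified file is kept alongside
# with a .orig suffix.
set_blast_db() {
  local config="$1" nt_path="$2"
  sed -i.orig "s|^BLAST_DB=.*|BLAST_DB=\"${nt_path}\"|" "$config"
  grep '^BLAST_DB=' "$config"   # print the edited line for checking
}

# Usage: set_blast_db crux_config.sh /path/to/your/nt
```

The printed line should show the location where you downloaded your NCBI nt library.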
Running CRUX
We have added all the required tools to a singularity image, so run the CRUX command using this singularity image.
Code Block |
---|
singularity exec docker://ghcr.io/eresearchqut/anacapa-image:0.0.3 /bin/bash /home/your_home_directory/anacapa/crux_db/crux.sh -n 16Smam -f CGGTTGGGGTGACCTCGGA -r GCTGTTATCCCTAGGGTAACT -s 40 -m 240 -o /home/your_home_directory/anacapa/16Smam -d /home/your_home_directory/anacapa/crux_db/ -l |
As before, change all instances of 'your_home_directory' to your actual home directory.
Change the name (-n) to your primer set name. This should be the same name as your output directory.
Change -f and -r to your forward and reverse primer sequences.
Change -s and -m to suit your amplicons. Find the expected length of your amplicons (this should be in the literature associated with your primer set) and make -s 100bp smaller and -m 100bp longer than this length. E.g. our 16Smam test primers amplify a product approx. 140bp in length, so we use -s 40 -m 240.
Change the -o output directory to the directory location you created at the start of this section.
The -d option should point to where your CRUX databases are. Check in this directory. You should see subdirectories containing NCBI taxonomy, obiconvert results, NCBI accession2taxonomy databases (see ‘Step 3: Create reference libraries using CRUX’ to see where you created these databases).
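The rule for choosing -s and -m is simple arithmetic and can be written as a small helper. This is only a sketch: the function name amplicon_bounds is hypothetical, and clamping the lower bound to 1 for very short amplicons is an assumption, not something CRUX specifies.

```shell
#!/bin/bash
# Hypothetical helper: derive the -s and -m values from an expected amplicon
# length by subtracting and adding a margin (100bp by default).
amplicon_bounds() {
  local expected="$1"
  local margin="${2:-100}"
  local shortest=$(( expected - margin ))
  # Assumption: a shortest length below 1 is not meaningful, so clamp it.
  if [ "$shortest" -lt 1 ]; then shortest=1; fi
  echo "-s ${shortest} -m $(( expected + margin ))"
}

# The 16Smam test primers amplify a product of approx. 140bp:
amplicon_bounds 140   # prints: -s 40 -m 240
```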
Step 5: Running anacapa
Now that the CRUX databases have been constructed, we can run anacapa itself on these databases.
This constitutes 2 steps (steps 5 and 6 in this guide).
First (this section, section 5) we run:
..sequence QC and generate amplicon sequence variants (ASV) from Illumina data using dada2 (Callahan et al. 2016). ASVs are a novel solution to identifying biologically informative unique sequences in metabarcoding samples that replaces the operational taxonomic unit (OTU) framework. Unlike OTUs, which cluster sequences using an arbitrary sequence similarity (ex 97%), ASVs are unique sequence reads determined using Bayesian probabilities of known sequencing error. These unique sequences can be as little as 2 bp different, providing improved taxonomic resolution and an increase in observed diversity. Please see (Callahan et al. 2016, Amir et al. 2017) for further discussion.
...
Example anacapa script:
Code Block |
---|
/bin/bash ~/Anacapa_db/anacapa_QC_dada2.sh -i <input_dir> -o <out_dir> -d <database_directory> -a <adapter type (nextera or truseq)> -t <illumina run type HiSeq or MiSeq> -l |
Required arguments:
-i path to .fastq.gz files, if files are already uncompressed use -g
-o path to output directory
-d path to the CRUX database you generated in the previous section.
-a Illumina adapter type: nextera, truseq, or NEBnext
-t Illumina Platform: HiSeq (2 x 150) or MiSeq (>= 2 x 250)
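Since -a and -t accept only a few values, they can be checked before a long run is started. The sketch below assumes the script expects the spellings exactly as listed above (this has not been verified against anacapa_QC_dada2.sh); the function name check_anacapa_options is hypothetical.

```shell
#!/bin/bash
# Hypothetical check: confirm that the adapter type (-a) and Illumina
# platform (-t) match the values listed in the required arguments.
# Assumption: the spellings are case-sensitive as written in this guide.
check_anacapa_options() {
  local adapter="$1" platform="$2"
  case "$adapter" in
    nextera|truseq|NEBnext) ;;
    *) echo "Unknown adapter type: $adapter" >&2; return 1 ;;
  esac
  case "$platform" in
    HiSeq|MiSeq) ;;
    *) echo "Unknown platform: $platform" >&2; return 1 ;;
  esac
}

# Usage: check_anacapa_options nextera MiSeq && echo "options look valid"
```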
Cleanup
Running the anacapa workflow involves downloading and generating various large databases. These will just take up space on the HPC unless removed.
...