Content Comparison

...

You should see that several output directories and files have been created in your ‘ampliseq_test’ directory. These contain the test analysis results. Have a look through them, as they are similar to the output of a full ampliseq run (i.e. on your own dataset).

Need instructions on setting up Nextflow Tower

Q for Craig:

Do we need to add any of this to .nextflow/config file? Perhaps just for Tower?

process {
executor = 'pbspro'
scratch = 'true'
beforeScript = {
"""
mkdir -p /data1/whatmorp/singularity/mnt/session
source $HOME/.bashrc
source $HOME/.profile
"""
}
}

...

As with the test run, you will need to download some data files and create some new files (manifest file, metadata file, nextflow.config file) to get ampliseq running on the HPC.

...

NOTE: be very careful about the naming and structure of these files. Sample IDs in the manifest and metadata files must match exactly, and the file paths need to be correct. Column names must be exactly as in the examples below (including case). Spelling errors and stray commas or other characters in these files are among the more common reasons for ampliseq to fail.
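
Once both files exist, the ID requirement can be checked by comparing their ID columns. The sketch below assumes the sample ID is the first column of both files and that each file has exactly one header row; `compare_ids` is a hypothetical helper name, not part of ampliseq or QIIME2.

```shell
# compare_ids MANIFEST METADATA
# Prints every sample ID that appears in only one of the two files.
# Assumption: IDs are in column 1 and line 1 of each file is a header.
compare_ids() {
    ids_dir=$(mktemp -d)

    # Column 1 of each file, header skipped, sorted as comm requires.
    tail -n +2 "$1" | cut -f 1 | sort > "$ids_dir/manifest_ids"
    tail -n +2 "$2" | cut -f 1 | sort > "$ids_dir/metadata_ids"

    # comm -3 hides IDs common to both files.
    # IDs found only in the metadata file are printed with a leading tab.
    comm -3 "$ids_dir/manifest_ids" "$ids_dir/metadata_ids"

    rm -r "$ids_dir"
}
```

Running `compare_ids manifest.txt metadata.txt` and getting no output suggests the two ID columns agree.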

Taxonomic database

Download the SILVA database. This is the main database ampliseq uses for taxonomic classification.

Code Block
wget https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip

...

sampleID	forwardReads	reverseReads
groupA_1	/home/myproject/fastq/sample1_S22_L001_R1.fastq.gz	/home/myproject/fastq/sample1_S22_L001_R2.fastq.gz
groupA_2	/home/myproject/fastq/sample2_S23_L001_R1.fastq.gz	/home/myproject/fastq/sample2_S23_L001_R2.fastq.gz
etc…

...

Creating a manifest file at the command line

As mentioned above, spelling mistakes or extra characters in the file paths will cause ampliseq to fail. One way to avoid this is to generate the manifest file on the command line using the tools awk and sed.

Below is an example of how to generate the manifest file. You may need to modify this, depending on how your files are named.

To create the manifest using awk, paste and sed:

1. List all the fastq files in the directory (both read pairs):

Code Block
ls *_R1*.fastq.gz -lh | awk '{print $9}' > read1

...



ls *_R2*.fastq.gz -lh | awk '{print $9}' > read2

2. List the sample IDs. If the sample IDs are part of the file names, they can be extracted using sed. For example:

Code Block
sed 's/_S.*//' read1 > ID

In this example the sample file names look like this: ‘Raw8h_S10_L001_R1_001.fastq.gz’

The sample ID is ‘Raw8h’. The above sed command strips everything from ‘_S’ onwards, leaving just the ID. Depending on how your sample files are named, you can create a list of your sample IDs by modifying the above sed command.
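
To see what the substitution does before running it on the whole list, it can be tried on a single name. The first name below is the one from the example; the second name and its `_R1` pattern are hypothetical, shown only to illustrate adapting the command to a different naming scheme.

```shell
# File name taken from the example above.
name='Raw8h_S10_L001_R1_001.fastq.gz'

# Delete everything from the first '_S' to the end of the name.
id=$(printf '%s\n' "$name" | sed 's/_S.*//')
echo "$id"          # prints: Raw8h

# Hypothetical naming scheme without an '_S<number>' field:
# cut at '_R1' instead.
other='gut-07_R1.fastq.gz'
other_id=$(printf '%s\n' "$other" | sed 's/_R1.*//')
echo "$other_id"    # prints: gut-07
```

Note that the pattern matches the first ‘_S’ in the name, so an ID that itself contains ‘_S’ would be cut short and would need a more specific pattern.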

3. Paste these together with the sample file directory prepended and tab delimiters. Output as ‘manifest.txt’.

Code Block
paste ID read1 read2 | awk '{print $1 "\t" "/path/to/your/nextflow/myproject/fastq/" $2 "\t" "/path/to/your/nextflow/myproject/fastq/" $3}' > manifest.txt

Make sure you then manually add the 3 column names at the top of each column: ‘sampleID’, ‘forwardReads’ and ‘reverseReads’ (e.g. use a text editor like nano, or download the file, modify it in Excel, then re-upload it).
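
The header line can also be added at the command line, which avoids opening the file in an editor. This is a minimal sketch: the first two lines only set up a one-row demo manifest in a scratch directory, and `manifest.tmp` is an arbitrary temporary file name.

```shell
# Demo setup only: a scratch directory with a one-row manifest, no header.
cd "$(mktemp -d)"
printf 'groupA_1\t/home/myproject/fastq/sample1_S22_L001_R1.fastq.gz\t/home/myproject/fastq/sample1_S22_L001_R2.fastq.gz\n' > manifest.txt

# Write the tab-separated header, then the existing rows beneath it.
printf 'sampleID\tforwardReads\treverseReads\n' | cat - manifest.txt > manifest.tmp

# Replace the original file with the version that has the header.
mv manifest.tmp manifest.txt

# Show the first line to confirm the header is in place.
head -n 1 manifest.txt
```

On a real manifest only the `printf ... | cat` and `mv` lines are needed, run in the directory that holds manifest.txt.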

Finally, copy the created manifest.txt to the directory where you will be running ampliseq from.
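
Because small mistakes in this file are a common cause of failure, it may be worth checking it before starting a run. The sketch below checks the header, the number of fields per row and whether each listed read file can be opened; `check_manifest` is a hypothetical helper name, not an ampliseq command.

```shell
# check_manifest FILE
# Prints one line per problem found; exit status is non-zero if any exist.
check_manifest() {
    awk -F '\t' '
        NR == 1 {
            # Column names must match exactly, including case.
            if ($0 != "sampleID\tforwardReads\treverseReads") {
                print "line 1: header must be sampleID, forwardReads, reverseReads"
                bad = 1
            }
            next
        }
        NF != 3 {
            print "line " NR ": expected 3 tab-separated fields, found " NF
            bad = 1
            next
        }
        {
            for (i = 2; i <= 3; i++) {
                # getline returns -1 when the file cannot be opened.
                if ((getline first_line < $i) < 0) {
                    print "line " NR ": cannot read " $i
                    bad = 1
                }
                close($i)
            }
        }
        END { exit bad }
    ' "$1"
}
```

Usage: `check_manifest manifest.txt`. No output means none of these three checks found a problem. A file saved from Excel with Windows line endings will fail the header check, which is worth knowing if the file was edited there.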

Metadata file

This is a tab-separated values file (.tsv) that QIIME2 requires in order to compare taxonomic diversity with phenotype (e.g. how diversity varies between experimental treatments). It contains the same sample IDs found in the manifest file, plus a column for each category of metadata you have for the samples. This may include sequence barcodes, experimental treatment group (e.g. high fat vs low fat) and any other measurements taken, such as age, date collected, tissue type, sex, collection location, weight, length, etc. QIIME2 will compare every metadata column with the taxonomic results, then calculate and plot correlations and diversity indices. See here for more details:

...

Code Block

params {
    max_cpus = 32
    max_memory = 512.GB
    max_time = 48.h
    FW_primer = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAGCCTACGGGNGGCWGCAG"
    RV_primer = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCCGACTACHVGGGTATCTAATCC"
    metadata = "metadata.txt"
    manifest = "manifest.txt"
    reference_database = "Silva_132_release.zip"
    retain_untrimmed = true
}
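
The params block names its input files without a directory, so they are presumably expected in the directory ampliseq is launched from. A small check like the one below can confirm they are in place; `check_inputs` is a hypothetical helper name.

```shell
# check_inputs FILE...
# Reports each named file that does not exist in the current directory;
# exit status is non-zero if anything is missing.
check_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

Usage, with the file names from the config above: `check_inputs metadata.txt manifest.txt Silva_132_release.zip`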

...

If you haven’t been set up on the HPC or haven’t used it before, click on this link for information on how to get access to and use the HPC:

Need a link here for HPC access and usage

Creating a shared workspace on the HPC

...

To request a node using PBS, submit a shell script containing your RAM/CPU/analysis time requirements and the code needed to run your analysis. For an overview of submitting a PBS job, see here:

Need a link here for creating PBS jobs
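
As a rough illustration of such a script, here is a minimal PBS Pro sketch. The job name and the CPU, memory and walltime values are placeholders to adjust for your analysis, and the final `echo` stands in for the actual analysis commands.

```shell
#!/bin/bash
# Resource requests (placeholder values): 1 node chunk, 4 CPUs, 16 GB RAM.
#PBS -N ampliseq
#PBS -l select=1:ncpus=4:mem=16gb
# Maximum run time, as hours:minutes:seconds.
#PBS -l walltime=48:00:00

# PBS starts the job in your home directory; move to the directory the
# job was submitted from (falls back to the current directory otherwise).
cd "${PBS_O_WORKDIR:-$PWD}"

# Placeholder: the commands that run your analysis go here.
echo "Job running in $(pwd)"
```

Saved as, for example, `job.sh` (a hypothetical file name), it would be submitted with `qsub job.sh`.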

Alternatively, you can start up an ‘interactive’ node, using the following:

...

Version	Old Version 12	New Version 13
Changes made by	Paul Whatmore (Deactivated)	Paul Whatmore (Deactivated)
Saved on	Jan 28, 2021	Jan 28, 2021
