...
You should see several output directories and files have been created in your ‘ampliseq_test’ directory. These contain the test analysis results. Have a look through these, as they are similar to the output from a full ampliseq run (i.e. on your dataset).
Need instructions on setting up Nextflow Tower
Q for Craig:
Do we need to add any of this to .nextflow/config file? Perhaps just for Tower?
process {
    executor = 'pbspro'
    scratch = 'true'
    beforeScript = {
        """
        mkdir -p /data1/whatmorp/singularity/mnt/session
        source $HOME/.bashrc
        source $HOME/.profile
        """
    }
}
...
As with the test run, you will need to download some data files and create some new files (manifest file, metadata file, nextflow.config file) to get ampliseq running on the HPC.
...
NOTE: be very careful about the naming and structure of these files. Sample IDs in the manifest and metadata files must match exactly, and the file paths must be correct. Column names must match the examples below exactly (including case). Spelling errors and stray commas or other characters in these files are among the most common reasons for ampliseq to fail.
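A quick way to catch mismatched sample IDs before launching is to compare the first column of the two files. The sketch below is only an illustration: it builds two small demo files in a temporary directory and lists any ID that appears in one file but not the other. The `ID` and `treatment` column headers, the `high_fat`/`low_fat` values and the deliberate typo `groupa_2` are made up for this example. To use it on real data, point `manifest` and `metadata` at your own files and drop the demo `printf` lines.

```shell
#!/bin/sh
set -eu

# Demo files in a throwaway directory; replace these two paths with your own files.
workdir=$(mktemp -d)
manifest="$workdir/manifest.txt"
metadata="$workdir/metadata.txt"

# Hypothetical demo content. The second metadata ID has a deliberate case typo.
printf 'sampleID\tforwardReads\treverseReads\n' > "$manifest"
printf 'groupA_1\t/home/myproject/fastq/sample1_S22_L001_R1.fastq.gz\t/home/myproject/fastq/sample1_S22_L001_R2.fastq.gz\n' >> "$manifest"
printf 'groupA_2\t/home/myproject/fastq/sample2_S23_L001_R1.fastq.gz\t/home/myproject/fastq/sample2_S23_L001_R2.fastq.gz\n' >> "$manifest"
printf 'ID\ttreatment\n' > "$metadata"        # placeholder header names
printf 'groupA_1\thigh_fat\n' >> "$metadata"
printf 'groupa_2\tlow_fat\n' >> "$metadata"

# First column of each file, header row skipped, sorted so comm can compare them.
tail -n +2 "$manifest" | cut -f1 | sort > "$workdir/manifest_ids"
tail -n +2 "$metadata" | cut -f1 | sort > "$workdir/metadata_ids"

# comm -3 hides IDs present in both files; anything left over is a mismatch.
mismatches=$(comm -3 "$workdir/manifest_ids" "$workdir/metadata_ids")

if [ -n "$mismatches" ]; then
    echo "Sample IDs found in only one file:"
    echo "$mismatches"
else
    echo "All sample IDs match."
fi
```

In the output, IDs that appear only in the manifest are printed flush left, and IDs that appear only in the metadata file are indented by a tab.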
Taxonomic database
Download the SILVA database. This is the main database ampliseq uses for taxonomic classification.
Code Block |
---|
wget https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip |
...
sampleID | forwardReads | reverseReads |
---|---|---|
groupA_1 | /home/myproject/fastq/sample1_S22_L001_R1.fastq.gz | /home/myproject/fastq/sample1_S22_L001_R2.fastq.gz |
groupA_2 | /home/myproject/fastq/sample2_S23_L001_R1.fastq.gz | /home/myproject/fastq/sample2_S23_L001_R2.fastq.gz |
etc… |
...
Creating a manifest file at the command line
As mentioned above, spelling mistakes or extra characters in the file paths will cause ampliseq to fail. One way to avoid this is to generate the manifest file on the command line using the tools awk and sed.
Below is an example of how to generate the manifest file. You may need to modify this, depending on how your files are named.
To create the manifest using awk, paste, sed:
1. List all the fastq files in the directory (forward and reverse reads separately)
Code Block |
---|
ls *_R1*.fastq.gz > read1 |
...
ls *_R2*.fastq.gz > read2 |
2. List the sample IDs. If the sample names are in the sample files, they can be extracted using sed. For example:
Code Block |
---|
sed 's/_S.*//' read1 > ID |
In this example, the sample file names look like this: ‘Raw8h_S10_L001_R1_001.fastq.gz’
The sample ID is ‘Raw8h’. The above sed command strips everything from the first ‘_S’ onwards, leaving just the ID. Depending on how your sample files are named, you may need to modify the sed command to extract your sample IDs.
3. Paste these together, tab-delimited, with the fastq directory prepended to each file name. Output as ‘manifest.txt’.
Code Block |
---|
paste ID read1 read2 | awk '{print $1 "\t" "/path/to/your/nextflow/myproject/fastq/" $2 "\t" "/path/to/your/nextflow/myproject/fastq/" $3}' > manifest.txt |
Make sure you then manually add the three column names at the top of each column: ‘sampleID’, ‘forwardReads’ and ‘reverseReads’ (e.g. use a text editor like nano, or download the file, modify it in Excel and re-upload it).
Finally, copy the created manifest.txt to the directory where you will be running ampliseq from.
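The steps above can also be combined into one short script that writes the header row itself, which avoids editing the file by hand. This is a minimal sketch and not part of ampliseq: to be safe to run anywhere, it creates empty placeholder fastq files in a temporary directory (the second sample name, ‘Raw24h’, is made up). For real data, set `fastq_dir` to your own fastq directory and remove the `touch` lines. It assumes file names follow the ‘Raw8h_S10_L001_R1_001.fastq.gz’ pattern and that sample IDs do not themselves contain ‘_S’, since the sed rule cuts at the first ‘_S’.

```shell
#!/bin/sh
set -eu

# Demo setup: empty placeholder files in a throwaway directory.
# For real data, set fastq_dir to your fastq directory and skip the touch lines.
workdir=$(mktemp -d)
fastq_dir="$workdir/fastq"
mkdir -p "$fastq_dir"
touch "$fastq_dir/Raw8h_S10_L001_R1_001.fastq.gz" \
      "$fastq_dir/Raw8h_S10_L001_R2_001.fastq.gz" \
      "$fastq_dir/Raw24h_S11_L001_R1_001.fastq.gz" \
      "$fastq_dir/Raw24h_S11_L001_R2_001.fastq.gz"

manifest="$workdir/manifest.txt"

# Write the header row first so the column names are never typed by hand.
printf 'sampleID\tforwardReads\treverseReads\n' > "$manifest"

# One row per forward-read file; the reverse-read name is derived from it.
for r1 in "$fastq_dir"/*_R1*.fastq.gz; do
    name1=$(basename "$r1")
    name2=$(printf '%s\n' "$name1" | sed 's/_R1/_R2/')   # first _R1 becomes _R2
    id=$(printf '%s\n' "$name1" | sed 's/_S.*//')        # same sed rule as step 2
    if [ ! -e "$fastq_dir/$name2" ]; then
        echo "No reverse-read file found for $name1" >&2
        exit 1
    fi
    printf '%s\t%s\t%s\n' "$id" "$fastq_dir/$name1" "$fastq_dir/$name2" >> "$manifest"
done

cat "$manifest"
```

Because each reverse-read path is derived from its forward-read partner and checked for existence, a missing or misnamed R2 file stops the script instead of silently shifting the columns.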
Metadata file
This is a tab-separated values (.tsv) file that QIIME2 requires to compare taxonomic diversity with phenotype (e.g. how diversity varies per experimental treatment). It contains the same sample IDs found in the manifest file, plus a column for each category of metadata you have for the samples. This may include sequence barcodes, experimental treatment group (e.g. high fat vs low fat) and any other measurements taken, such as age, date collected, tissue type, sex, collection location, weight or length. QIIME2 will compare every metadata column with taxonomic results, then calculate and plot correlations and diversity indices. See here for more details:
...
Code Block |
---|
params {
    max_cpus = 32
    max_memory = 512.GB
    max_time = 48.h
    FW_primer = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAGCCTACGGGNGGCWGCAG"
    RV_primer = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCCGACTACHVGGGTATCTAATCC"
    metadata = "metadata.txt"
    manifest = "manifest.txt"
    reference_database = "Silva_132_release.zip"
    retain_untrimmed = true
} |
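The config above refers to three files by relative name, so they presumably need to sit in the directory you launch ampliseq from. A small pre-flight check can confirm that before a job is submitted. The sketch below is illustrative only: `rundir` is a temporary stand-in for your run directory, and the demo deliberately leaves out the database zip so the check has something to report.

```shell
#!/bin/sh
set -eu

# Stand-in for your ampliseq run directory; replace with the real path.
rundir=$(mktemp -d)

# Demo only: create two of the three files, leaving the database zip missing.
touch "$rundir/manifest.txt" "$rundir/metadata.txt"

# Count how many of the files named in the config are absent.
missing=0
for f in metadata.txt manifest.txt Silva_132_release.zip; do
    if [ -e "$rundir/$f" ]; then
        echo "found:   $f"
    else
        echo "MISSING: $f"
        missing=$((missing + 1))
    fi
done

echo "$missing file(s) missing"
```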
...
If you haven’t been set up on the HPC or haven’t used it previously, click on this link for information on how to get access to and use the HPC:
Need a link here for HPC access and usage
Creating a shared workspace on the HPC
...
To request a node using PBS, submit a shell script containing your RAM/CPU/analysis time requirements and the code needed to run your analysis. For an overview of submitting a PBS job, see here:
Need a link here for creating PBS jobs
Alternatively, you can start up an ‘interactive’ node, using the following:
...