Table of Contents
...
This will submit multiple PBS jobs, one for each test file and analysis step. How quickly these run depends on how busy the HPC queue is. If there is no queue (rare), the test run will finish in approximately 30 minutes. Regardless, the run does not need to be actively monitored; you can check later to see whether it succeeded or failed.
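If you do want to check on progress, two quick commands (run from the directory you launched the test from) will show the state of the run: qstat lists your PBS jobs, and the hidden .nextflow.log file records the pipeline’s progress.
Code Block
# List your queued and running PBS jobs
qstat -u $USER
# View the most recent Nextflow log messages for this run
tail .nextflow.log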
Running NextFlow’s ampliseq pipeline
Mahsa’s data…:
Downloaded the Silva DB from: https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip
Nextflow.config params:
params {
    max_cpus = 32
    max_memory = 512.GB
    max_time = 48.h
    input = "/work/rumen_16S/nextflow/16S/fastq"
    extension = "/*.fastq.gz"
    FW_primer = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG"
    RV_primer = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC"
    metadata = "metadata.txt"
    reference_database = "/home/whatmorp/Annette/16S/Silva_132_release.zip"
    retain_untrimmed = true
}
reference_database points to the location where the Silva DB was originally downloaded.
input points to the directory containing Mahsa’s fastq files.
The metadata.txt and manifest.txt files have already been created (need to add details here).
Running with:
Code Block
nextflow run nf-core/ampliseq -profile singularity --manifest manifest.txt
Q for Craig:
Do we need to add any of this to .nextflow/config file? Perhaps just for Tower?
process {
    executor = 'pbspro'
    scratch = 'true'
    beforeScript = {
        """
        mkdir -p /data1/whatmorp/singularity/mnt/session
        source $HOME/.bashrc
        source $HOME/.profile
        """
    }
}
singularity {
    cacheDir = '/home/whatmorp/NXF_SINGULARITY_CACHEDIR'
    autoMounts = true
}
conda {
    cacheDir = '/home/whatmorp/NXF_CONDA_CACHEDIR'
}
tower {
    accessToken = 'c7f8cc62b24155c0150a6ce4b6db15946bfc19ef'
    endpoint = 'https://nftower.qut.edu.au/api'
    enabled = true
}
Installing NextFlow
Follow the eResearch wiki entry for installing NextFlow:
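If you just need a reminder of the basic steps, Nextflow is typically installed into your home directory with the commands below (this assumes Java is available via module load java; the wiki entry remains the authoritative guide):
Code Block
# Java must be available before installing or running Nextflow
module load java
# Download and install the nextflow executable into the current directory
curl -s https://get.nextflow.io | bash
# Optionally move it somewhere on your PATH, e.g. ~/bin
mv nextflow ~/bin/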
Basic pipeline
This section provides a skeleton overview of the pipeline commands. If you’ve installed the tools and run this pipeline before, you can simply run these commands on your data. If this is your first time running this analysis, it’s strongly recommended that you read the following sections of this article, as they provide important background information.
Analysis overview
Pipeline overview
This section provides a skeleton overview of the pipeline commands. If you’ve run this pipeline before and understand the following sections, you can simply run these commands on your data. If this is your first time running this analysis, it’s strongly recommended that you read the following sections of this article, as they provide important background information.
If the test run succeeded (you will see a ‘Pipeline completed successfully’ message), continue with the analysis of your dataset. If it failed (there are multiple potential error messages), contact us at eResearch and we will work through the issue: eresearch@qut.edu.au.
You should see several output directories and files have been created in your ‘ampliseq_test’ directory. These contain the test analysis results. Have a look through these, as they are similar to the output from a full ampliseq run (i.e. on your dataset).
Need instructions on setting up NextFlow tower
Ampliseq output
As can be seen in the test results (see above section), ampliseq produces a ‘results’ directory with several subdirectories, which contain various analysis outputs. These are outlined here:
https://nf-co.re/ampliseq/1.1.3/output
Briefly, these include quality control and a variety of taxonomic diversity and abundance analyses, using QIIME2 and DADA2 as core analysis tools.
These directories contain various tables and figures that can be used in either downstream analysis or directly in publications.
Running ampliseq on your dataset
In this section we will focus primarily on the commands and files you need to run the pipeline on your data. A complete description of the ampliseq pipeline is available on the ampliseq website below; to properly understand the ampliseq processes and analysis outputs, it is advisable to read through it thoroughly.
https://github.com/nf-core/ampliseq
Ampliseq requires the creation of some additional files and modification of parameter files in order to run on your dataset. Instructions are below.
Directory structure and files
Make sure you have created a subdirectory in your ‘nextflow’ directory. Give it a meaningful name (e.g. mkdir <yourprojectname>_nextflow). Make sure you are in that directory (cd ~/nextflow/<yourprojectname>_nextflow).
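For example, on the command line (the project name here is illustrative only):
Code Block
# Create a project subdirectory under your 'nextflow' directory and move into it
mkdir -p ~/nextflow/myproject_nextflow
cd ~/nextflow/myproject_nextflow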
As with the test run, you will need to download some datafiles and create some new files (manifest file, metadata file, nextflow.config file) to get ampliseq running on the HPC.
Taxonomic database
Download the Silva database. This is the main database ampliseq uses for taxonomic classification.
Code Block
wget https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip
Manifest file
In the test run, a list of your filenames and the associated sample ID was included in the nextflow.config file. With many sample files it’s much easier to include this information in a separate manifest file.
This is a tab delimited file that contains 3 columns:
‘sampleID’, with the sample IDs. You can call these whatever you like, but they should be meaningful (e.g. groupA_1, groupA_2, groupB_1, groupB_2, etc)
‘forwardReads’. The full path for the forward reads. e.g. /home/myproject/fastq/sample1_S22_L001_R1.fastq.gz
‘reverseReads’. The full path for the reverse reads. e.g. /home/myproject/fastq/sample1_S22_L001_R2.fastq.gz
This file can be created with Excel and then saved as a tab-delimited file (File → Save as → ‘manifest.txt’ → Text (Tab delimited)), then copied across to your NextFlow project directory (using WinSCP, Cyberduck, etc). Example:
sampleID | forwardReads | reverseReads |
---|---|---|
groupA_1 | /home/myproject/fastq/sample1_S22_L001_R1.fastq.gz | /home/myproject/fastq/sample1_S22_L001_R2.fastq.gz |
groupA_2 | /home/myproject/fastq/sample2_S23_L001_R1.fastq.gz | /home/myproject/fastq/sample2_S23_L001_R2.fastq.gz |
etc… |
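If your fastq files follow a consistent naming pattern, you can also build the manifest directly on the HPC rather than in Excel. The sketch below is an example only: it assumes paired files ending in _R1.fastq.gz / _R2.fastq.gz in a single directory, and that the sample ID is everything before the first underscore in the file name; adjust the path and naming logic to your own data.
Code Block
# Write the required header (tab-separated)
printf 'sampleID\tforwardReads\treverseReads\n' > manifest.txt
# Loop over the forward read files and derive the matching reverse read and sample ID
for r1 in /home/myproject/fastq/*_R1.fastq.gz; do
    r2=${r1/_R1.fastq.gz/_R2.fastq.gz}
    sample=$(basename "$r1" | cut -d_ -f1)
    printf '%s\t%s\t%s\n' "$sample" "$r1" "$r2" >> manifest.txt
done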
Metadata file
This is a tab-separated values file (.tsv) that is required by QIIME2 to compare taxonomic diversity with phenotype (e.g. how diversity varies per experimental treatment). It contains the same sample IDs found in the manifest file and a column for each category of metadata you have for the samples. This may include sequence barcodes, experimental treatment group (e.g. high fat vs low fat) and any other measurements taken, such as age, date collected, tissue type, sex, collection location, weight, length, etc. QIIME2 will compare every metadata column with taxonomic results, then calculate and plot correlations and diversity indices. See here for more details:
https://docs.qiime2.org/2019.10/tutorials/metadata/
To get a better idea of the structure and format of the metadata file, you can download an example file from here:
https://data.qiime2.org/2019.10/tutorials/moving-pictures/sample_metadata.tsv
This file can also be created in Excel and saved as a tab-delimited file. It can be any format, but .tsv is the default (File → Save as → ‘metadata.tsv’ → Text (Tab delimited))
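As a layout illustration only (the sample IDs match the manifest example above, but the metadata columns here are hypothetical; use your own categories, and check the QIIME2 example file linked above for the expected ID column heading), a metadata file might look like:
ID | treatment | collection_date |
---|---|---|
groupA_1 | high_fat | 2020-01-15 |
groupA_2 | high_fat | 2020-01-15 |
groupB_1 | low_fat | 2020-01-16 |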
nextflow.config file
Create a custom nextflow.config file
Code Block
module load nano
nano nextflow.config
Edit the newly created ‘nextflow.config’ file to contain something like the following:
Code Block
params {
    max_cpus = 32
    max_memory = 512.GB
    max_time = 48.h
    FW_primer = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG"
    RV_primer = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC"
    metadata = "metadata.txt"
    reference_database = "Silva_132_release.zip"
    retain_untrimmed = true
}
NOTE: This is an example nextflow.config file. Don’t simply copy and paste the above. You’ll need to modify the FW_primer = and RV_primer = lines to reflect the primers you used to generate your sequences. The remaining lines can stay the same, presuming you called your metadata file ‘metadata.txt’ and have all the files in the directory where you will be running ampliseq from.
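Before launching, it can help to confirm that everything the config refers to is actually in your working directory:
Code Block
# Expect to see something like: manifest.txt  metadata.txt  nextflow.config  Silva_132_release.zip
ls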
Running NextFlow’s ampliseq pipeline
Start an interactive PBS session:
Code Block
qsub -I -S /bin/bash -l walltime=72:00:00 -l select=1:ncpus=4:mem=8gb
Load Java (required by Nextflow):
Code Block
module load java
Change into your project directory (the path shown is for this example dataset):
Code Block
cd nextflow/mahsa_illumina_16S/
Running with:
Code Block
nextflow run nf-core/ampliseq -profile singularity --manifest manifest.txt
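If a run is interrupted or fails partway through, Nextflow can pick up from its cached results rather than starting again, using the standard -resume flag:
Code Block
nextflow run nf-core/ampliseq -profile singularity --manifest manifest.txt -resume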
Running analysis on QUT’s HPC
...
If you haven’t been set up on the HPC, or haven’t used it previously, click on this link for information on how to get access to and use the HPC:
Need a link here for HPC access and usage
Creating a shared workspace on the HPC
...
To request a node using PBS, submit a shell script containing your RAM/CPU/analysis time requirements and the code needed to run your analysis. For an overview of submitting a PBS job, see here:
Need a link here for creating PBS jobs
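As a rough sketch only (the job name, resource requests and commands below are placeholders; adapt them to your own analysis), a PBS submission script looks something like this:
Code Block
#!/bin/bash -l
#PBS -N ampliseq_run
#PBS -l select=1:ncpus=4:mem=8gb
#PBS -l walltime=24:00:00

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR
# Load Java (needed by Nextflow) and launch the pipeline
module load java
nextflow run nf-core/ampliseq -profile singularity --manifest manifest.txt
Save this as, for example, run_ampliseq.sh and submit it with qsub run_ampliseq.sh.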
Alternatively, you can start up an ‘interactive’ node, using the following:
...