Overview
Initial requirements
To be able to run these exercises, you’ll need:
An HPC account
Nextflow installed
Access to your HPC home directory from your PC
Instructions for getting an HPC account are here: https://qutvirtual4.qut.edu.au/group/staff/research/conducting/facilities/advanced-research-computing-storage/supercomputing/getting-started-with-hpc
If you haven’t installed Nextflow, follow the instructions in this link: Installing Nextflow
Set up Windows File Explorer to access your HPC home account. Follow the instructions here:
Finally, it would be VERY useful to have completed session 1 of these workshops (Intro to the HPC). If you haven’t, you can watch some videos that go over the basics: https://mediahub.qut.edu.au/media/t/0_d0bsv333
Interactive HPC session
In sessions 2 and 3 (variant calling) we submitted jobs to the HPC via a PBS script. This is useful for large datasets that require lots of processing time or resources. For smaller datasets (like 16S amplicon sequence data), you can start ‘interactive mode’ on the HPC, which allocates you a temporary node with the RAM and CPUs you request.
Open PuTTY and paste the following at the command prompt:
qsub -I -S /bin/bash -l walltime=4:00:00 -l select=1:ncpus=16:mem=64gb
After a few minutes interactive mode will start. You will now be able to do all your analysis - including running Nextflow and Nanopore workflows - in this interactive session.
NOTE: I’ve selected 16 CPUs and 64gb of memory. This is based on testing of the Nextflow workflows we’ll be using and their CPU/memory requirements.
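Because the session can take a few minutes to start, it can help to confirm that you are inside it before running anything heavy. The following is a minimal sketch, assuming the scheduler sets the usual PBS variables PBS_JOBID and PBS_ENVIRONMENT inside a job; the function name pbs_session_status is made up for this example.

```shell
# Report whether the current shell is running inside a PBS job.
# Assumption: PBS sets PBS_JOBID and PBS_ENVIRONMENT inside a job,
# and they are absent on a login node.
pbs_session_status() {
    if [ -n "${PBS_JOBID:-}" ]; then
        echo "Inside PBS job ${PBS_JOBID} (${PBS_ENVIRONMENT:-unknown mode}) on $(uname -n)"
    else
        echo "Not inside a PBS job: the interactive session has not started yet"
    fi
}

pbs_session_status
```

If the second message appears, wait for the qsub prompt to return before continuing.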
Create working directories
We’ll be analysing both Illumina and Nanopore data, so first we need to create the workshop directories in your home drive on the HPC. Copy and paste the following into PuTTY:
cd $HOME
mkdir meta_workshop
mkdir meta_workshop/illumina
mkdir meta_workshop/illumina/fastq
mkdir meta_workshop/nanopore
mkdir meta_workshop/nanopore/fastq
cd meta_workshop
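As an alternative to the separate mkdir commands, the same directory layout can be created in one step with mkdir -p, which creates any missing parent directories and does not complain if a directory already exists. This is only a sketch: the WORKSHOP_ROOT variable is introduced here for illustration and defaults to the meta_workshop directory used above.

```shell
# Assumption: WORKSHOP_ROOT is a helper variable for this example only;
# it defaults to $HOME/meta_workshop, the directory used in this workshop.
WORKSHOP_ROOT="${WORKSHOP_ROOT:-$HOME/meta_workshop}"

# -p creates parent directories as needed and is safe to re-run.
mkdir -p "$WORKSHOP_ROOT/illumina/fastq" "$WORKSHOP_ROOT/nanopore/fastq"

# Move into the workshop directory and list the directory tree to check it.
cd "$WORKSHOP_ROOT"
find . -type d | sort
```

Either approach leaves you in meta_workshop, which is where the later commands expect to be run from.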
Modify Nextflow to run in ‘local’ mode
Since we’re not submitting our Nextflow run as a PBS script, we’ll need to change the parameters in the Nextflow config file to reflect this.
The following (run in PuTTY) will open your Nextflow config file in a text editor called Nano.
module load nano
nano $HOME/.nextflow/config
At the top of the file you’ll see a line that says executor = 'pbspro'
Change this to executor = 'local'
Then save the file by pressing <ctrl> o (press Enter to confirm the file name), and exit Nano with <ctrl> x.
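If you would rather not edit the file by hand, the same change can be sketched with sed. This assumes GNU sed (for the -i option) and that the line reads executor = 'pbspro' as described above; the function name set_local_executor is made up for this example. It keeps a backup copy with a .bak suffix so the original setting can be restored for PBS submissions later.

```shell
# Switch the Nextflow executor from 'pbspro' to 'local' in a config file.
# Assumptions: GNU sed is available, and the config contains a line of the
# form executor = 'pbspro'.
set_local_executor() {
    config_file="$1"
    # Keep a backup so the PBS setting can be restored later.
    cp "$config_file" "$config_file.bak"
    # Replace the executor value, tolerating extra spaces around '='.
    sed -i "s/executor *= *'pbspro'/executor = 'local'/" "$config_file"
    # Show the resulting executor line(s) for a quick visual check.
    grep -n "executor" "$config_file"
}

CONFIG="$HOME/.nextflow/config"
if [ -f "$CONFIG" ]; then
    set_local_executor "$CONFIG"
else
    echo "No Nextflow config found at $CONFIG" >&2
fi
```

To go back to submitting through PBS, copy the .bak file over the config or change 'local' back to 'pbspro'.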
Downloading a public dataset
The dataset we’ll be using comes from this paper: https://www.mdpi.com/2073-4425/11/9/1105 (more details in the Overview section).
The data is hosted by the European Nucleotide Archive (ENA). In the ENA Browser (https://www.ebi.ac.uk/ena/browser/view/PRJEB28612) you can find the project by the accession number listed in the paper: PRJEB28612. The ENA Browser can then generate a download script to run on a Linux command line.
To save time, I’ve already created this script and downloaded the dataset to the HPC. You’ll just need to copy these files to your workshop directories.
Copy the Illumina and Nanopore fastq files to their respective workshop directories like so:
cp /work/training/metagenomics/public_data/Illumina*.fastq.gz illumina/fastq
cp /work/training/metagenomics/public_data/Nanopore*.fastq.gz nanopore/fastq
This will copy the fastq files into your meta_workshop/illumina/fastq and meta_workshop/nanopore/fastq directories.
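Before moving on, it can be worth checking that the files arrived intact. The sketch below counts the fastq.gz files in each directory and tests each one with gzip -t, which verifies the compressed data without unpacking it. The function name check_fastq_dir is made up for this example, and the commands assume you are still in the meta_workshop directory.

```shell
# Count the .fastq.gz files in a directory and test each one with gzip -t.
check_fastq_dir() {
    dir="$1"
    count=0
    for f in "$dir"/*.fastq.gz; do
        # Skip the unexpanded pattern when the directory has no matches.
        [ -e "$f" ] || continue
        if gzip -t "$f"; then
            count=$((count + 1))
        else
            echo "Failed integrity check: $f" >&2
        fi
    done
    echo "$dir: $count valid fastq.gz file(s)"
}

check_fastq_dir illumina/fastq
check_fastq_dir nanopore/fastq
```

A count of 0 for either directory suggests the cp command was run from the wrong location or the source path did not match any files.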
Now we can go to the next section: Illumina using nfcore/ampliseq