2. Initial setup
Overview
Initial requirements
To be able to run these exercises, you’ll need:
An HPC account
Nextflow installed on the HPC
PuTTY installed on your local computer
Access to your HPC home directory from your PC
Instructions for getting an HPC account are here: https://qutvirtual4.qut.edu.au/group/staff/research/conducting/facilities/advanced-research-computing-storage/supercomputing/getting-started-with-hpc
If you haven’t installed Nextflow, follow the instructions in this link: Installing Nextflow
You’ll need PuTTY on your PC to access the HPC.
You can download PuTTY from here: https://the.earth.li/~sgtatham/putty/latest/w64/putty.exe
Open PuTTY, enter the HPC (Lyra) address lyra.qut.edu.au in the ‘Host Name’ field, and then click ‘Open’.
Set up Windows File Explorer to access your HPC home account. Follow the instructions here:
Finally, it would be VERY useful if you’ve completed session 1 of these workshops (Intro to the HPC). If not, you can watch some videos that go over the basics: https://mediahub.qut.edu.au/media/t/0_d0bsv333
Interactive HPC session
In sessions 2 and 3 (variant calling) we submitted jobs to the HPC via a PBS script. This is useful for large datasets that require lots of processing time or resources. For smaller datasets (like 16S amplicon sequence data), you can start ‘interactive mode’ on the HPC, which allocates you a temporary node with the RAM and CPUs you request.
Open PuTTY and paste the text below at the command prompt:
qsub -I -S /bin/bash -l walltime=4:00:00 -l select=1:ncpus=16:mem=128gb
After a few minutes interactive mode will start. You will now be able to do all your analysis - including running Nextflow and Nanopore workflows - in this interactive session.
NOTE: I’ve selected 16 CPUs, 128 GB of memory and a 4-hour walltime limit. The CPU and memory values are based on testing of the Nextflow workflows we’ll be using.
Additional interactive job for running the Nanopore analysis (this may start sooner because it requests fewer resources):
qsub -I -S /bin/bash -l walltime=4:00:00 -l select=1:ncpus=5:mem=12gb
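Once either interactive job starts, it can be worth confirming that your shell is actually running on the allocated node rather than the login node. The following is a minimal sketch, not part of the original workshop material. It assumes the scheduler sets the usual PBS Professional environment variables `PBS_JOBID`, `PBS_ENVIRONMENT` and `NCPUS` inside a job; the function name `in_pbs_job` is made up for illustration.

```shell
# Report whether this shell is running inside a PBS job.
# ASSUMPTION: PBS_JOBID, PBS_ENVIRONMENT and NCPUS are set by the scheduler
# inside a job and are unset on the login node.
in_pbs_job() {
    if [ -n "${PBS_JOBID:-}" ]; then
        echo "Inside PBS job ${PBS_JOBID} (${PBS_ENVIRONMENT:-unknown mode}), CPUs: ${NCPUS:-unknown}"
    else
        echo "Not inside a PBS job: you are probably still on the login node"
        return 1
    fi
}

in_pbs_job || true
```

If the message says you are not inside a job, wait for the `qsub -I` prompt to return before starting any analysis.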
Create working directories
We’ll be analysing both Illumina and Nanopore data, so first we need to create the workshop directories in your home drive on the HPC. Copy and paste the following into PuTTY:
cd $HOME
mkdir meta_workshop
mkdir meta_workshop/illumina
mkdir meta_workshop/illumina/fastq
mkdir meta_workshop/nanopore
mkdir meta_workshop/nanopore/fastq
cd meta_workshop
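The same directory tree can also be created in a single step. This is a minimal sketch using the `-p` flag of `mkdir`, which creates any missing parent directories and does not complain if a directory already exists, so it is safe to re-run:

```shell
# Create both fastq directories (and their parents) in one command.
mkdir -p "$HOME/meta_workshop/illumina/fastq" "$HOME/meta_workshop/nanopore/fastq"

# Move into the workshop directory and list the tree to confirm it exists.
cd "$HOME/meta_workshop"
find . -type d | sort
```

Either approach gives the same result; the step-by-step version above just makes each directory explicit.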
Modify Nextflow to run in ‘local’ mode
Since we’re not submitting our Nextflow run as a PBS script, we’ll need to change the parameters in the Nextflow config file to reflect this.
The following (run in PuTTY) will open up your Nextflow config file in a text editor called Nano.
Near the top of the file you’ll see a line that says executor = 'pbspro'
Change this to executor = 'local'
Then save the file by pressing Ctrl+O followed by Enter, and press Ctrl+X to exit Nano.
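If you would rather not edit the file by hand, the same change can be sketched with `sed`. This is an illustration only: the helper name `set_local_executor` is made up, and the config path in the usage line is an assumption (Nextflow reads a user-level config from `$HOME/.nextflow/config`, but your setup may use a different file), so point it at the file you opened above. It also assumes GNU `sed`, which supports in-place editing with `-i`.

```shell
# Switch the executor in a Nextflow config file from PBS Pro to local.
# Usage: set_local_executor /path/to/config
set_local_executor() {
    config="$1"
    if [ ! -f "$config" ]; then
        echo "Config file not found: $config" >&2
        return 1
    fi
    # Keep a backup copy before editing.
    cp "$config" "$config.bak"
    # Replace the pbspro executor setting with local (GNU sed in-place edit).
    sed -i "s/executor *= *'pbspro'/executor = 'local'/" "$config"
    # Print the executor line(s) so you can check the result.
    grep -n "executor" "$config"
}

# ASSUMPTION: the config lives at $HOME/.nextflow/config; adjust if yours differs.
# set_local_executor "$HOME/.nextflow/config"
```

The backup file (`config.bak`) lets you restore the `pbspro` setting later if you go back to submitting runs as PBS jobs.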
Downloading a public dataset
The dataset we’ll be using is from this paper: https://www.mdpi.com/2073-4425/11/9/1105 (more details in the Overview section).
The data is hosted by the European Nucleotide Archive (ENA). In the ENA Browser you can find the project by the accession number listed in the paper: PRJEB28612. The ENA Browser can then generate a download script to run on a Linux command line.
To save time, I’ve already created this script and downloaded the dataset to the HPC. You’ll just need to copy these files to your workshop directories.
Copy the Illumina and Nanopore fastq files to their respective workshop directories like so:
This will copy the fastq files into your meta_workshop/illumina/fastq and meta_workshop/nanopore/fastq directories.
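After copying, a quick count of the files in each directory helps confirm that nothing was missed. This is a minimal sketch: the helper name `count_fastq` is made up, and the file extensions it looks for (`.fastq`, `.fastq.gz`, `.fq.gz`) are an assumption about how the workshop files are named.

```shell
# Count FASTQ files directly inside a directory.
# ASSUMPTION: files end in .fastq, .fastq.gz or .fq.gz.
count_fastq() {
    find "$1" -maxdepth 1 -type f \
        \( -name '*.fastq' -o -name '*.fastq.gz' -o -name '*.fq.gz' \) | wc -l
}

# Report the count for each workshop directory, or flag it as missing.
for d in "$HOME/meta_workshop/illumina/fastq" "$HOME/meta_workshop/nanopore/fastq"; do
    if [ -d "$d" ]; then
        echo "$d: $(count_fastq "$d") fastq file(s)"
    else
        echo "$d: directory missing"
    fi
done
```

A count of zero in either directory suggests the copy step did not run as expected.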
Now we can go to the next section: Illumina using nfcore/ampliseq