2. Initial setup

Overview

 

Initial requirements

 

To be able to run these exercises, you’ll need:

  1. An HPC account

  2. Nextflow installed on the HPC

  3. PuTTY installed on your local computer

  4. Access to your HPC home directory from your PC

 

Instructions for getting an HPC account are here: https://qutvirtual4.qut.edu.au/group/staff/research/conducting/facilities/advanced-research-computing-storage/supercomputing/getting-started-with-hpc

 

If you haven’t installed Nextflow, follow the instructions in this link: Installing Nextflow

 

You’ll need PuTTY on your PC to access the HPC.

You can download PuTTY from here: https://the.earth.li/~sgtatham/putty/latest/w64/putty.exe

Then enter the HPC (Lyra) address, lyra.qut.edu.au, as the host name and click ‘Open’.

[Screenshot: PuTTY session configuration window (image-20240527-223342.png)]
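If you’d rather not use PuTTY, any SSH client will work. For example, from Windows PowerShell (assuming OpenSSH is available on your PC, and substituting your own QUT username for the placeholder):

ssh your_username@lyra.qut.edu.au   # replace your_username with your QUT username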

 

Set up Windows File Explorer to access your HPC home directory. Follow the instructions here:

https://qutvirtual4.qut.edu.au/group/staff/research/conducting/facilities/advanced-research-computing-storage/supercomputing/using-hpc-filesystems

 

Finally, it would be VERY useful to have completed session 1 of these workshops (Intro to the HPC). If you haven’t, you can watch some videos that go over the basics: Running Jobs on the HPC - QUT MediaHub

 

Interactive HPC session

 

In sessions 2 and 3 (variant calling) we submitted jobs to the HPC via a PBS script. This is useful for large datasets that require a lot of processing time or resources. For smaller datasets (like 16S amplicon sequence data), you can instead start an ‘interactive’ session on the HPC, which allocates you a temporary node with the CPUs and RAM you request.

 

Open PuTTY and paste the command below into the terminal:

qsub -I -S /bin/bash -l walltime=4:00:00 -l select=1:ncpus=16:mem=128gb

After a few minutes interactive mode will start. You will now be able to do all your analysis - including running Nextflow and Nanopore workflows - in this interactive session.

NOTE: I’ve requested 16 CPUs and 128 GB of memory. This is based on testing of the Nextflow workflows we’ll be using and their CPU/memory requirements.
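Once the prompt returns, you can optionally confirm that you’re on a compute node and that your interactive job is running. This assumes the standard PBS Pro client tools (the same ones that provide qsub) are available on Lyra:

hostname          # should now show a compute node name rather than the login node
qstat -u $USER    # your interactive job should be listed with status 'R' (running)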

 

Additional interactive job for running the Nanopore analysis (this may start more quickly because it requests fewer resources):

qsub -I -S /bin/bash -l walltime=4:00:00 -l select=1:ncpus=5:mem=12gb

Create working directories

 

We’ll be analysing both Illumina and Nanopore data, so first we need to create the workshop directories in your home directory on the HPC. Copy and paste the following into PuTTY:

cd $HOME
mkdir meta_workshop
mkdir meta_workshop/illumina
mkdir meta_workshop/illumina/fastq
mkdir meta_workshop/nanopore
mkdir meta_workshop/nanopore/fastq
cd meta_workshop
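Equivalently, mkdir -p creates each nested path in one go; this does the same thing as the commands above and leaves you in the same directory:

mkdir -p $HOME/meta_workshop/illumina/fastq $HOME/meta_workshop/nanopore/fastq
cd $HOME/meta_workshop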

 

Modify Nextflow to run in ‘local’ mode

 

Since we’re not submitting our Nextflow run as a PBS script, we’ll need to change the executor setting in the Nextflow config file to reflect this.

The following command (run in PuTTY) will open your Nextflow config file in a text editor called Nano.
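The exact location of the config file depends on how Nextflow was set up on Lyra; assuming a per-user config at the default path ~/.nextflow/config (substitute your own path if the install instructions put it elsewhere), the command would be:

nano ~/.nextflow/config   # path is an assumption - use your actual Nextflow config location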

 

At the top of the file you’ll see a line that says executor = 'pbspro'

Change this to executor = 'local'
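For reference, here is a minimal sketch of what that part of the config might look like after the edit (your file will contain other settings; only the executor line needs to change):

process {
    executor = 'local'    // was 'pbspro'; 'local' runs tasks directly on the node you requested
}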

Then save the file by pressing <ctrl> o followed by Enter, and exit Nano with <ctrl> x.
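To double-check that the change was saved (again assuming the config sits at ~/.nextflow/config), you can print the executor line:

grep executor ~/.nextflow/config   # should now show executor = 'local'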

 

Downloading a public dataset

 

The dataset we’ll be using is from this paper: https://www.mdpi.com/2073-4425/11/9/1105 (more details in the Overview section).

The data is hosted by the European Nucleotide Archive (ENA). In the ENA Browser (https://www.ebi.ac.uk/ena/browser/view/PRJEB28612) you can find the project by searching for the accession number listed in the paper: PRJEB28612. The ENA Browser can then generate a download script to run on the Linux command line.

To save time, I’ve already created this script and downloaded the dataset to the HPC. You’ll just need to copy these files to your workshop directories.

Copy the Illumina and Nanopore fastq files to their respective workshop directories like so:
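The shared location of the downloaded files isn’t listed in this section, so the commands below are a sketch only: /path/to/shared/meta_workshop is a placeholder for the directory given in the workshop, and the *.fastq.gz pattern assumes the files are gzipped fastq.

cp /path/to/shared/meta_workshop/illumina/*.fastq.gz $HOME/meta_workshop/illumina/fastq/   # placeholder source path
cp /path/to/shared/meta_workshop/nanopore/*.fastq.gz $HOME/meta_workshop/nanopore/fastq/   # placeholder source path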

This will copy the fastq files into your meta_workshop/illumina/fastq and meta_workshop/nanopore/fastq directories.
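Optionally, check that the files arrived:

ls $HOME/meta_workshop/illumina/fastq | head
ls $HOME/meta_workshop/nanopore/fastq | head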

 

Now we can go to the next section: Illumina using nfcore/ampliseq