Pre-requisites:
(Optional) Familiarity with a Unix text editor (for example Vi/Vim or Nano). If needed, review this one-hour detailed introduction to the Vim editor: https://www.youtube.com/watch?v=IiwGbcd8S7I
Installing PuTTY and connecting to the HPC (Windows users; Mac users can use the built-in Terminal app directly)
Install PuTTY:
Installing PuTTY - QUT MediaHub
Connect to the HPC:
Connecting to the HPC with PuTTY - QUT MediaHub
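For Mac (or Linux) users, the connection from the terminal is a standard SSH login. A minimal example is shown below; the hostname is a placeholder, so substitute your own HPC address and username:
ssh your_username@hpc.example.edu.au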
BYO data or download public small RNA-seq datasets
Either bring your own dataset or use the following guide to download public small RNA-seq data.
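If you opt for public data, the sketch below shows how such a download typically looks with the SRA Toolkit; the module name is system-dependent and the accession is one of those used in the samplesheet later in this guide, so adjust both as needed.
#load the SRA Toolkit (module name varies between systems)
module load sra-tools
#download one run, convert it to FASTQ, then compress it
prefetch SRR24302008
fasterq-dump SRR24302008
gzip SRR24302008.fastq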
Download Reference microRNA sequences from miRBase
First, let’s download a copy of the miRBase reference sequences, including hairpin and mature microRNA sequences.
Hairpin sequences:
wget https://www.mirbase.org/download/CURRENT/hairpin.fa
gzip -c hairpin.fa > hairpin.fa.gz
Mature microRNA sequences:
wget https://www.mirbase.org/download/CURRENT/mature.fa
gzip -c mature.fa > mature.fa.gz
Alternatively, submit the following PBS Pro script to the cluster. Before running the script, create a ‘reference’ folder (e.g., /myteam/data/reference/).
#!/bin/bash -l
#PBS -N nfsmrnaseq
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR

wget https://www.mirbase.org/download/CURRENT/hairpin.fa
gzip -c hairpin.fa > hairpin.fa.gz
wget https://www.mirbase.org/download/CURRENT/mature.fa
gzip -c mature.fa > mature.fa.gz
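For example, assuming the script above is saved as download_mirbase.pbs (an illustrative name), create the reference folder and submit the job from there:
mkdir -p /myteam/data/reference/
cd /myteam/data/reference/
qsub download_mirbase.pbs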
Run a test
Before running the pipeline with real data, run the following test:
nextflow run nf-core/smrnaseq -profile test,singularity --outdir results -r 2.1.0
To submit the above command to the HPC cluster, prepare the following script:
#!/bin/bash -l
#PBS -N nfsmrnaseq
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

#load java and set up memory settings to run nextflow
module load java
NXF_OPTS='-Xms1g -Xmx4g'

nextflow run nf-core/smrnaseq -profile test,singularity --outdir results -r 2.1.0
Submitting the job
Once you have created the folder for the run, the samplesheet.csv file, nextflow.config, and launch.pbs, you are ready to submit.
Submit the run with this command
qsub launch.pbs
Monitoring the Run
You can use the command
qstat -u $USER
Alternatively, use the command
qjobs
to check on the job that you are running. Note that Nextflow will launch additional jobs during the run.
You can also check the .nextflow.log file for details on what is going on.
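For example, to follow the log in real time from the run folder (Ctrl-C stops following; it does not stop the run):
tail -f .nextflow.log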
Preparing a sample metadata file
Now let’s prepare a samplesheet.csv file that specifies the names of your samples and the locations of the raw FASTQ files:
sample,fastq_1
SRR24302008,/path/to/raw/FASTQ/files/SRR24302008.fastq.gz
SRR24302009,/path/to/raw/FASTQ/files/SRR24302009.fastq.gz
SRR24302010,/path/to/raw/FASTQ/files/SRR24302010.fastq.gz
SRR24302011,/path/to/raw/FASTQ/files/SRR24302011.fastq.gz
SRR24302012,/path/to/raw/FASTQ/files/SRR24302012.fastq.gz
SRR24302013,/path/to/raw/FASTQ/files/SRR24302013.fastq.gz
SRR24302014,/path/to/raw/FASTQ/files/SRR24302014.fastq.gz
SRR24302015,/path/to/raw/FASTQ/files/SRR24302015.fastq.gz
SRR24302016,/path/to/raw/FASTQ/files/SRR24302016.fastq.gz
SRR24302017,/path/to/raw/FASTQ/files/SRR24302017.fastq.gz
SRR24302018,/path/to/raw/FASTQ/files/SRR24302018.fastq.gz
SRR24302019,/path/to/raw/FASTQ/files/SRR24302019.fastq.gz
SRR24302020,/path/to/raw/FASTQ/files/SRR24302020.fastq.gz
SRR24302021,/path/to/raw/FASTQ/files/SRR24302021.fastq.gz
SRR24302022,/path/to/raw/FASTQ/files/SRR24302022.fastq.gz
SRR24302023,/path/to/raw/FASTQ/files/SRR24302023.fastq.gz
SRR24302024,/path/to/raw/FASTQ/files/SRR24302024.fastq.gz
SRR24302025,/path/to/raw/FASTQ/files/SRR24302025.fastq.gz
SRR24302026,/path/to/raw/FASTQ/files/SRR24302026.fastq.gz
SRR24302027,/path/to/raw/FASTQ/files/SRR24302027.fastq.gz
To generate the above file, let’s use the following PBS Pro script (e.g., called “launch_create_smRNAseq_samplesheet.pbs”):
#!/bin/bash -l
#PBS -N samplesheet
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=12:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

#User-defined variables
##########################################################
DIR='/path/to/raw/FASTQ/files'
INDEX='samplesheet.csv'
##########################################################

#load python module
module load python/3.10.8-gcccore-12.2.0

#fetch the script to create the sample metadata table
wget -L https://raw.githubusercontent.com/nf-core/rnaseq/master/bin/fastq_dir_to_samplesheet.py
chmod +x fastq_dir_to_samplesheet.py

#generate initial sample metadata file
./fastq_dir_to_samplesheet.py $DIR index.csv \
    --strandedness auto \
    --read1_extension .fastq.gz

#keep only the sample and fastq_1 columns
cat index.csv | awk -F "," '{print $1 "," $2}' > ${INDEX}

#remove intermediate files
rm index.csv fastq_dir_to_samplesheet.py
Assign the path where the raw FASTQ files are located to the “DIR” variable above. For example, run the following command in the folder containing the FASTQ files to print its full path:
pwd
Copy and paste the printed path into the script above using Vi or Vim (see the prerequisites above).
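Alternatively, assuming both the script and the raw FASTQ files sit in your current working directory, the placeholder path can be patched in place with sed rather than edited by hand; the job can then be submitted and the resulting samplesheet inspected. This is only a convenience sketch built on the file names used above.
#optional: replace the placeholder DIR path with the current directory
sed -i "s|/path/to/raw/FASTQ/files|$(pwd)|" launch_create_smRNAseq_samplesheet.pbs
#submit the samplesheet job and, once it has finished, check the result
qsub launch_create_smRNAseq_samplesheet.pbs
head samplesheet.csv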
Run the nf-core/smrnaseq Nextflow pipeline
Create a launch_nfsmRNAseq.pbs file that has the following information:
#!/bin/bash -l
#PBS -N nfsmrnaseq
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

#work on current directory (folder)
cd $PBS_O_WORKDIR

#load java and set up memory settings to run nextflow
module load java
NXF_OPTS='-Xms1g -Xmx4g'

nextflow run nf-core/smrnaseq -r 2.1.0 \
    -profile singularity \
    --outdir outdir \
    --input samplesheet.csv \
    --genome GRCh38 \
    --three_prime_adapter 'AACTGTAGGCACCATCAAT' \
    --fastp_min_length 18 \
    --fastp_max_length 30 \
    --hairpin /work/trtp/data/mirbase/hairpin.fa.gz \
    --mature /work/trtp/data/mirbase/mature.fa.gz
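Before submitting, it can help to confirm that the input samplesheet and the miRBase reference files referenced in the script actually exist at the stated paths:
ls -lh samplesheet.csv
ls -lh /work/trtp/data/mirbase/hairpin.fa.gz /work/trtp/data/mirbase/mature.fa.gz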
Submit the job to the PBS scheduler:
qsub launch_nfsmRNAseq.pbs
Monitor the progress on the HPC:
qjobs
Alternatively, view the progress of the submitted run in Nextflow Tower.
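A minimal sketch of how Tower monitoring is typically enabled, assuming you already have a Tower access token (the value below is a placeholder): export the token in the PBS script before launching Nextflow and add the -with-tower option to the run command.
#placeholder token; obtain a real one from your Tower account
export TOWER_ACCESS_TOKEN='<your-tower-access-token>'
#then add the -with-tower flag to the existing nextflow run command in launch_nfsmRNAseq.pbs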