Exercise 1 - test

Prior to running the nf-core/sarek pipeline with real data, we will first run a test with sample data to make sure the pipeline runs properly.

Work in the HPC

Before we start using the HPC, let’s start an interactive session:

qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=1:mem=4gb

Get a copy of the scripts to be used in this module

Use the terminal to log into the HPC, create a scripts folder for the nf-core/sarek workshop, and copy the training scripts into it. For example:

mkdir -p $HOME/workshop/sarek/scripts
cp /work/training/sarek/scripts/* $HOME/workshop/sarek/scripts/
ls -l $HOME/workshop/sarek/scripts/

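Line 1: The -p flag tells mkdir to create parent directories as required. Thus the line 1 command creates $HOME/workshop/, $HOME/workshop/sarek/ and the subfolder $HOME/workshop/sarek/scripts/ in one step.
Line 2: Copies all files (as denoted by the asterisk) from /work/training/sarek/scripts/ to the newly created folder $HOME/workshop/sarek/scripts/.
Line 3: Lists the contents of $HOME/workshop/sarek/scripts/ to confirm that the files were copied.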
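The behaviour of mkdir -p and the asterisk can be tried safely before touching your $HOME. The following is a minimal sketch that works in a throwaway directory; the "source" folder and the a.pbs and b.pbs file names are made up for the demonstration and are not part of the workshop data.

```shell
# Work in a temporary directory so nothing in $HOME is touched (demo only).
demo=$(mktemp -d)

# Hypothetical source folder with two placeholder scripts.
mkdir -p "$demo/source"
touch "$demo/source/a.pbs" "$demo/source/b.pbs"

# -p creates every missing parent directory in a single call.
mkdir -p "$demo/workshop/sarek/scripts"

# The asterisk expands to every file in the source folder.
cp "$demo/source/"* "$demo/workshop/sarek/scripts/"

# Confirm that both files arrived.
ls -l "$demo/workshop/sarek/scripts/"
```

Running mkdir -p a second time on an existing folder is harmless, which is why it is a safe default in setup steps like this one.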

Copy public data to your $HOME

mkdir -p $HOME/workshop/sarek/data/WES/trio
mkdir -p $HOME/workshop/sarek/data/WES/liver
cp /work/training/sarek/data/WES/trio/* $HOME/workshop/sarek/data/WES/trio
cp /work/training/sarek/data/WES/liver/* $HOME/workshop/sarek/data/WES/liver

Lines 1-2: Create the folders that will hold the copied data.
Line 3: Copies all files (as denoted by the asterisk) from the /work/training/sarek/data/WES/trio folder to the newly created $HOME/workshop/sarek/data/WES/trio folder.
Line 4: Copies all files (as denoted by the asterisk) from the /work/training/sarek/data/WES/liver folder to the newly created $HOME/workshop/sarek/data/WES/liver folder.

Create folders for running the nf-core/sarek pipeline

Let’s create a “sarek” folder with one sub-folder per exercise. For example:

mkdir -p $HOME/workshop/sarek
mkdir $HOME/workshop/sarek/run1_test
mkdir $HOME/workshop/sarek/run2_trio
mkdir $HOME/workshop/sarek/run3_liver
cd $HOME/workshop/sarek/run1_test
pwd

Lines 1-4: create the sarek folder and one sub-folder for each exercise
Line 5: change the directory to the folder “run1_test”
Line 6: print the current working directory

Exercise 1: Running a test with nf-core sample data

First, let’s assess the execution of the nf-core/sarek pipeline by running a test using sample data.

Copy the launch_nf-core_sarek_test.pbs script to the working directory

cd $HOME/workshop/sarek/run1_test
cp $HOME/workshop/sarek/scripts/launch_nf-core_sarek_test.pbs .

View the content of the script as follows:

cat launch_nf-core_sarek_test.pbs

#!/bin/bash -l
#PBS -N nfsarek_run1_test
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=1:mem=5gb

cd $PBS_O_WORKDIR
NXF_OPTS='-Xms1g -Xmx4g'
module load java

#specify the nextflow version to use to run the workflow
export NXF_VER=23.10.1

#run the sarek pipeline
nextflow run nf-core/sarek \
    -r 3.3.2 \
    -profile test,singularity \
    --outdir ./results

nextflow command: nextflow run
pipeline name: nf-core/sarek
pipeline version: -r 3.3.2
container type and sample data: -profile test,singularity
output directory: --outdir ./results
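To see how these parts fit together, the command can be assembled from variables and printed without running it. This is a minimal sketch; the variable names (pipeline, version, profile, outdir, cmd) are chosen for illustration and do not appear in the workshop script. Note that options with a single dash (-r, -profile) belong to Nextflow itself, while options with a double dash (--outdir) are parameters of the pipeline.

```shell
# Illustrative variable names; the values are those used in the test script.
pipeline="nf-core/sarek"      # pipeline name
version="3.3.2"               # pipeline version (Nextflow option -r)
profile="test,singularity"    # sample data + container type (Nextflow option -profile)
outdir="./results"            # output directory (pipeline parameter --outdir)

# Build the full command as a string and print it (dry run, nothing is executed).
cmd="nextflow run $pipeline -r $version -profile $profile --outdir $outdir"
echo "$cmd"
```

Printing the command first is a cheap way to catch a typo in a version or profile name before the job spends time in the queue.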

Submitting the job

Submit the test job to the HPC cluster as follows:

qsub launch_nf-core_sarek_test.pbs

Monitoring the Run

qjobs
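If qjobs is not available, the standard PBS command qstat lists your jobs as well, and the Nextflow log shows how far the run has progressed. The following is a minimal sketch, assuming the job was submitted from the run1_test folder so that Nextflow writes its .nextflow.log file there; show_log is a hypothetical helper name, not a cluster command.

```shell
# List your own jobs with the standard PBS command, if it is installed.
if command -v qstat >/dev/null 2>&1; then
  qstat -u "$(id -un)"
else
  echo "qstat not found: run this on the HPC"
fi

# Hypothetical helper: print the last lines of the Nextflow log in a run folder.
# $1 = run folder, $2 = number of lines to show (defaults to 20).
show_log() {
  log="$1/.nextflow.log"
  if [ -f "$log" ]; then
    tail -n "${2:-20}" "$log"
  else
    echo "no .nextflow.log in $1 yet"
  fi
}

show_log "$HOME/workshop/sarek/run1_test"
```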

Outputs:

The test run should take about 14 min to complete. Find the run outputs in the “results” folder:

results/
├── csv
│   ├── markduplicates.csv
│   ├── markduplicates_no_table.csv
│   ├── recalibrated.csv
│   └── variantcalled.csv
├── multiqc
│   ├── multiqc_data
│   ├── multiqc_plots
│   └── multiqc_report.html
├── pipeline_info
│   ├── execution_report_2024-05-08_15-28-38.html
│   ├── execution_timeline_2024-05-08_15-28-38.html
│   ├── execution_trace_2024-05-08_15-28-38.txt
│   ├── params_2024-05-08_15-41-30.json
│   ├── pipeline_dag_2024-05-08_15-28-38.html
│   └── software_versions.yml
├── preprocessing
│   ├── markduplicates
│   ├── recalibrated
│   └── recal_table
├── reports
│   ├── bcftools
│   ├── fastqc
│   ├── markduplicates
│   ├── mosdepth
│   ├── samtools
│   └── vcftools
├── tabix
│   ├── genome.bed.gz
│   └── genome.bed.gz.tbi
└── variant_calling
    └── strelka
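A quick way to confirm that the run produced its key outputs is to test for a few of the files listed in the tree above. This is a minimal sketch; check_outputs is a hypothetical helper, and the three files it looks for are simply a sample taken from the tree, not an exhaustive list.

```shell
# Hypothetical helper: report whether a few expected files exist in a results folder.
# Returns 0 when all are present and 1 when at least one is missing.
check_outputs() {
  results="$1"
  missing=0
  for f in multiqc/multiqc_report.html \
           csv/variantcalled.csv \
           pipeline_info/software_versions.yml; do
    if [ -e "$results/$f" ]; then
      echo "ok       $f"
    else
      echo "MISSING  $f"
      missing=1
    fi
  done
  return "$missing"
}

# Run from inside the run1_test folder once the job has finished.
check_outputs ./results || echo "Some expected outputs are missing"
```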

Once the pipeline has finished running, assess the QC report:

NOTE: To proceed, you need to be on QUT’s WiFi network or signed in via VPN.

To browse the working folder on the HPC, type the following in the file finder:

Windows PC

\\hpc-fs\work\training\sarek

Mac

smb://hpc-fs/work/training/sarek

Look into the “multiqc” folder and open the provided HTML report (multiqc_report.html). Use it to evaluate the nucleotide distributions at the 5'-end and 3'-end of the sequenced reads (Read1 and Read2).

Items to check:

The overall quality of the experiment and reads. Look at the “Sequence Quality Histogram” plot. Phred quality scores are logarithmically linked to error probabilities. For example, if Phred assigns a quality score of 30 to a base, the chances that this base is called incorrectly are 1 in 1000.

Phred Quality Score	Probability of incorrect base call	Base call accuracy
10	1 in 10	90%
20	1 in 100	99%
30	1 in 1000	99.9%
40	1 in 10,000	99.99%
50	1 in 100,000	99.999%
60	1 in 1,000,000	99.9999%
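The values in the table follow from the standard Phred definition Q = -10 x log10(P), or equivalently P = 10^(-Q/10), where P is the probability of an incorrect base call. The following is a minimal sketch that reproduces the error probabilities with awk; phred_to_error is a hypothetical helper name introduced for illustration.

```shell
# Hypothetical helper: convert a Phred score Q to the error probability P = 10^(-Q/10).
phred_to_error() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Print the error probability for a few common quality scores.
for q in 10 20 30 40; do
  printf 'Q%s\t%s\n' "$q" "$(phred_to_error "$q")"
done
```

A score of 30 gives 0.001, which matches the “1 in 1000” row of the table.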