25S1W1 - Preparing working directory
Prior to running the nf-core/sarek pipeline with real data, we will first prepare the working directory and copy the scripts and data needed for the exercises.
Work on the HPC (aqua)
Before we start using the HPC, let’s start an interactive session:
qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=1:mem=4gb
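To confirm that the interactive session has started on a compute node (rather than the login node), you can run a quick check; these are generic shell/PBS commands, not workshop-specific scripts:
# Print the name of the node this shell is running on.
hostname
# Print the PBS job ID assigned to the interactive session.
echo $PBS_JOBID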
You should be in your home directory; if unsure, run the following command:
cd ~
List the existing files and folders:
ls -l
Let’s create a folder for the workshop:
Option #1 (recommended): create the full path in one command:
mkdir -p $HOME/workshop/2025/S1W1/variant_calling
Option #2: make the directories one at a time:
mkdir $HOME/workshop
mkdir $HOME/workshop/2025/
mkdir $HOME/workshop/2025/S1W1/
mkdir $HOME/workshop/2025/S1W1/variant_calling
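Whichever option you choose, you can verify that the folder structure exists with a recursive listing:
# List the workshop directory tree to confirm the folders were created.
ls -R $HOME/workshop/2025/S1W1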
Get a copy of the scripts to be used in this module
Now let’s create a ‘scripts’ folder and copy all the scripts that we will be using in this session:
mkdir -p $HOME/workshop/2025/S1W1/variant_calling/scripts
cp /work/training/2025/S1W1/session2_variant_calling/scripts/* $HOME/workshop/2025/S1W1/variant_calling/scripts/
Line 1: The -p flag tells mkdir to create parent directories as required. Thus, this command creates $HOME/workshop/2025/S1W1/variant_calling/scripts along with any missing parent folders.
Line 2: Copies all files (the asterisk matches every file) from /work/training/2025/S1W1/session2_variant_calling/scripts/ to the newly created folder $HOME/workshop/2025/S1W1/variant_calling/scripts/
Let’s check the list of files copied:
ls -l $HOME/workshop/2025/S1W1/variant_calling/scripts/
.
├── create_samplesheet_nf-core_sarek.py
├── launch_nf-core_sarek_liver.pbs
├── launch_nf-core_sarek_trio.pbs
├── run_create_sarek_samplesheet.sh
└── samplesheet.csv
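The tree-style view above can be reproduced with the tree utility, assuming it is installed on the HPC (plain ls -l works regardless):
tree $HOME/workshop/2025/S1W1/variant_calling/scripts/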
Create folders for running the nf-core/sarek pipeline
Let’s create a run folder for each nf-core/sarek exercise and move into the variant_calling folder. For example:
mkdir -p $HOME/workshop/2025/S1W1/variant_calling/runs/run1_trio
mkdir -p $HOME/workshop/2025/S1W1/variant_calling/runs/run2_liver
cd $HOME/workshop/2025/S1W1/variant_calling
Lines 1-2: create a sub-folder for each exercise
Line 3: change the directory to the “variant_calling” folder
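To confirm that both run folders are in place, list the runs directory:
ls -l $HOME/workshop/2025/S1W1/variant_calling/runs/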
(Optional): Running a test with nf-core sample data
First, let’s assess the execution of the nf-core/sarek pipeline by running a test using sample data.
Copy the launch_nf-core_sarek_test.pbs script to the working directory:
mkdir -p $HOME/workshop/2025/S1W1/variant_calling/runs/run_test
cd $HOME/workshop/2025/S1W1/variant_calling/runs/run_test
cp $HOME/workshop/2025/S1W1/variant_calling/scripts/launch_nf-core_sarek_test.pbs .
View the content of the script as follows:
cat launch_nf-core_sarek_test.pbs
nextflow command: nextflow run
pipeline name: nf-core/sarek
pipeline version: -r 3.4.4
container type and sample data: -profile test,singularity
output directory: --outdir results
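For reference, below is a minimal sketch of what such a PBS launch script typically looks like, assembled from the components listed above. The resource requests and the module name are assumptions and may differ from the script provided for the workshop; always check the actual file with cat.
#!/bin/bash -l
#PBS -N nfsarek_test
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=02:00:00

# Move to the directory the job was submitted from.
cd $PBS_O_WORKDIR

# Load Nextflow (module name is an assumption; check module avail on your HPC).
module load nextflow

# Run the nf-core/sarek test profile with Singularity containers,
# writing all outputs to the "results" folder.
nextflow run nf-core/sarek -r 3.4.4 -profile test,singularity --outdir results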
Submitting the job
Submit the test job to the HPC cluster as follows:
qsub launch_nf-core_sarek_test.pbs
Monitoring the Run
qjobs
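If qjobs is not available in your environment, the standard PBS command below provides similar information:
# List your queued and running jobs (standard PBS command).
qstat -u $USER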
Outputs:
The test run should take approximately 14 minutes to complete. Find the run outputs in the “results” folder:
results/
├── csv
│ ├── markduplicates.csv
│ ├── markduplicates_no_table.csv
│ ├── recalibrated.csv
│ └── variantcalled.csv
├── multiqc
│ ├── multiqc_data
│ ├── multiqc_plots
│ └── multiqc_report.html
├── pipeline_info
│ ├── execution_report_2024-05-08_15-28-38.html
│ ├── execution_timeline_2024-05-08_15-28-38.html
│ ├── execution_trace_2024-05-08_15-28-38.txt
│ ├── params_2024-05-08_15-41-30.json
│ ├── pipeline_dag_2024-05-08_15-28-38.html
│ └── software_versions.yml
├── preprocessing
│ ├── markduplicates
│ ├── recalibrated
│ └── recal_table
├── reports
│ ├── bcftools
│ ├── fastqc
│ ├── markduplicates
│ ├── mosdepth
│ ├── samtools
│ └── vcftools
├── tabix
│ ├── genome.bed.gz
│ └── genome.bed.gz.tbi
└── variant_calling
└── strelka
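If you would like to peek at the variants called by Strelka, the compressed VCF files can be inspected on the command line. The exact file layout under the strelka folder depends on the test samples, so the wildcard path below is an assumption; adjust it to match your run:
# Show the header and the first few variant records of a Strelka VCF.
zcat results/variant_calling/strelka/*/*.vcf.gz | head -n 40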
Once the pipeline has finished running, assess the QC report.
NOTE: To proceed, you need to be on QUT’s WiFi network or signed in via VPN.
To browse the working folder on the HPC, type the following path in your file browser:
Windows PC
\\hpc-fs\work\training\rnaseq
Mac
smb://hpc-fs/work/training/rnaseq
Evaluate the nucleotide distributions at the 5'-end and 3'-end of the sequenced reads (Read 1 and Read 2). Look in the “multiqc” folder and open the provided HTML report (multiqc_report.html).
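Alternatively, if you cannot access the HPC file share, the report can be copied to your local machine and opened in a web browser. The command below is a sketch to run on your local machine; <username> and <hpc-hostname> are placeholders for your own HPC credentials and the cluster's login node address:
# Copy the MultiQC report from the HPC to the current local folder.
scp <username>@<hpc-hostname>:~/workshop/2025/S1W1/variant_calling/runs/run_test/results/multiqc/multiqc_report.html .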
Go to next section: 25S1W1 - Case study 1: GiB family trio