Prepared by the eResearch Office, QUT.

“Sarek is a workflow designed to detect variants on whole genome or targeted sequencing data”

This page provides a guide for QUT users to run the nf-core/sarek workflow on the QUT HPC.

Further details on the workflow can be found at:

https://nf-co.re/sarek

https://nf-co.re/sarek/2.6.1/usage

Install Nextflow

The Sarek workflow requires Nextflow to be installed in your account on the HPC. Find details on how to install and test Nextflow here, including how to prepare a nextflow.config file and run a PBS Pro submission script for Nextflow pipelines.

Additional information available here: https://nf-co.re/usage/installation

Installing Containers

Singularity is now installed and made available to all QUT users on the HPC. There is no need to install it.

Docker is not available on the HPC, so we recommend using Singularity to run the nf-core/sarek pipeline.

The Sarek Workflow Tools

The Sarek workflow will perform the following steps by default:

  • Sequencing quality control (FastQC)

  • Map Reads to Reference (BWA mem)

  • Mark Duplicates (GATK MarkDuplicatesSpark)

  • Base (Quality Score) Recalibration (GATK BaseRecalibrator, GATK ApplyBQSR)

  • Preprocessing quality control (samtools stats)

  • Preprocessing quality control (Qualimap bamqc)

  • Overall pipeline run summaries (MultiQC)

A number of optional tools can be run during the workflow execution:

Germline variant calling can currently only be performed with the following variant callers:

  • FreeBayes, HaplotypeCaller, Manta, mpileup, Strelka, TIDDIT

Somatic variant calling can currently only be performed with the following variant callers:

  • ASCAT, Control-FREEC, FreeBayes, Manta, MSIsensor, Mutect2, Strelka

Tumor-only somatic variant calling can currently only be performed with the following variant callers:

  • Control-FREEC, Manta, mpileup, Mutect2, TIDDIT

Annotation is done using snpEff, VEP, or even both consecutively.

To enable these tools, the option --tools must be provided on the command line, or params.tools added to the config file.

To use multiple tools use a comma to separate them.

For example, to use HaplotypeCaller, mpileup and snpEff during the pipeline run:

Code Block
--tools 'HaplotypeCaller,mpileup,snpEFF'

Or within nextflow.config

Code Block
params.tools = 'HaplotypeCaller,mpileup,snpEFF'

or

Code Block
params {
  tools = 'HaplotypeCaller,mpileup,snpEFF'
}
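A typo in a tool name is easy to miss in a long --tools value. The sketch below is one possible way to check a value before launching a run. It assumes only the tool names that appear on this page and compares them ignoring case. check_tools and KNOWN_TOOLS are hypothetical names, not part of Sarek or Nextflow, and the spellings a given Sarek release accepts should be confirmed in the Sarek documentation.

```shell
# Tool names taken from this page (an assumption; not an official Sarek list).
KNOWN_TOOLS="ASCAT Control-FREEC FreeBayes HaplotypeCaller Manta mpileup MSIsensor Mutect2 Strelka TIDDIT snpEff VEP CNVkit"

# check_tools is a hypothetical helper. It prints every name it does not
# recognise and returns non-zero if at least one was found.
check_tools() {
  rc=0
  # Turn the comma-separated value into a space-separated word list.
  for tool in $(printf '%s' "$1" | tr ',' ' '); do
    tool_lc=$(printf '%s' "$tool" | tr '[:upper:]' '[:lower:]')
    found=0
    for known in $KNOWN_TOOLS; do
      known_lc=$(printf '%s' "$known" | tr '[:upper:]' '[:lower:]')
      if [ "$tool_lc" = "$known_lc" ]; then
        found=1
        break
      fi
    done
    if [ "$found" -eq 0 ]; then
      echo "unknown tool: $tool"
      rc=1
    fi
  done
  return "$rc"
}

# Example: the value used on this page passes the check.
check_tools 'HaplotypeCaller,mpileup,snpEFF' && echo "tools look fine"
```

A misspelling such as HaypeCaller would be reported as an unknown tool, and the function would return a non-zero status.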

Workflow steps

Sarek has these steps

mapping, prepare_recalibration, recalibrate, variant_calling, annotate, ControlFREEC

Data requirements change depending on the step you start with. Check the Sarek documentation for details.

Run test

Code Block
#run test
nextflow run nf-core/sarek -profile test,singularity

#resume
nextflow run nf-core/sarek -profile test,singularity -resume 
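Before running the test profile it can help to confirm that both programs it relies on are on your PATH. The following is a minimal sketch. require_cmd is a hypothetical helper name, and the check assumes that nextflow and singularity are available as plain commands in your login shell without loading further modules.

```shell
# require_cmd is a hypothetical helper: it reports whether a command is on PATH
# and returns non-zero when it is not.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Check the two programs used by "-profile test,singularity".
missing=0
for cmd in nextflow singularity; do
  require_cmd "$cmd" || missing=1
done

if [ "$missing" -eq 0 ]; then
  echo "ready to run the test profile"
fi
```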

Preparing Data

Starting the Sarek workflow at the “mapping” step requires paired FASTQ files (see the Sarek documentation for details). You need to create a tab-separated text file that will be the input for the workflow. For mapping, the file requires the following columns:

subject sex status sample lane fastq1 fastq2

Example sample.tsv

Code Block
Subject01 XX  0 Sample01  1 /work/group/data/subject01-sample01_R1.fastq.gz /work/group/data/subject01-sample01_R2.fastq.gz
Subject02 XX  0 Sample01  1 /work/group/data/subject02-sample01_R1.fastq.gz /work/group/data/subject02-sample01_R2.fastq.gz
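Editors often replace tabs with spaces, which breaks a tab-separated file without any visible change. The sketch below shows one way to write a row with real tab characters and to confirm that every row has the seven columns listed above. write_sample_row and check_sample_tsv are hypothetical helper names, and the paths are the example paths from this page.

```shell
# write_sample_row is a hypothetical helper that prints one tab-separated row.
# Arguments, in order: subject sex status sample lane fastq1 fastq2
write_sample_row() {
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' "$@"
}

# check_sample_tsv is a hypothetical helper. It reports every row that does
# not have exactly 7 tab-separated fields and returns non-zero if any is found.
check_sample_tsv() {
  awk -F '\t' '
    BEGIN { bad = 0 }
    NF != 7 { printf "line %d has %d columns\n", NR, NF; bad = 1 }
    END { exit bad }
  ' "$1"
}

# Example: write the first row from this page, then verify the file.
write_sample_row Subject01 XX 0 Sample01 1 \
  /work/group/data/subject01-sample01_R1.fastq.gz \
  /work/group/data/subject01-sample01_R2.fastq.gz > sample.tsv

check_sample_tsv sample.tsv && echo "sample.tsv has 7 columns per row"
```

Append further rows with >> instead of > so that earlier rows are kept.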

Selecting a Genome

Please see Reference Genomes » nf-core (nf-co.re) for details on the genomes available.

Provide the genome on the command line or in the nextflow.config file:

Code Block
--genome 'GRCh38'

or

Code Block
params {
  genome = 'GRCh38'
}

Putting it all together

Create a folder to store the run input and output.

The basic command to run Sarek is

Code Block
nextflow run nf-core/sarek -profile singularity --input 'input.tsv' --genome 'GRCh38' --tools 'HaplotypeCaller,mpileup,snpEFF'

Or, create a nextflow.config file to store the options in a different place.

Code Block
params {
  input = 'sample.tsv'
  genome = 'GRCh38'
  tools = 'HaplotypeCaller,mpileup,snpEFF,VEP,CNVkit'
}
tower {
  accessToken = 'your tower token'
  endpoint = 'https://nftower.qut.edu.au/api'
  enabled = true
}

Replace 'your tower token' with your own Tower access token. A token is assigned to you once you sign in via https://nftower.qut.edu.au/api

With this file in place, the command to run the pipeline is

Code Block
nextflow run nf-core/sarek -profile singularity

Preparing to run on the HPC

To run this on the HPC, a PBS submission script needs to be created.

In the folder you created for this run, create launch.pbs:

Code Block
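#!/bin/bash -l
#PBS -N MySarekRun
#PBS -l walltime=168:00:00
#PBS -l select=1:ncpus=1:mem=5gb

# Run from the folder the job was submitted from
cd "$PBS_O_WORKDIR"

# Java memory limits for the Nextflow head job; exported so nextflow can read them
export NXF_OPTS='-Xms1g -Xmx4g'
module load java

# Remaining options are read from nextflow.config in this folder
nextflow run nf-core/sarek -profile singularity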

An alternative option to run Sarek (define parameters in the command)

Code Block
#!/bin/bash -l
#PBS -N MySarekRun
#PBS -l walltime=168:00:00
#PBS -l select=1:ncpus=1:mem=5gb
cd $PBS_O_WORKDIR
export NXF_OPTS='-Xms1g -Xmx4g'
module load java

#specify the nextflow version to use to run the workflow
export NXF_VER=22.06.1-edge

nextflow run nf-core/sarek -profile singularity \
  --input sample.tsv -name GRCh38_FBS1_LNCAP \
  --genome GRCh38 --tools HaplotypeCaller,snpEff,VEP \
  --generate_gvcf \
  -r 3.1.1

Submitting the job

Once you have created the folder for the run, the input.tsv file, nextflow.config and launch.pbs you are ready to submit.

Submit the run with this command (On Lyra)

Code Block
qsub launch.pbs
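A job that starts without its input files fails only after it has waited in the queue. As a rough sketch, the files listed above can be checked before submitting. check_run_files is a hypothetical helper name, and the file names are the ones used on this page; adjust them if your run folder uses different names.

```shell
# check_run_files is a hypothetical helper. The first argument is the run
# folder, the remaining arguments are file names expected inside it.
# It prints each missing file and returns non-zero if any is absent.
check_run_files() {
  dir=$1
  shift
  rc=0
  for f in "$@"; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f"
      rc=1
    fi
  done
  return "$rc"
}

# Submit only when every expected file is present in the current folder.
if check_run_files . input.tsv nextflow.config launch.pbs; then
  qsub launch.pbs
fi
```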

Monitoring the Run

You can use the following command to check on the jobs you are running:

Code Block
qstat -u $USER

Nextflow will launch additional jobs during the run.

You can also check the .nextflow.log file for details on what is going on.
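The log file can grow large, so a quick summary is often easier to read than the whole file. The sketch below counts lines by log level and shows the most recent errors. It assumes that log lines carry the level names ERROR and WARN as plain text, which you should confirm against your own .nextflow.log; summarise_log is a hypothetical helper name.

```shell
# summarise_log is a hypothetical helper that counts the lines containing
# ERROR and WARN in the given log file.
summarise_log() {
  # grep -c exits non-zero when the count is 0, so the status is ignored.
  errors=$(grep -c 'ERROR' "$1" || true)
  warnings=$(grep -c 'WARN' "$1" || true)
  echo "errors=$errors warnings=$warnings"
}

# Summarise the log in the current run folder, then show the last few errors.
if [ -f .nextflow.log ]; then
  summarise_log .nextflow.log
  grep 'ERROR' .nextflow.log | tail -n 5
fi
```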

Finally, if you have configured the connection to NFTower, you can log on and check your run there.