Version 3.1.1 - WES variant analysis

This guide provides step-by-step instructions to:

1) convert BAM files (e.g., from public datasets) to paired-end FASTQ files; and

2) run the Nextflow nf-core/sarek variant calling pipeline for whole exome sequencing (WES) datasets.

Pre-requisites:

This user guide assumes that you have already installed conda/miniconda3 and nextflow. If you have not done this yet, follow the instructions below:

Create a Conda environment with tools needed for downstream analyses

Log into the HPC using your user credentials. To install tools, we will use an interactive PBS session:

qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=2:mem=4gb

Create a Python 3.7 environment:

conda create --name liver python=3.7

Activate the conda environment:
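The activation command is not shown above. A minimal sketch, assuming the environment name `liver` from the previous step; the guard is only there so the snippet does not fail on a machine without conda:

```shell
# Name of the environment created in the previous step.
ENV_NAME=liver

if command -v conda >/dev/null 2>&1; then
    # Activation needs an initialised shell (see "conda init bash").
    conda activate "$ENV_NAME" || echo "activation failed: run 'conda init bash' and reopen the shell" >&2
else
    echo "conda not found: install miniconda3 first" >&2
fi
```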

Then you can install one tool at a time or provide a list (see below); however, installing a list of tools may take a while.

We will use samtools and bcftools for the first part of the tutorial. Run the following commands:
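The install commands are missing from this copy of the guide. One possible form, assuming both tools are taken from the bioconda channel and installed into the `liver` environment (`--yes` skips the confirmation prompt):

```shell
# Environment that receives the tools (created earlier in this guide).
ENV_NAME=liver

if command -v conda >/dev/null 2>&1; then
    # samtools and bcftools are distributed through bioconda.
    conda install --yes --name "$ENV_NAME" -c conda-forge -c bioconda samtools bcftools \
        || echo "conda install failed: check that the '$ENV_NAME' environment exists" >&2
else
    echo "conda not found: install miniconda3 first" >&2
fi
```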

To install several tools, prepare a file called environment.yml (example below). Tip: use a text editor (e.g., vim or nano) to copy and paste the code below into the file.

Run the following command to install additional tools:
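The original example file is not reproduced here. The sketch below writes an illustrative environment.yml from the shell instead of a text editor. The tool list is an assumption: samtools and bcftools come from this guide, while fastqc and multiqc are placeholders to be replaced with the tools you actually need.

```shell
# Write an illustrative environment.yml in the current directory.
cat > environment.yml <<'EOF'
name: liver
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.7
  - samtools
  - bcftools
  # Placeholder tools, not taken from the original guide:
  - fastqc
  - multiqc
EOF
```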
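A sketch of that command, assuming the file is named environment.yml and sits in the current directory, and that the tools should go into the existing `liver` environment:

```shell
ENV_FILE=environment.yml

if [ ! -f "$ENV_FILE" ]; then
    echo "$ENV_FILE not found in $(pwd)" >&2
elif command -v conda >/dev/null 2>&1; then
    # Installs everything listed in the file into the existing environment.
    conda env update --name liver --file "$ENV_FILE" \
        || echo "conda env update failed" >&2
else
    echo "conda not found: install miniconda3 first" >&2
fi
```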

To deactivate the conda environment, run:
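The command itself is missing above; in a standard conda installation it would be:

```shell
if command -v conda >/dev/null 2>&1; then
    # Leaves the current environment and returns to the previous one.
    conda deactivate || echo "nothing to deactivate" >&2
else
    echo "conda not found" >&2
fi
DEACTIVATE_DONE=yes
```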

Convert BAM to FASTQ

Move to the folder that contains the BAM files and prepare the following script (e.g., launch_BAM2FASTQ.pbs):
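The original script is not included in this copy. The sketch below writes one possible launch_BAM2FASTQ.pbs. It is not the author's script: the resource requests, the output naming (`<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz`) and the use of the `liver` conda environment are all assumptions. It follows the common samtools pattern of collating read pairs before splitting them into two FASTQ files.

```shell
# Write a hypothetical conversion script into the folder holding the BAM files.
cat > launch_BAM2FASTQ.pbs <<'EOF'
#!/bin/bash
#PBS -N bam2fastq
#PBS -l walltime=10:00:00
#PBS -l select=1:ncpus=4:mem=16gb
# Resource values above are placeholders; adjust them to your HPC.

# Run from the directory the job was submitted from (the BAM folder).
cd "$PBS_O_WORKDIR"

# Make "conda activate" available in a non-interactive shell.
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate liver

for bam in *.bam; do
    sample="${bam%.bam}"
    # collate groups mates together; fastq then writes R1 and R2 separately
    # and discards singletons and unflagged reads.
    samtools collate -@ 4 -u -O "$bam" \
        | samtools fastq -@ 4 -n \
            -1 "${sample}_R1.fastq.gz" \
            -2 "${sample}_R2.fastq.gz" \
            -0 /dev/null -s /dev/null -
done
EOF
```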

Submit the job to the PBS scheduler:
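Assuming the script name used above, submission would look like this (qsub prints the job ID on success):

```shell
JOB_SCRIPT=launch_BAM2FASTQ.pbs

if command -v qsub >/dev/null 2>&1; then
    qsub "$JOB_SCRIPT" || echo "qsub failed for $JOB_SCRIPT" >&2
else
    echo "qsub not found: run this on the HPC login node" >&2
fi
```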

Check the submitted job(s):
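On a PBS system this is typically done with qstat; a sketch that lists only your own jobs:

```shell
ME="$(whoami)"

if command -v qstat >/dev/null 2>&1; then
    # Shows job ID, name, state (Q = queued, R = running) and elapsed time.
    qstat -u "$ME" || echo "qstat returned an error" >&2
else
    echo "qstat not found: run this on the HPC login node" >&2
fi
```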

Run variant calling using the nextflow nf-core/sarek pipeline

source: https://nf-co.re/sarek/3.1.2

Three files are required to run Sarek:

  1. launch.pbs → details on how to run the workflow

  2. ~/.nextflow/config → specify how to run the workflow in the HPC

  3. samplesheet.csv → provides information on the samples and data to be used (i.e., FASTQ, BAM or CRAM)

We will run the Sarek pipeline in three phases:

  • Phase I: Preprocessing (mapping, marking duplicates, base quality recalibration)

  • Phase II: Variant calling

  • Phase III: Annotation

PHASE I - preprocessing

Below is an example of a launch_phase1.pbs file for mapping onto the selected genome:
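The example file is missing from this copy. The sketch below writes a possible launch_phase1.pbs under several assumptions: the release tag follows the version in this guide's title, the genome key is Sarek's `GATK.GRCh38`, the Singularity profile is used, and the target BED path and PBS resources are placeholders. No `--tools` option is given, which should leave Sarek running preprocessing only.

```shell
# Write a hypothetical Phase I launch script.
cat > launch_phase1.pbs <<'EOF'
#!/bin/bash
#PBS -N sarek_phase1
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=2:mem=8gb
# Resource values are placeholders; this job only hosts the Nextflow head process.

cd "$PBS_O_WORKDIR"

# Load or activate whatever provides nextflow and singularity on your HPC here.

nextflow run nf-core/sarek -r 3.1.1 \
    -profile singularity \
    --input samplesheet.csv \
    --outdir results \
    --genome GATK.GRCh38 \
    --wes \
    --intervals /path/to/exome_targets.bed \
    -resume
EOF
```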

~/.nextflow/config file (note: you may already have this file if you installed Nextflow using this guide):
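The site-specific file is not reproduced here. As a rough illustration only, the sketch below writes an example to a local file named nextflow_config_example (a hypothetical name, chosen so that an existing ~/.nextflow/config is not overwritten). It assumes a PBS Pro scheduler and Singularity; the cache directory is a placeholder, and your HPC may need further settings such as a queue name.

```shell
# Write an example config to a local file; copy it to ~/.nextflow/config
# only after adapting it to your HPC.
cat > nextflow_config_example <<'EOF'
// Hypothetical settings for a PBS Pro cluster; every value is a placeholder.
process {
    executor = 'pbspro'
}

singularity {
    enabled    = true
    autoMounts = true
    cacheDir   = "$HOME/NXF_SINGULARITY_CACHEDIR"
}
EOF
```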

Example of a samplesheet.csv file:
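The example is missing from this copy. The sketch below writes a two-row sheet using the column names Sarek 3 expects for FASTQ input. The patient, sample names and paths are invented placeholders; `status` is 0 for normal and 1 for tumour.

```shell
# Write an illustrative samplesheet for one tumour/normal pair.
cat > samplesheet.csv <<'EOF'
patient,sex,status,sample,lane,fastq_1,fastq_2
patient1,XX,0,patient1_normal,lane_1,/path/to/patient1_normal_R1.fastq.gz,/path/to/patient1_normal_R2.fastq.gz
patient1,XX,1,patient1_tumour,lane_1,/path/to/patient1_tumour_R1.fastq.gz,/path/to/patient1_tumour_R2.fastq.gz
EOF
```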

Prepare a samplesheet.csv file that contains the information of all the samples to be processed. Once ready, submit the job to the PBS scheduler:
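Assuming the launch_phase1.pbs and samplesheet.csv file names used in this guide, a submission that first checks both files are present could look like this:

```shell
JOB_SCRIPT=launch_phase1.pbs
SAMPLESHEET=samplesheet.csv

if [ ! -f "$JOB_SCRIPT" ] || [ ! -f "$SAMPLESHEET" ]; then
    echo "expected $JOB_SCRIPT and $SAMPLESHEET in $(pwd)" >&2
elif command -v qsub >/dev/null 2>&1; then
    qsub "$JOB_SCRIPT" || echo "qsub failed for $JOB_SCRIPT" >&2
else
    echo "qsub not found: run this on the HPC login node" >&2
fi
```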

PHASE II - variant calling

Prepare/edit the following launch_phase2.pbs script:
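The script is not included in this copy. A possible version is sketched below. The callers passed to `--tools` (`mutect2,strelka`) are placeholders, not the guide's choice, and the resources and BED path are assumptions. `--outdir` must match Phase I so that Sarek can find the preprocessing outputs.

```shell
# Write a hypothetical Phase II launch script.
cat > launch_phase2.pbs <<'EOF'
#!/bin/bash
#PBS -N sarek_phase2
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=2:mem=8gb
# Resource values are placeholders.

cd "$PBS_O_WORKDIR"

# Start from the recalibrated files produced in Phase I.
# Replace the --tools list with the variant callers you need.
nextflow run nf-core/sarek -r 3.1.1 \
    -profile singularity \
    --step variant_calling \
    --tools mutect2,strelka \
    --outdir results \
    --genome GATK.GRCh38 \
    --wes \
    --intervals /path/to/exome_targets.bed \
    -resume
EOF
```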

Note: Sarek automatically detects the input for variant calling from the Phase I outputs (i.e., results/csv/recalibrated.csv).

Submit the job to the PBS scheduler:
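For Phase II, assuming the script name above:

```shell
JOB_SCRIPT=launch_phase2.pbs

if command -v qsub >/dev/null 2>&1; then
    qsub "$JOB_SCRIPT" || echo "qsub failed for $JOB_SCRIPT" >&2
else
    echo "qsub not found: run this on the HPC login node" >&2
fi
```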

Monitor the progress on the HPC:

Alternatively, view the progress of the submitted job on Nextflow Tower.
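One way to do this, assuming a PBS scheduler and that you are in the directory the job was submitted from (Nextflow writes `.nextflow.log` there):

```shell
LOG_FILE=.nextflow.log

# Jobs currently queued or running under your user.
if command -v qstat >/dev/null 2>&1; then
    qstat -u "$(whoami)" || echo "qstat returned an error" >&2
else
    echo "qstat not found: run this on the HPC login node" >&2
fi

# Last lines of the Nextflow log, if the pipeline has started.
if [ -f "$LOG_FILE" ]; then
    tail -n 20 "$LOG_FILE"
else
    echo "no $LOG_FILE yet in $(pwd)" >&2
fi
```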

PHASE III - annotation

Download the Singularity container for VEP into your cached Nextflow Singularity folder:
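The exact command is missing from this copy. The sketch below shows the general pattern only. Both the cache directory and the image tag (`nfcore/vep:106.1.GRCh38`) are assumptions; check which VEP container your Sarek release requests before pulling. The file name follows Nextflow's convention of replacing `/` and `:` in the image name with `-` and appending `.img`.

```shell
# Assumed cache location and image tag; adjust both to your setup.
CACHE_DIR="${NXF_SINGULARITY_CACHEDIR:-$HOME/NXF_SINGULARITY_CACHEDIR}"
VEP_IMAGE="nfcore/vep:106.1.GRCh38"

# File name Nextflow looks for in its Singularity cache.
IMAGE_FILE="$(echo "$VEP_IMAGE" | sed 's#[/:]#-#g').img"

if command -v singularity >/dev/null 2>&1; then
    mkdir -p "$CACHE_DIR"
    singularity pull "$CACHE_DIR/$IMAGE_FILE" "docker://$VEP_IMAGE" \
        || echo "singularity pull failed" >&2
else
    echo "singularity not found: load it on the HPC first" >&2
fi
```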

Prepare/edit the following launch_phase3.pbs script:
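The script is not included in this copy. A possible version, assuming VEP is the only annotation tool and the same output directory and genome as in the earlier phases (resources are placeholders):

```shell
# Write a hypothetical Phase III launch script.
cat > launch_phase3.pbs <<'EOF'
#!/bin/bash
#PBS -N sarek_phase3
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=2:mem=8gb
# Resource values are placeholders.

cd "$PBS_O_WORKDIR"

# Annotate the VCF files produced in Phase II with VEP.
nextflow run nf-core/sarek -r 3.1.1 \
    -profile singularity \
    --step annotate \
    --tools vep \
    --outdir results \
    --genome GATK.GRCh38 \
    -resume
EOF
```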

As in Phase II, Sarek automatically detects the VCF input files for annotation with the selected tool(s).

Submit the job to the PBS scheduler:
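For Phase III, assuming the script name above:

```shell
JOB_SCRIPT=launch_phase3.pbs

if command -v qsub >/dev/null 2>&1; then
    qsub "$JOB_SCRIPT" || echo "qsub failed for $JOB_SCRIPT" >&2
else
    echo "qsub not found: run this on the HPC login node" >&2
fi
```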

Monitor the progress on the HPC (for example, with qstat).

Alternatively, view the progress of the submitted job on Nextflow Tower.