/
Version 3.1.1 - WES variant analysis

Version 3.1.1 - WES variant analysis

This guide provides a step-by-step guide to:

1) convert BAM files (i.e., public) to paired-end FASTQ files; and

2) run the nextflow nf-core/sarek variant calling pipeline for whole exome sequencing (WES) datasets.

Pre-requisites:

This user guide assumes that you have already installed conda/miniconda3 and nextflow. If you have not done this yet, follow the instructions below:

Create a Conda environment with tools needed for downstream analyses

Log into the HPC using your user credentials. To install tools, we will use an interactive PBS session:

qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=2:mem=4gb

Create a Python 3.7 environment:

conda create --name liver python=3.7

Activate the conda environment:

conda activate liver

Then you can install one tool at a time or provide a list (see below); however, installing a list of tools may take a while.

We will use samtools and bcftools for the first part of the tutorial. Run the following commands:

conda install -c bioconda samtools
conda install -c bioconda bedtools

To install several tools, prepare a file called environment.yml (example below). Tip: use a text editor (i.e., vim, nano, or other) to copy and paste the code below into the file.

channels: - bioconda - conda-forge dependencies: - seqkit - vcftools - bcftools - emboss

Run the following command to install additional tools

conda env update --file environment.yml

To deactivate the conda environment, run:

conda deactivate

Convert BAM to FASTQ

Move to the folder where all the BAM files are present and prepare the following script (i.e., launch_BAM2FASTQ.pbs):

#!/bin/bash -l #PBS -N BAM2FASTQ #PBS -l walltime=24:00:00 #PBS -l mem=8gb #PBS -l ncpus=4 cd $PBS_O_WORKDIR #activate the conda environment with the necessary tools conda activate liver #Sort reads in BAM file by indentifier-name (-n) using 4 CPUs (-@ 4). Note 'prefix' for sorted file noted after $i (input BAM file) for i in `ls --color=never *.bam` do echo $i samtools sort -@ 4 -n $i ${i%%.bam}_sorted done #Extract paired end reads in FASTQ format for file in `ls --color=never *sorted.bam` do echo $file bedtools bamtofastq -i $file -fq ${file%%.bam}_R1.fastq -fq2 ${file%%.bam}_R2.fastq #compress FASTQ files to run using the sarek pipeline gzip -c -9 ${file%%.bam}_R1.fastq > ${file%%.bam}_R1.fastq.gz gzip -c -9 ${file%%.bam}_R1.fastq > ${file%%.bam}_R2.fastq.gz done

Submit the job to the PBS scheduler:

qsub launch_BAM2FASTQ.pbs

Check the submitted job(s):

qjobs

Run variant calling using the nextflow nf-core/sarek pipeline

source: https://nf-co.re/sarek/3.1.2

To run Sarek 3 files are required:

  1. launch.pbs → details on how to run the workflow

  2. ~/.nextflow/config → specify how to run the workflow in the HPC

  3. samplesheet.csv → provides information on the samples and data to be used (i.e., FASTQ, BAM or CRAM)

We will run the Sarek pipeline in three phases:

  • Phase I: Preprocessing, mapping, markduplicates, recalibrate

  • Phase II: Variant calling

  • Phase III: Annotation

PHASE I - preprocessing

Below is an example of a launch_phase1.pbs file for mapping onto the selected genome:

#!/bin/bash -l #PBS -N sarek_I #PBS -l walltime=48:00:00 #PBS -l select=1:ncpus=1:mem=5gb cd $PBS_O_WORKDIR NXF_OPTS='-Xms1g -Xmx4g' module load java #specify the nextflow version to use to run the workflow export NXF_VER=22.06.1-edge #run the prepocessing tasks nextflow run nf-core/sarek \ -r 3.1.1 \ -profile singularity \ --genome GATK.GRCh38 \ --input samplesheet.csv

~/.nextflow/config file: (Note: You may already have this file if you installed Nextflow using this guide )

singularity { cacheDir = '$HOME/NXF_SINGULARITY_CACHEDIR' autoMounts = true } conda { cacheDir = '$HOME/NXF_CONDA_CACHEDIR' } singularity { enabled = true autoMounts = true } process { executor = 'pbspro' beforeScript = { """ source $HOME/.bashrc source $HOME/.profile """ } scratch = false cleanup = false }

Example of a samplesheet.csv file:

patient,sample,lane,fastq_1,fastq_2 healthy_11,1,1,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R1.fastq.gz,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R2.fastq.gz

Prepare a samplesheet.csv file that contains the information of all the samples to be processed. Once ready, submit the job to the PBS scheduler:

qsub launch_phase1.pbs

PHASE II - variant calling

Prepare/edit the following launch_phase2.pbs script:

#!/bin/bash -l #PBS -N sarek_II #PBS -l walltime=48:00:00 #PBS -l select=1:ncpus=1:mem=5gb cd $PBS_O_WORKDIR NXF_OPTS='-Xms1g -Xmx4g' module load java #specify the nextflow version to use to run the workflow export NXF_VER=22.06.1-edge #run the sarek pipeline nextflow run nf-core/sarek \ -r 3.1.1 \ -profile singularity \ --genome GATK.GRCh38 \ --step variant_calling \ --tools haplotypecaller \ --wes \ -resume

Note: Sarek will automatically detect the input for the variant calling phase based on the results from the phase 1 outputs (i.e., results/csv/recalibrate.csv)

Submit the job to the PBS scheduler:

qsub launch_phase2.pbs

monitor the progress on the HPC:

qjobs

Alternatively, view the progress of the submitted job on the Nextflow Tower.

PHASE III - annotation

Download the singularity container for VEP in your catched Nextflow Singularity folder

singularity pull --name nfcore-vep-106.1.GRCh38.img docker://nfcore/vep:106.1.GRCh38

Prepare/edit the following launch_phase3.pbs script:

#!/bin/bash -l #PBS -N sarek_III #PBS -l walltime=48:00:00 #PBS -l select=1:ncpus=1:mem=5gb cd $PBS_O_WORKDIR NXF_OPTS='-Xms1g -Xmx4g' module load java #specify the nextflow version to use to run the workflow export NXF_VER=22.06.1-edge #run the sarek pipeline nextflow run nf-core/sarek \ -r 3.1.1 \ -profile singularity \ --genome GATK.GRCh38 \ --step annotate \ --tools vep,snpeff \ --wes \ -resume

Similarly to Phase 2, the Sarek pipeline will automatically detect the VCF input file for running annotation using the selected tool(s).

Submit the job to the PBS scheduler:

qsub launch_phase3.pbs

monitor the progress on the HPC:

qjobs

Alternatively, view the progress of the submitted job on the Nextflow Tower.

 

Related content