This guide provides step-by-step instructions to:

1) convert BAM files (e.g., from public datasets) to paired-end FASTQ files; and

2) run the Nextflow nf-core/sarek variant calling pipeline for whole exome sequencing (WES) datasets.

...

Date created: 19/01/2023; Last updated: 25/01/2023

Pre-requisites:

This user guide assumes that you have already installed conda/Miniconda3 and Nextflow. If you have not done this yet, follow the instructions below:
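
If conda and Nextflow are missing, they can be installed into your home directory. The commands below are a minimal sketch (installer URL, paths, and versions are illustrative; many HPCs also provide these tools through environment modules):

Code Block
# Install Miniconda3 into $HOME/miniconda3 (illustrative location)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
$HOME/miniconda3/bin/conda init bash

# Install Nextflow (requires Java 11+, e.g. via 'module load java')
curl -s https://get.nextflow.io | bash
mkdir -p $HOME/bin && mv nextflow $HOME/bin/   # assumes $HOME/bin is on your PATH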

Create a Conda environment with tools needed for downstream analyses

Log into the HPC using your user credentials. To install tools, we will use an interactive PBS session:

Code Block
qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=2:mem=4gb

Create a Python 3.7 environment:

Code Block
conda create --name liver python=3.7

...

Code Block
conda install -c bioconda bedtools

To install several tools at once, prepare a file called environment.yml (example below). Tip: use a text editor (e.g., vim or nano) to copy and paste the code below into the file.
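
As an illustrative sketch only, an environment.yml listing a few of the tools used in this guide might look like the block below (the tool list and versions are assumptions; edit them to match your analysis):

Code Block
# Illustrative environment.yml - adjust the name, channels, and tools as needed
name: liver
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.7
  - samtools
  - bedtools

The environment described in the file can then be created with conda env create -f environment.yml, or added to an existing environment with conda env update -n liver -f environment.yml.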

...

Code Block
conda deactivate

Convert BAM to FASTQ

Move to the folder where all the BAM files are located and prepare the following script (e.g., launch_BAM2FASTQ.pbs):

Code Block
#!/bin/bash -l
#PBS -N BAM2FASTQ
#PBS -l walltime=24:00:00
#PBS -l mem=8gb
#PBS -l ncpus=4

cd $PBS_O_WORKDIR

#activate the conda environment with the necessary tools
conda activate liver

#Sort reads in each BAM file by read name (-n) using 4 CPUs (-@ 4); the sorted output is written to <input>_sorted.bam
for i in `ls --color=never *.bam`
do
  echo $i
  samtools sort -@ 4 -n -o ${i%%.bam}_sorted.bam $i
done

#Extract paired end reads in FASTQ format
for file in `ls --color=never *sorted.bam`
do
  echo $file
  bedtools bamtofastq -i $file -fq ${file%%.bam}_R1.fastq -fq2 ${file%%.bam}_R2.fastq
  #compress FASTQ files to run using the sarek pipeline
  gzip -c -9 ${file%%.bam}_R1.fastq > ${file%%.bam}_R1.fastq.gz
  gzip -c -9 ${file%%.bam}_R2.fastq > ${file%%.bam}_R2.fastq.gz
done
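
As an optional sanity check after the conversion has run, each R1/R2 pair should contain the same number of reads. The sketch below assumes the *_sorted_R1/_R2.fastq.gz naming produced by the script above:

Code Block
#R1 and R2 of each sample should report the same line count (FASTQ uses 4 lines per read)
for r1 in *_sorted_R1.fastq.gz
do
  r2=${r1%_R1.fastq.gz}_R2.fastq.gz
  echo "$r1: $(zcat $r1 | wc -l) lines; $r2: $(zcat $r2 | wc -l) lines"
done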

...

Check the submitted job(s):

Code Block
qjobs
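
qjobs is a site-specific convenience command; if it is not available on your cluster, the standard PBS command below provides similar information:

Code Block
qstat -u $USER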

Run variant calling using the Nextflow nf-core/sarek pipeline

source: https://nf-co.re/sarek/3.1.2

To run Sarek, three files are required:

  1. launch.pbs → specifies how to run the workflow

  2. ~/.nextflow/config → specifies how the workflow is executed on the HPC

  3. samplesheet.csv → provides information on the samples and data to be used (i.e., FASTQ, BAM, or CRAM)

...

  • Phase I: Preprocessing (mapping, mark duplicates, recalibration)

  • Phase II: Variant calling

  • Phase III: Annotation

PHASE I - preprocessing

Below is an example of a launch_phase1.pbs file for mapping onto the selected genome:

Code Block
#!/bin/bash -l
#PBS -N sarek_I
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=1:mem=5gb
cd $PBS_O_WORKDIR
export NXF_OPTS='-Xms1g -Xmx4g'
module load java

#specify the nextflow version to use to run the workflow
export NXF_VER=22.06.1-edge

#run the preprocessing tasks
nextflow run nf-core/sarek \
        -r 3.1.1 \
        -profile singularity \
        --genome GATK.GRCh38 \
        --input samplesheet.csv

~/.nextflow/config file (Note: you may already have this file if you installed Nextflow using this guide):

Code Block
singularity {
    enabled = true
    autoMounts = true
    cacheDir = '$HOME/NXF_SINGULARITY_CACHEDIR'
}

conda {
    cacheDir = '$HOME/NXF_CONDA_CACHEDIR'
}

process {
  executor = 'pbspro'
  beforeScript = {
      """
      source $HOME/.bashrc
      source $HOME/.profile
      """
  }
  scratch = false
  cleanup = false
}
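
To check that Nextflow resolves these settings as expected, the effective configuration can be printed from the directory in which the pipeline will be launched (this is a standard Nextflow command, not specific to sarek):

Code Block
nextflow config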

Example of a samplesheet.csv file:

Code Block
patient,sample,lane,fastq_1,fastq_2
healthy_11,1,1,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R1.fastq.gz,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R2.fastq.gz
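
With many samples, the samplesheet can also be generated from the compressed FASTQ files rather than typed by hand. The sketch below is one possible approach; it assumes the *_sorted_R1/_R2.fastq.gz naming produced earlier, a single lane per sample, and that the patient field can simply reuse the file prefix. Adapt it to your own naming scheme and experimental design:

Code Block
#write the header, then one line per R1/R2 pair (absolute paths)
echo "patient,sample,lane,fastq_1,fastq_2" > samplesheet.csv
for r1 in $PWD/*_sorted_R1.fastq.gz
do
  r2=${r1%_R1.fastq.gz}_R2.fastq.gz
  name=$(basename $r1 _sorted_R1.fastq.gz)
  echo "${name},1,1,${r1},${r2}" >> samplesheet.csv
done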

...

Code Block
qsub launch_phase1.pbs
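
Progress can be followed from the log that Nextflow writes in the launch directory (a hidden file created by every Nextflow run):

Code Block
tail -f .nextflow.log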

PHASE II - variant calling

Prepare/edit the following launch_phase2.pbs script:

Code Block
#!/bin/bash -l
#PBS -N sarek_II
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=1:mem=5gb
cd $PBS_O_WORKDIR
export NXF_OPTS='-Xms1g -Xmx4g'
module load java

#specify the nextflow version to use to run the workflow
export NXF_VER=22.06.1-edge

#run the sarek pipeline
nextflow run nf-core/sarek \
        -r 3.1.1 \
        -profile singularity \
        --genome GATK.GRCh38 \
        --step variant_calling \
        --tools haplotypecaller \
        --wes \
        -resume

...

Alternatively, view the progress of the submitted job on Nextflow Tower.
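
Monitoring with Nextflow Tower requires an access token generated from your Tower account (https://tower.nf). One minimal way to enable it, shown here as a sketch, is to export the token in the PBS script and add -with-tower to the nextflow run command:

Code Block
#the token value below is a placeholder; generate your own in the Tower web interface
export TOWER_ACCESS_TOKEN=<your_token>
#then append '-with-tower' to the 'nextflow run nf-core/sarek ...' command above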

PHASE III - annotation

Download the Singularity container for VEP into your cached Nextflow Singularity folder:

Code Block
singularity pull --name nfcore-vep-106.1.GRCh38.img docker://nfcore/vep:106.1.GRCh38

Prepare/edit the following launch_phase3.pbs script:

Code Block
#!/bin/bash -l
#PBS -N sarek_III
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=1:mem=5gb
cd $PBS_O_WORKDIR
export NXF_OPTS='-Xms1g -Xmx4g'
module load java

#specify the nextflow version to use to run the workflow
export NXF_VER=22.06.1-edge

#run the sarek pipeline
nextflow run nf-core/sarek \
        -r 3.1.1 \
        -profile singularity \
        --genome GATK.GRCh38 \
        --step annotate \
        --tools vep,snpeff \
        --wes \
        -resume

...