Content Comparison

This guide provides a step-by-step guide to 1) convert BAM files (i.e., public) to FASTQ; and 2) run the nextflow nf-core/sarek variant calling pipeline.

...

Code Block
conda activate liver

Prepare a file called environment.yml - Tip: use a text editor (i.e., vim, nano, or other) to copy and paste the code below into the file.

...

Move to the folder where all the BAM files are present and prepare the following script (i.e.,launch_BAM2FASTQ.pbs):

Code Block

#!/bin/bash -l
#PBS -N BAM2FASTQ
#PBS -l walltime=24:00:00
#PBS -l mem=8gb
#PBS -l ncpus=4

cd $PBS_O_WORKDIR

#activate the conda environment with the necessary tools
conda activate liver

#Sort reads in BAM file by indentifier-name (-n) using 4 CPUs (-@ 4). Note 'prefix' for sorted file noted after $i (input BAM file)
for i in `ls --color=never *.bam`
do
  echo $i
  samtools sort -@ 4 -n $i ${i%%.bam}_sorted
done

#Extract paired end reads in FASTQ format
for file in `ls --color=never *sorted.bam`
do
  echo $file
  bedtools bamtofastq -i $file -fq ${file%%.bam}_R1.fastq -fq2 ${file%%.bam}_R2.fastq
  #compress FASTQ files to run using the sarek pipeline
  gzip -c -9 ${file%%.bam}_R1.fastq > ${file%%.bam}_R1.fastq.gz
  gzip -c -9 ${file%%.bam}_R1.fastq > ${file%%.bam}_R2.fastq.gz
done

...

To run Sarek 3 files are required:

launch.pbs → details how to run the workflow
~/.nextflow/config → specify how to run the workflow in the HPC
samplesheet.csv → provides information on the samples and data to be used (i.e., FASTQ, BAM or CRAM)

We will run the Sarek pipeline in three phases:

Phase I: Preprocessing, mapping, markduplicates, recalibrate
Phase II: Variant calling
Phase III: Annotation

PHASE I - preprocessing

Below is an example of a launch_phase1.pbs file for mapping onto selected genome:

Code Block

#!/bin/bash -l
#PBS -N sarek_I
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=1:mem=5gb
cd $PBS_O_WORKDIR
NXF_OPTS='-Xms1g -Xmx4g'
module load java

nextflow run nf-core/sarek \
        -r 3.1.1 \
        -profile singularity \
        --genome GATK.GRCh38 \
        --input samplesheet.csv

...

~/.nextflow/config file: (Note: You may already have this file if you installed Nextflow using this guide )

Code Block

singularity {
    cacheDir = '$HOME/NXF_SINGULARITY_CACHEDIR'
    autoMounts = true
}

conda {
    cacheDir = '$HOME/NXF_CONDA_CACHEDIR'
}

singularity {
    enabled = true
    autoMounts = true
}

process {
  executor = 'pbspro'
  beforeScript = {
      """
      source $HOME/.bashrc
      source $HOME/.profile
      """
  }
  scratch = false
  cleanup = false
}

Example of a samplesheet.csv file:

Code Block
patient,sample,lane,fastq_1,fastq_2 healthy_11,1,1,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R1.fastq.gz,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R2.fastq.gz

Prepare a samplesheet.csv file that contains the information of all the samples to be processed. Once ready, submit the job to the PBS scheduler:

Code Block
qsub launch_phase1.pbs

PHASE II - variant calling

Prepare/edit the following launch_phase2.pbs script:

Code Block

#!/bin/bash -l
#PBS -N sarek_II
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=1:mem=5gb
cd $PBS_O_WORKDIR
NXF_OPTS='-Xms1g -Xmx4g'
module load java

#run the sarek pipeline
nextflow run nf-core/sarek \
        -r 3.1.1 \
        -profile singularity \
        --genome GATK.GRCh38 \
        --inputconfig samplesheetnextflow.csvconfig \
        --wesstep variant_calling \
        --tools 'deepvariant,freebayes,haplotypecaller,manta,mpileup,snpeff,strelka,vep'haplotypecaller \
        --wes \
        -resume

~/.nextflow/config file: (Note: You may already have this file if you installed Nextflow using this guide )

...

Note: Sarek will automatically detect the input for the variant calling phase based on the results from the phase 1 outputs (i.e., results/csv/recalibrate.csv)

Submit the job to the PBS scheduler:

Code Block
qsub launch_phase2.pbs

monitor the progress on the HPC:

Code Block
qjobs

Alternatively, view the progress of the submitted job on the Nextflow Tower.

PHASE III - annotation

Prepare/edit the following launch_phase3.pbs script:

Code Block

#!/bin/bash -l
#PBS -N sarek_III
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=1:mem=5gb
cd $PBS_O_WORKDIR
NXF_OPTS='-Xms1g -Xmx4g'
module load java

#run the sarek pipeline
nextflow run nf-core/sarek \
        -r 3.1.1 \
        -profile singularity \
        --genome GATK.GRCh38 \
        -config nextflow.config \
        --step annotate \
       source $HOME/.profile
  --tools vep,snpeff \
    """   } --wes \
scratch = false   cleanup = false }

Example of an samplesheet.csv file:

Code Block

patient,sample,lane,fastq_1,fastq_2
healthy_11,1,1,/path/to/data/NAFLD_exome_sequencing-166537376/1.Healthy/rename/Healthy_Combined_11_sorted_R1.fastq.gz,/path/to/data/NAFLD_exome_sequencing-166537376/1.Healthy/rename/Healthy_Combined_11_sorted_R2.fastq.gz-resume

Similarly to Phase 2, the Sarek pipeline will automatically detect the VCF input file for running annotation using the selected tool(s).

Submit the job to the PBS scheduler:

Code Block
qsub launch_phase3.pbs

monitor the progress on the HPC:

Code Block
qjobs

Alternatively, view the progress of the submitted job on the Nextflow Tower.

Version	Old Version 7	New Version 8
Changes made by	Craig Windell	Roberto Barrero Gumiel
Saved on	24/01/2023	24/01/2023

Versions Compared

Key

PHASE I - preprocessing

PHASE II - variant calling

PHASE III - annotation