Page Comparison

This guide provides a step-by-step guide to 1) convert BAM files (i.e., public) to FASTQ; and 2) run the nextflow nf-core/sarek variant calling pipeline.

Date created: 19/01/2023; Last update: 25/01/2023

Create a Conda environment with tools needed for downstream

...

analyses

Create a python 3.7 environment:

...

Code Block
conda activate liver

Prepare a file called environment.yml - Tip: use a text editor (i.e., vim, nano, or other) to copy and paste the code below into the file.

...

Move to the folder where all the BAM files are present and prepare the following script (i.e.,launch_BAM2FASTQ.pbs):

Code Block

#!/bin/bash -l
#PBS -N BAM2FASTQ
#PBS -l walltime=24:00:00
#PBS -l mem=8gb
#PBS -l ncpus=4

cd $PBS_O_WORKDIR

#activate the conda environment with the necessary tools
conda activate liver

#Sort reads in BAM file by indentifier-name (-n) using 4 CPUs (-@ 4). Note 'prefix' for sorted file noted after $i (input BAM file)
for i in `ls --color=never *.bam`
do
  echo $i
  samtools sort -@ 4 -n $i ${i%%.bam}_sorted
done

#Extract paired end reads in FASTQ format
for file in `ls --color=never *sorted.bam`
do
  echo $file
  bedtools bamtofastq -i $file -fq ${file%%.bam}_R1.fastq -fq2 ${file%%.bam}_R2.fastq
  #compress FASTQ files to run using the sarek pipeline
  gzip -c -9 ${file%%.bam}_R1.fastq > ${file%%.bam}_R1.fastq.gz
  gzip -c -9 ${file%%.bam}_R1.fastq > ${file%%.bam}_R2.fastq.gz
done

...

To run Sarek 3 files are required:

launch.pbs → details how to run the workflow
~/.nextflow/config → specify how to run the workflow in the HPC
samplesheet.csv → provides information on the samples and data to be used (i.e., FASTQ, BAM or CRAM)

...

PHASE I - preprocessing

Below is an example of a launch_phase1.pbs file for mapping onto selected genome:

...

Code Block

singularity {
    cacheDir = '$HOME/NXF_SINGULARITY_CACHEDIR'
    autoMounts = true
}

conda {
    cacheDir = '$HOME/NXF_CONDA_CACHEDIR'
}

singularity {
    enabled = true
    autoMounts = true
}

process {
  executor = 'pbspro'
  beforeScript = {
      """
      source $HOME/.bashrc
      source $HOME/.profile
      """
  }
  scratch = false
  cleanup = false
}

Example of a samplesheet.csv file:

Code Block
patient,sample,lane,fastq_1,fastq_2 healthy_11,1,1,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R1.fastq.gz,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R2.fastq.gz

...

Prepare/edit the following launch_phase2.pbs script:

Code Block

#!/bin/bash -l
#PBS -N sarek_II
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=1:mem=5gb
cd $PBS_O_WORKDIR
NXF_OPTS='-Xms1g -Xmx4g'
module load java

#run the sarek pipeline
nextflow run nf-core/sarek \
        -r 3.1.1 \
        -profile singularity \
        --genome GATK.GRCh38 \
        --step variant_calling \
        --tools haplotypecaller \
        --wes \
        -resume

...

Prepare/edit the following launch_phase3.pbs script:

Code Block

#!/bin/bash -l
#PBS -N sarek_III
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=1:mem=5gb
cd $PBS_O_WORKDIR
NXF_OPTS='-Xms1g -Xmx4g'
module load java

#run the sarek pipeline
nextflow run nf-core/sarek \
        -r 3.1.1 \
        -profile singularity \
        --genome GATK.GRCh38 \
        --step annotate \
        --tools vep,snpeff \
        --wes \
        -resume

...

Versions Compared

Old Version 12

New Version 13

Key

Create a Conda environment with tools needed for downstream

analyses

PHASE I - preprocessing