
This guide shows, step by step, how to 1) convert BAM files (e.g., publicly available data) to FASTQ; and 2) run the Nextflow nf-core/sarek variant calling pipeline.

Date created: 19/01/2023; Last update: 20/01/2023

Create a Conda environment with tools needed for downstream analysis

...

Code Block
conda activate liver

Prepare a file called environment.yml. Tip: use a text editor (e.g., vim, nano, or other) to copy and paste the code below into the file.

...

Move to the folder that contains all the BAM files and prepare the following script (e.g., launch_BAM2FASTQ.pbs):

Code Block

#!/bin/bash -l
#PBS -N BAM2FASTQ
#PBS -l walltime=24:00:00
#PBS -l mem=8gb
#PBS -l ncpus=4

cd $PBS_O_WORKDIR

#activate the conda environment with the necessary tools
conda activate liver

#Sort reads in each BAM file by read name (-n) using 4 CPUs (-@ 4).
#The output file (-o) is named after the input BAM with a _sorted suffix.
#Note: samtools >= 1.3 needs -o; the older 'samtools sort in.bam out_prefix' form is no longer accepted.
for i in *.bam
do
  echo "$i"
  samtools sort -@ 4 -n -o "${i%.bam}_sorted.bam" "$i"
done

#Extract paired-end reads in FASTQ format
for file in *sorted.bam
do
  echo "$file"
  bedtools bamtofastq -i "$file" -fq "${file%.bam}_R1.fastq" -fq2 "${file%.bam}_R2.fastq"
  #compress the FASTQ files for use with the sarek pipeline
  gzip -c -9 "${file%.bam}_R1.fastq" > "${file%.bam}_R1.fastq.gz"
  gzip -c -9 "${file%.bam}_R2.fastq" > "${file%.bam}_R2.fastq.gz"
done
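
Before passing the compressed files to sarek, it can be worth confirming that both mates of each pair hold the same number of reads. The following is a minimal sketch, not part of the original script: the function names count_reads and check_pair are hypothetical, and it assumes the *_sorted_R1.fastq.gz / *_sorted_R2.fastq.gz naming used in this guide. It only compares record counts; it does not verify that read names match between mates.

```shell
#!/bin/bash
# Count the reads in a gzipped FASTQ file (one record = 4 lines).
count_reads() {
  local lines
  lines=$(gzip -cd "$1" | wc -l)
  echo $(( lines / 4 ))
}

# Return 0 when both mates hold the same number of reads, 1 otherwise.
check_pair() {
  local r1=$1 r2=$2 n1 n2
  n1=$(count_reads "$r1")
  n2=$(count_reads "$r2")
  if [ "$n1" -eq "$n2" ]; then
    echo "OK $r1 $r2 reads=$n1"
  else
    echo "MISMATCH $r1=$n1 $r2=$n2" >&2
    return 1
  fi
}

# Check every pair found in the current directory.
for r1 in *_sorted_R1.fastq.gz; do
  [ -e "$r1" ] || continue   # no matches: the glob stays literal, so skip it
  check_pair "$r1" "${r1%_R1.fastq.gz}_R2.fastq.gz"
done
```

Run it from the folder that holds the FASTQ files; any line starting with MISMATCH points at a pair that should be regenerated.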

...

Code Block
qsub launch_BAM2FASTQ.pbs

Check the submitted job(s):

Code Block
qjobs

Run variant calling using the nextflow nf-core/sarek pipeline

Create a conda environment with nf-core

Code Block
conda create --name nf-core python=3.8 nf-core nextflow
conda activate nf-core

Download the sarek pipeline:

Code Block
nf-core download sarek

To run Sarek, 3 files are required:

launch.pbs → details how to run the workflow
nextflow.config → specifies how to run the workflow on the HPC
samplesheet.csv → provides information on the samples and data to be used (e.g., FASTQ, BAM or CRAM)
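
A quick pre-flight check can catch a missing or empty file before the job is queued. This is a minimal sketch under stated assumptions: the function name check_required_files is hypothetical, and the three file names are the ones listed above, expected in the directory you submit from.

```shell
#!/bin/bash
# Report which of the given files are missing or empty;
# return non-zero if any of them are.
check_required_files() {
  local missing=0 f
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (run from the directory you will submit from):
#   check_required_files launch.pbs nextflow.config samplesheet.csv && qsub launch.pbs
```

Chaining the check with && means qsub is only called when all three files are present.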

Below is an example of a launch.pbs file:

Code Block
#!/bin/bash -l
#PBS -N sarek
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=1:mem=5gb

cd $PBS_O_WORKDIR

#export so that the Java memory settings reach the nextflow process
export NXF_OPTS='-Xms1g -Xmx4g'
module load java

nextflow run nf-core/sarek \
        -r 3.1.1 \
        -profile singularity \
        --genome GATK.GRCh38 \
        --input samplesheet.csv \
        -config nextflow.config

nextflow.config file:

Code Block

// Double quotes are used so that $HOME is expanded; single quotes keep it as literal text.
singularity {
    enabled = true
    autoMounts = true
    cacheDir = "$HOME/NXF_SINGULARITY_CACHEDIR"
}

conda {
    cacheDir = "$HOME/NXF_CONDA_CACHEDIR"
}

// cleanup is a top-level setting, not a process directive
cleanup = false

process {
  executor = 'pbspro'
  beforeScript = {
      """
      source $HOME/.bashrc
      source $HOME/.profile
      """
  }
  scratch = false
}

Example of a samplesheet.csv file:

Code Block

patient,sample,lane,fastq_1,fastq_2
healthy_11,1,1,/path/to/data/NAFLD_exome_sequencing-166537376/1.Healthy/rename/Healthy_Combined_11_sorted_R1.fastq.gz,/path/to/data/NAFLD_exome_sequencing-166537376/1.Healthy/rename/Healthy_Combined_11_sorted_R2.fastq.gz
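
With many samples, the samplesheet can be generated from the FASTQ files rather than typed by hand. The sketch below is one possible approach, not the method used to build the example above: the function name make_samplesheet is hypothetical, it assumes files are named <name>_sorted_R1.fastq.gz and <name>_sorted_R2.fastq.gz, it reuses the file name for both the patient and sample columns, and it fixes lane at 1. Adjust those choices to match your own naming scheme.

```shell
#!/bin/bash
# Print a sarek-style samplesheet for every R1/R2 pair found in a directory.
# Assumption: files are named <name>_sorted_R1.fastq.gz / <name>_sorted_R2.fastq.gz.
make_samplesheet() {
  local dir=$1 r1 r2 name
  echo "patient,sample,lane,fastq_1,fastq_2"
  for r1 in "$dir"/*_sorted_R1.fastq.gz; do
    [ -e "$r1" ] || continue            # unmatched glob stays literal, so skip it
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
    if [ ! -e "$r2" ]; then
      echo "skipping $r1: no matching R2 file" >&2
      continue
    fi
    name=$(basename "$r1" _sorted_R1.fastq.gz)
    # patient and sample both reuse the file name; lane is fixed at 1 (assumptions)
    echo "$name,$name,1,$r1,$r2"
  done
}

# Usage: make_samplesheet /path/to/fastq_dir > samplesheet.csv
```

Pass an absolute directory path so that the fastq_1 and fastq_2 columns contain full paths, as in the example above.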
