This guide provides step-by-step instructions to:
1) convert BAM files (e.g., from publicly available datasets) to paired-end FASTQ files; and
2) run the Nextflow nf-core/sarek variant calling pipeline on whole exome sequencing (WES) datasets.
Pre-requisites:
This user guide assumes that you have already installed conda/Miniconda3 and Nextflow. If you have not done this yet, follow the instructions below:
...
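As a quick reference, a typical installation looks something like the sketch below; it assumes a Linux x86_64 login node, the official Miniconda3 installer, and the Nextflow self-installer, so adapt paths and versions to your HPC (Nextflow also needs Java, which the scripts later in this guide load with module load java).
Code Block |
---|
#download and run the Miniconda3 installer, then follow the prompts
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

#install Nextflow with its self-installer and move it onto your PATH (assumes ~/bin is on your PATH)
curl -s https://get.nextflow.io | bash
mv nextflow ~/bin/ |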
Code Block |
---|
qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=2:mem=4gb |
Create a Python 3.7 environment:
Code Block |
---|
conda create --name liver python=3.7 |
...
To install the required tools, prepare a file called environment.yml (example below). Tip: use a text editor (e.g., vim, nano) to copy and paste the code below into the file.
...
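For reference, a minimal environment.yml for this workflow might look like the sketch below; it assumes only samtools and bedtools from the bioconda channel (the two tools used by the conversion script later in this guide) and does not pin versions.
Code Block |
---|
name: liver
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - samtools
  - bedtools |
The environment created earlier can then be updated from this file with conda env update --name liver --file environment.yml.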
Move to the folder where all the BAM files are located and prepare the following script (e.g., launch_BAM2FASTQ.pbs):
Code Block |
---|
#!/bin/bash -l
#PBS -N BAM2FASTQ
#PBS -l walltime=24:00:00
#PBS -l mem=8gb
#PBS -l ncpus=4

cd $PBS_O_WORKDIR

#activate the conda environment with the necessary tools
conda activate liver

#Sort reads in each BAM file by identifier/name (-n) using 4 CPUs (-@ 4); the sorted output is written to <input>_sorted.bam
for i in `ls --color=never *.bam`
do
echo $i
samtools sort -@ 4 -n -o ${i%%.bam}_sorted.bam $i
done

#Extract paired-end reads in FASTQ format
for file in `ls --color=never *sorted.bam`
do
echo $file
bedtools bamtofastq -i $file -fq ${file%%.bam}_R1.fastq -fq2 ${file%%.bam}_R2.fastq
#compress the FASTQ files for use with the sarek pipeline
gzip -c -9 ${file%%.bam}_R1.fastq > ${file%%.bam}_R1.fastq.gz
gzip -c -9 ${file%%.bam}_R2.fastq > ${file%%.bam}_R2.fastq.gz
done |
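The script can then be submitted to the scheduler and monitored as usual for PBS; once the job finishes, the compressed FASTQ files should sit next to the original BAM files.
Code Block |
---|
#submit the conversion job and check its status
qsub launch_BAM2FASTQ.pbs
qstat -u $USER

#once finished, the paired FASTQ files should be present
ls *_R1.fastq.gz *_R2.fastq.gz |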
...
To run Sarek, three files are required:
launch.pbs → details how to run the workflow
~/.nextflow/config → specifies how the workflow is run on the HPC
samplesheet.csv → provides information on the samples and data to be used (i.e., FASTQ, BAM or CRAM)
...
PHASE I - preprocessing
Below is an example of a launch_phase1.pbs file for mapping the reads to the selected reference genome:
Code Block |
---|
#!/bin/bash -l
#PBS -N sarek_I
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=1:mem=5gb

cd $PBS_O_WORKDIR

export NXF_OPTS='-Xms1g -Xmx4g'
module load java

#specify the nextflow version to use to run the workflow
export NXF_VER=22.06.1-edge

#run the preprocessing tasks
nextflow run nf-core/sarek \
-r 3.1.1 \
-profile singularity \
--genome GATK.GRCh38 \
--input samplesheet.csv |
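Submit the job from the directory containing samplesheet.csv. Note that, depending on the Sarek/nf-core version, the pipeline may also require an explicit output directory via --outdir; if the run stops at parameter validation, add that option. A minimal submission sketch:
Code Block |
---|
#submit the preprocessing run and follow its progress
qsub launch_phase1.pbs
qstat -u $USER

#Nextflow writes its log in the launch directory
tail -f .nextflow.log |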
...
Code Block |
---|
singularity {
    enabled = true
    autoMounts = true
    cacheDir = '$HOME/NXF_SINGULARITY_CACHEDIR'
}

conda {
    cacheDir = '$HOME/NXF_CONDA_CACHEDIR'
}

process {
    executor = 'pbspro'
    beforeScript = 'source $HOME/.bashrc; source $HOME/.profile'
    scratch = false
    cleanup = false
} |
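As a reminder, this configuration is the ~/.nextflow/config file listed above; one way to create or edit it (nano is only an example editor):
Code Block |
---|
#create the Nextflow configuration directory and edit the config file
mkdir -p ~/.nextflow
nano ~/.nextflow/config |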
Example of a samplesheet.csv file:
Code Block |
---|
patient,sample,lane,fastq_1,fastq_2
healthy_11,1,1,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R1.fastq.gz,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R2.fastq.gz |
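Additional patients, samples or lanes are added as extra rows with the same columns; the rows below use hypothetical patient names and file paths purely for illustration.
Code Block |
---|
patient,sample,lane,fastq_1,fastq_2
healthy_11,1,1,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R1.fastq.gz,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R2.fastq.gz
healthy_12,1,1,/path/to/data/1.Healthy/Healthy_Combined_12_sorted_R1.fastq.gz,/path/to/data/1.Healthy/Healthy_Combined_12_sorted_R2.fastq.gz
disease_01,1,1,/path/to/data/2.Disease/Disease_Combined_01_sorted_R1.fastq.gz,/path/to/data/2.Disease/Disease_Combined_01_sorted_R2.fastq.gz |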
...
Prepare/edit the following launch_phase2.pbs script:
Code Block |
---|
#!/bin/bash -l
#PBS -N sarek_II
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=1:mem=5gb

cd $PBS_O_WORKDIR

export NXF_OPTS='-Xms1g -Xmx4g'
module load java

#specify the nextflow version to use to run the workflow
export NXF_VER=22.06.1-edge

#run the sarek pipeline
nextflow run nf-core/sarek \
-r 3.1.1 \
-profile singularity \
--genome GATK.GRCh38 \
--step variant_calling \
--tools haplotypecaller \
--wes \
-resume |
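Because this step uses -resume, submit it from the same directory in which Phase I was run so that Nextflow can find the existing work/ cache and the preprocessing results:
Code Block |
---|
#move to the directory used for Phase I (hypothetical path, replace with your own)
cd /path/to/phase1/run/directory

#submit the variant calling step
qsub launch_phase2.pbs |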
...
Prepare/edit the following launch_phase3.pbs script:
Code Block |
---|
#!/bin/bash -l
#PBS -N sarek_III
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=1:mem=5gb
cd $PBS_O_WORKDIR
export NXF_OPTS='-Xms1g -Xmx4g'
module load java
#specify the nextflow version to use to run the workflow
export NXF_VER=22.06.1-edge
#run the sarek pipeline
nextflow run nf-core/sarek \
-r 3.1.1 \
-profile singularity \
--genome GATK.GRCh38 \
--step annotate \
--tools vep,snpeff \
--wes \
-resume |
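Phase III is submitted in the same way from the same run directory; once it completes, the annotated VCFs and the execution reports can be inspected under the run's output directory (the exact layout depends on the Sarek version and the chosen output directory name, assumed here to be results).
Code Block |
---|
#submit the annotation step from the same run directory
qsub launch_phase3.pbs

#after completion, inspect the outputs (assumes the output directory is called 'results')
ls results/
ls results/pipeline_info/ |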
...