This guide provides a step-by-step guide to:
1) convert BAM files (i.e., public) to paired-end FASTQ files; and
2) run the nextflow nf-core/sarek variant calling pipeline for whole exome sequencing (WES) datasets.
Pre-requisites:
This user guide assumes that you have already installed conda/miniconda3 and nextflow. If you have not done this yet, follow the instructions below:
Conda installation: https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
Code Block wget https://repo.anaconda.com/miniconda/Miniconda3-py37_22.11.1-1-Linux-x86_64.sh bash Miniconda3-py37_22.11.1-1-Linux-x86_64.sh
nextflow installation: Nextflow
Create a Conda environment with tools needed for downstream
...
analyses
Log into the HPC using your user credentials. To install tools, we will use an interactive PBS session:
Code Block |
---|
qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=2:mem=4gb |
Create a python Python 3.7 environment:
Code Block |
---|
conda create --name liver python=3.7 |
...
Code Block |
---|
conda activate liver |
Prepare copy and paste the code below into Then you can install one tool at a time or provide a list (see below); however, installing a list of tools may take a while.
We will use samtools and bcftools for the first part of the tutorial. Run the following commands:
Code Block |
---|
conda install -c bioconda samtools |
Code Block |
---|
conda install -c bioconda bedtools |
To install several tools, prepare a file called environment.yml - (example below). Tip: use a text editor (i.e., vim, nano, or other) to copy and paste the code below into the file.
Code Block |
---|
channels: - bioconda - conda-forge dependencies: - bedtoolsseqkit - samtoolsvcftools - seqkitbcftools - vcftools - emboss |
Run the following command to install additional tools
Code Block |
---|
conda env update --file environment.yml |
Convert BAM to FASTQ
STEP1: install bedtools if this is not yet availableTo deactivate the conda environment, run:
Code Block |
---|
conda install -c bioconda bedtools |
...
deactivate |
Convert BAM to FASTQ
Move to the folder where all the BAM files are present and run prepare the following PBS scriptscript (i.e.,launch_BAM2FASTQ.pbs):
Code Block |
---|
#!/bin/bash -l #PBS -N BAM2FASTQ #PBS -l walltime=1224:00:00 #PBS -l mem=8gb #PBS -l ncpus=4 cd $PBS_O_WORKDIR #activate the conda environment with the necessary tools conda activate soapdenovoliver #sort#Sort reads in BAM file by indentifier-name (-n) using 4 CPUs (-@ 4). Note 'prefix' for sorted file noted after $i (input BAM file) for i in `ls --color=never *.bam` do echo $i samtools sort -@ 4 -n $i input${i%%.bam input}_sorted done #extract #Extract paired end reads in FASTQ format for file in `ls --color=never *sorted.bam` do echo $file bedtools bamtofastq -i input_sorted.bam -fq output_r1.fastq -fq2 output_r2.fastq $file -fq ${file%%.bam}_R1.fastq -fq2 ${file%%.bam}_R2.fastq #compress FASTQ files to run using the sarek pipeline gzip -c -9 ${file%%.bam}_R1.fastq > ${file%%.bam}_R1.fastq.gz gzip -c -9 ${file%%.bam}_R1.fastq > ${file%%.bam}_R2.fastq.gz done |
Submit the job to the PBS scheduler:
Code Block |
---|
qsub launch_BAM2FASTQ.pbs |
Check the submitted job(s):
Code Block |
---|
qjobs |
Run variant calling using the nextflow nf-core/sarek pipeline
source: https://nf-co.re/sarek/3.1.2
To run Sarek 3 files are required:
launch.pbs → details on how to run the workflow
~/.nextflow/config → specify how to run the workflow in the HPC
samplesheet.csv → provides information on the samples and data to be used (i.e., FASTQ, BAM or CRAM)
We will run the Sarek pipeline in three phases:
Phase I: Preprocessing, mapping, markduplicates, recalibrate
Phase II: Variant calling
Phase III: Annotation
PHASE I - preprocessing
Below is an example of a launch_phase1.pbs file for mapping onto the selected genome:
Code Block |
---|
#!/bin/bash -l
#PBS -N sarek_I
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=1:mem=5gb
cd $PBS_O_WORKDIR
NXF_OPTS='-Xms1g -Xmx4g'
module load java
#specify the nextflow version to use to run the workflow
export NXF_VER=22.06.1-edge
#run the prepocessing tasks
nextflow run nf-core/sarek \
-r 3.1.1 \
-profile singularity \
--genome GATK.GRCh38 \
--input samplesheet.csv |
~/.nextflow/config file: (Note: You may already have this file if you installed Nextflow using this guide )
Code Block |
---|
singularity {
cacheDir = '$HOME/NXF_SINGULARITY_CACHEDIR'
autoMounts = true
}
conda {
cacheDir = '$HOME/NXF_CONDA_CACHEDIR'
}
singularity {
enabled = true
autoMounts = true
}
process {
executor = 'pbspro'
beforeScript = {
"""
source $HOME/.bashrc
source $HOME/.profile
"""
}
scratch = false
cleanup = false
} |
Example of a samplesheet.csv file:
Code Block |
---|
patient,sample,lane,fastq_1,fastq_2
healthy_11,1,1,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R1.fastq.gz,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R2.fastq.gz |
Prepare a samplesheet.csv file that contains the information of all the samples to be processed. Once ready, submit the job to the PBS scheduler:
Code Block |
---|
qsub launch_phase1.pbs |
PHASE II - variant calling
Prepare/edit the following launch_phase2.pbs script:
Code Block |
---|
#!/bin/bash -l
#PBS -N sarek_II
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=1:mem=5gb
cd $PBS_O_WORKDIR
NXF_OPTS='-Xms1g -Xmx4g'
module load java
#specify the nextflow version to use to run the workflow
export NXF_VER=22.06.1-edge
#run the sarek pipeline
nextflow run nf-core/sarek \
-r 3.1.1 \
-profile singularity \
--genome GATK.GRCh38 \
--step variant_calling \
--tools haplotypecaller \
--wes \
-resume |
Note: Sarek will automatically detect the input for the variant calling phase based on the results from the phase 1 outputs (i.e., results/csv/recalibrate.csv)
Submit the job to the PBS scheduler:
Code Block |
---|
qsub launch_phase2.pbs |
monitor the progress on the HPC:
Code Block |
---|
qjobs |
Alternatively, view the progress of the submitted job on the Nextflow Tower.
PHASE III - annotation
Download the singularity container for VEP in your catched Nextflow Singularity folder
Code Block |
---|
singularity pull --name nfcore-vep-106.1.GRCh38.img docker://nfcore/vep:106.1.GRCh38 |
Prepare/edit the following launch_phase3.pbs script:
Code Block |
---|
#!/bin/bash -l
#PBS -N sarek_III
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=1:mem=5gb
cd $PBS_O_WORKDIR
NXF_OPTS='-Xms1g -Xmx4g'
module load java
#specify the nextflow version to use to run the workflow
export NXF_VER=22.06.1-edge
#run the sarek pipeline
nextflow run nf-core/sarek \
-r 3.1.1 \
-profile singularity \
--genome GATK.GRCh38 \
--step annotate \
--tools vep,snpeff \
--wes \
-resume |
Similarly to Phase 2, the Sarek pipeline will automatically detect the VCF input file for running annotation using the selected tool(s).
Submit the job to the PBS scheduler:
Code Block |
---|
qsub launch_phase3.pbs |
monitor the progress on the HPC:
Code Block |
---|
qjobs |
Alternatively, view the progress of the submitted job on the Nextflow Tower.