Introduction
Variant calling entails identifying single nucleotide polymorphisms (SNPs) and small insertions and deletion (indels) from next generation sequencing data. This tutorial will cover SNP and Indel detection in germline cells. Other more complex rearrangements (such as Copy Number Variations) require additional analysis not covered in this tutorial.
Public data
https://www.genome.gov/10001688/international-hapmap-project
Reference HapMap Trio:
Sample ID | Description | Biological sample source |
---|---|---|
NA12878 (Daughter) | Mother; donor subject has a single bp (G-to-A) transition at nucleotide 681 in exon 5 of the CYP2C19 gene (CYP2C19*2) which creates an aberrant splice site. The change altered the reading frame of the mRNA starting with amino acid 215 and produced a premature stop codon 20 amino acids downstream, resulting in a truncated, nonfunctional protein. Because of the aberrant splice site, a 40-bp deletion occurred at the beginning of exon 5 (from bp 643 to bp 682), resulting in deletion of amino acids 215 to 227. The truncated protein had 234 amino acids and would be catalytically inactive because it lacked the heme-binding region. | https://catalog.coriell.org/0/Sections/Search/Sample_Detail.aspx?Ref=NA12878&Product=DNA |
NA12891 (Father) | Maternal Grandfather; donor subject is homozygous for a single bp (G-to-A) transition at nucleotide 681 in exon 5 of the CYP2C19 gene (CYP2C19*2) which creates an aberrant splice site. The change altered the reading frame of the mRNA starting with amino acid 215 and produced a premature stop codon 20 amino acids downstream, resulting in a truncated, nonfunctional protein. Because of the aberrant splice site, a 40-bp deletion occurred at the beginning of exon 5 (from bp 643 to bp 682), resulting in deletion of amino acids 215 to 227. The truncated protein had 234 amino acids and would be catalytically inactive because it lacked the heme-binding region. | https://catalog.coriell.org/0/Sections/Search/Sample_Detail.aspx?Ref=NA12891&Product=DNA |
NA12892 (Mother) | Maternal Grandmother | https://catalog.coriell.org/0/Sections/Search/Sample_Detail.aspx?Ref=NA12892&Product=DNA |
https://www.nist.gov/programs-projects/genome-bottle
WGS = Whole Genome Sequencing
WES = Whole Exome Sequencing
Overview of the Nextflow nf-core/sarek variant calling pipeline
Source : https://nf-co.re/sarek/3.2.3
This guide provides a step-by-step guide to:
1) convert BAM files (i.e., public) to paired-end FASTQ files; and
2) run the nextflow nf-core/sarek variant calling pipeline for whole exome sequencing (WES) datasets.
Pre-requisites:
This user guide assumes that you have already installed conda/miniconda3 and nextflow. If you have not done this yet, follow the instructions below:
Conda installation: https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
wget https://repo.anaconda.com/miniconda/Miniconda3-py37_22.11.1-1-Linux-x86_64.sh bash Miniconda3-py37_22.11.1-1-Linux-x86_64.sh
nextflow installation: NextFlow quick start
Create a Conda environment with tools needed for downstream analyses
Log into the HPC using your user credentials. To install tools, we will use an interactive PBS session:
qsub -I -S /bin/bash -l walltime=10:00:00 -l select=1:ncpus=2:mem=4gb
Create a Python 3.7 environment:
conda create --name liver python=3.7
Activate the conda environment:
conda activate liver
Then you can install one tool at a time or provide a list (see below); however, installing a list of tools may take a while.
We will use samtools and bcftools for the first part of the tutorial. Run the following commands:
conda install -c bioconda samtools
conda install -c bioconda bedtools
To install several tools, prepare a file called environment.yml (example below). Tip: use a text editor (i.e., vim, nano, or other) to copy and paste the code below into the file.
channels: - bioconda - conda-forge dependencies: - seqkit - vcftools - bcftools - emboss
Run the following command to install additional tools
conda env update --file environment.yml
To deactivate the conda environment, run:
conda deactivate
Convert BAM to FASTQ
Move to the folder where all the BAM files are present and prepare the following script (i.e., launch_BAM2FASTQ.pbs):
#!/bin/bash -l #PBS -N BAM2FASTQ #PBS -l walltime=24:00:00 #PBS -l mem=8gb #PBS -l ncpus=4 cd $PBS_O_WORKDIR #activate the conda environment with the necessary tools conda activate liver #Sort reads in BAM file by indentifier-name (-n) using 4 CPUs (-@ 4). Note 'prefix' for sorted file noted after $i (input BAM file) for i in `ls --color=never *.bam` do echo $i samtools sort -@ 4 -n $i ${i%%.bam}_sorted done #Extract paired end reads in FASTQ format for file in `ls --color=never *sorted.bam` do echo $file bedtools bamtofastq -i $file -fq ${file%%.bam}_R1.fastq -fq2 ${file%%.bam}_R2.fastq #compress FASTQ files to run using the sarek pipeline gzip -c -9 ${file%%.bam}_R1.fastq > ${file%%.bam}_R1.fastq.gz gzip -c -9 ${file%%.bam}_R1.fastq > ${file%%.bam}_R2.fastq.gz done
Submit the job to the PBS scheduler:
qsub launch_BAM2FASTQ.pbs
Check the submitted job(s):
qjobs
Run variant calling using the nextflow nf-core/sarek pipeline
source: https://nf-co.re/sarek/3.1.2
To run Sarek 3 files are required:
launch.pbs → details on how to run the workflow
~/.nextflow/config → specify how to run the workflow in the HPC
samplesheet.csv → provides information on the samples and data to be used (i.e., FASTQ, BAM or CRAM)
We will run the Sarek pipeline in three phases:
Phase I: Preprocessing, mapping, markduplicates, recalibrate
Phase II: Variant calling
Phase III: Annotation
PHASE I - preprocessing
Below is an example of a launch_phase1.pbs file for mapping onto the selected genome:
#!/bin/bash -l #PBS -N sarek_I #PBS -l walltime=48:00:00 #PBS -l select=1:ncpus=1:mem=5gb cd $PBS_O_WORKDIR NXF_OPTS='-Xms1g -Xmx4g' module load java #specify the nextflow version to use to run the workflow export NXF_VER=22.06.1-edge #run the prepocessing tasks nextflow run nf-core/sarek \ -r 3.1.1 \ -profile singularity \ --genome GATK.GRCh38 \ --input samplesheet.csv
~/.nextflow/config file: (Note: You may already have this file if you installed Nextflow using this guide )
singularity { cacheDir = '$HOME/NXF_SINGULARITY_CACHEDIR' autoMounts = true } conda { cacheDir = '$HOME/NXF_CONDA_CACHEDIR' } singularity { enabled = true autoMounts = true } process { executor = 'pbspro' beforeScript = { """ source $HOME/.bashrc source $HOME/.profile """ } scratch = false cleanup = false }
Example of a samplesheet.csv file:
patient,sample,lane,fastq_1,fastq_2 healthy_11,1,1,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R1.fastq.gz,/path/to/data/1.Healthy/Healthy_Combined_11_sorted_R2.fastq.gz
Prepare a samplesheet.csv file that contains the information of all the samples to be processed. Once ready, submit the job to the PBS scheduler:
qsub launch_phase1.pbs
PHASE II - variant calling
Prepare/edit the following launch_phase2.pbs script:
#!/bin/bash -l #PBS -N sarek_II #PBS -l walltime=48:00:00 #PBS -l select=1:ncpus=1:mem=5gb cd $PBS_O_WORKDIR NXF_OPTS='-Xms1g -Xmx4g' module load java #specify the nextflow version to use to run the workflow export NXF_VER=22.06.1-edge #run the sarek pipeline nextflow run nf-core/sarek \ -r 3.1.1 \ -profile singularity \ --genome GATK.GRCh38 \ --step variant_calling \ --tools haplotypecaller \ --wes \ -resume
Note: Sarek will automatically detect the input for the variant calling phase based on the results from the phase 1 outputs (i.e., results/csv/recalibrate.csv)
Submit the job to the PBS scheduler:
qsub launch_phase2.pbs
monitor the progress on the HPC:
qjobs
Alternatively, view the progress of the submitted job on the Nextflow Tower.
PHASE III - annotation
Download the singularity container for VEP in your catched Nextflow Singularity folder
singularity pull --name nfcore-vep-106.1.GRCh38.img docker://nfcore/vep:106.1.GRCh38
Prepare/edit the following launch_phase3.pbs script:
#!/bin/bash -l #PBS -N sarek_III #PBS -l walltime=48:00:00 #PBS -l select=1:ncpus=1:mem=5gb cd $PBS_O_WORKDIR NXF_OPTS='-Xms1g -Xmx4g' module load java #specify the nextflow version to use to run the workflow export NXF_VER=22.06.1-edge #run the sarek pipeline nextflow run nf-core/sarek \ -r 3.1.1 \ -profile singularity \ --genome GATK.GRCh38 \ --step annotate \ --tools vep,snpeff \ --wes \ -resume
Similarly to Phase 2, the Sarek pipeline will automatically detect the VCF input file for running annotation using the selected tool(s).
Submit the job to the PBS scheduler:
qsub launch_phase3.pbs
monitor the progress on the HPC:
qjobs
Alternatively, view the progress of the submitted job on the Nextflow Tower.