MAF - adding allele frequency to VCF files

Goals

  • Individual-level MAF: Assign minor allele frequency (MAF) information to a VCF file generated using the nf-core/sarek pipeline

  • Population-level MAF: Add population-based MAF to annotated variants using public databases such as MafDb.gnomAD and dbSNP

VCF files generated by the Nextflow nf-core/sarek pipeline

In the sarek pipeline, for example, VCF files generated by HaplotypeCaller can be found at:

results/VariantCalling/sampleID/HaplotypeCaller/HaplotypeCaller_sampleID.vcf

Annotated VCF files using Variant Effect Predictor (VEP) can be found at:

results/Annotation/sampleID/VEP/HaplotypeCaller_sampleID_VEP.ann.vcf

Alternatively, if annotated using snpEff:

results/Annotation/sampleID/snpEff/HaplotypeCaller_sampleID_snpEff.ann.vcf

Example of a VCF file annotated using snpEff. Note that the metadata header information is not shown.

Individual-level MAF: adding MAF information using bcftools

At the individual level, the expected allele frequency of an alternative (minor) allele is 1.0 (homozygous), 0.5 (heterozygous), or 0.0 (absent in the patient) when screening against a reference set of genetic variants.
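As a small illustration of where the values 1.0, 0.5 and 0.0 come from, the following Python sketch derives the alternative-allele fraction from a VCF genotype (GT) string. The function name `individual_alt_frequency` is hypothetical and is not part of sarek or bcftools; it assumes a standard GT field with alleles separated by `/` or `|`.

```python
import re


def individual_alt_frequency(genotype, alt_index=1):
    """Fraction of called alleles in one GT string that equal alt_index.

    Returns None when no allele is called (for example './.').
    """
    # GT alleles are separated by '/' (unphased) or '|' (phased);
    # '.' marks a missing call and is ignored.
    alleles = [a for a in re.split(r"[/|]", genotype) if a != "."]
    if not alleles:
        return None
    matches = sum(1 for a in alleles if int(a) == alt_index)
    return matches / len(alleles)


for gt in ("1/1", "0/1", "0/0"):
    print(gt, individual_alt_frequency(gt))
# 1/1 1.0
# 0/1 0.5
# 0/0 0.0
```

For a diploid sample these are the only possible values for a biallelic site, which is why the individual-level frequency is restricted to 1.0, 0.5 or 0.0.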

The following command line adds MAF information to either annotated or non-annotated VCF files.

For the example shown above, the MAF information was added to the VCF annotated using snpEff.
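To make the annotation step more concrete, here is a minimal pure-Python sketch of what adding a per-record frequency tag to a single-sample VCF amounts to. It is not the bcftools command itself (bcftools ships a fill-tags plugin for this kind of task). The function names, the tag description, and the example records are all made up for illustration, and the sketch assumes one sample column with a GT entry in FORMAT.

```python
import re

# Illustrative header line for the new INFO tag (wording is an assumption).
MAF_HEADER = ('##INFO=<ID=MAF,Number=A,Type=Float,'
              'Description="Per-sample ALT allele fraction (illustrative)">')


def alt_fraction(genotype, alt_index):
    """Fraction of called alleles in a GT string equal to alt_index."""
    alleles = [a for a in re.split(r"[/|]", genotype) if a != "."]
    if not alleles:
        return None
    return sum(1 for a in alleles if int(a) == alt_index) / len(alleles)


def add_maf(lines):
    """Return VCF lines with a MAF INFO tag appended to every record."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            out.append(line)                 # metadata passes through
        elif line.startswith("#CHROM"):
            out.append(MAF_HEADER)           # declare the new tag
            out.append(line)
        else:
            fields = line.split("\t")
            n_alt = len(fields[4].split(","))            # ALT column
            gt_pos = fields[8].split(":").index("GT")    # FORMAT column
            genotype = fields[9].split(":")[gt_pos]      # first sample
            values = [alt_fraction(genotype, i) for i in range(1, n_alt + 1)]
            tag = "MAF=" + ",".join(
                "." if v is None else f"{v:g}" for v in values)
            fields[7] = tag if fields[7] == "." else fields[7] + ";" + tag
            out.append("\t".join(fields))
    return out


# Made-up records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleID",
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30\tGT:DP\t0/1:30",
    "chr1\t22345\t.\tC\tT\t60\tPASS\t.\tGT:DP\t1/1:28",
]
annotated = add_maf(example)
for row in annotated:
    print(row)
```

In the output, the heterozygous record gains `MAF=0.5` and the homozygous record gains `MAF=1`, matching the individual-level values described earlier. For real data, a dedicated tool such as bcftools is preferable, since it handles compressed input, multiple samples, and header validation.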

Adding population-level MAF

We will explore two methods to achieve this. One is an R-based approach (see below); the other uses the vcf2maf tool (https://github.com/mskcc/vcf2maf) to assign allele frequencies from reference population-level MAF resources such as MafDb.gnomAD or MafDb.ExAC when screening against a set of known variants in dbSNP.

Method 1: using R to collect population-level MAF information

Requirements:

Open an RStudio session and first install Bioconductor:

Then install the MafDb.gnomAD package for the human GRCh38 genome assembly:

Next, install the SNPlocs.Hsapiens.dbSNP150.GRCh38 package. Note that the latest version (151) is approximately 3.8 GB in size and can take a while to download. For this exercise, use version 150 (total size 2 GB).

Create a mafdb by loading the installed MafDb.gnomAD package:

Look up the MAF information for a known variant of interest, for example rs1129038:

To look up the population MAF information for several variants of interest, create an R vector, for example:

then re-run STEP5 above as follows:

Export the MAF information to a data frame, rename the column header “AF” (allele frequency) to “MAF”, and save it to a file:
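The rename-and-export step is independent of how the frequencies were retrieved, so it can also be sketched outside R. The Python example below is a stand-in for illustration, not the R code of this tutorial. It takes a table of rows, renames the `AF` column to `MAF`, and writes tab-separated output. The function name, the `rsid` column, the identifiers, and the frequency values are all hypothetical placeholders.

```python
import csv
import io


def write_maf_table(rows, out_handle):
    """Write rows (a list of dicts) as TSV, renaming column 'AF' to 'MAF'."""
    if not rows:
        return
    renamed = [
        {("MAF" if key == "AF" else key): value for key, value in row.items()}
        for row in rows
    ]
    writer = csv.DictWriter(
        out_handle,
        fieldnames=list(renamed[0]),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(renamed)


# Placeholder identifiers and values, not real gnomAD frequencies.
rows = [
    {"rsid": "rs_example_1", "AF": 0.1},
    {"rsid": "rs_example_2", "AF": 0.02},
]

buffer = io.StringIO()
write_maf_table(rows, buffer)
print(buffer.getvalue(), end="")
```

To save to disk instead of printing, pass a file handle such as `open("maf_table.tsv", "w", newline="")` in place of the in-memory buffer; the file name here is also just an example.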