MAF - adding allele frequency to VCF files
Goals
Individual-level MAF: Assign minor allele frequency (MAF) information to a VCF file generated using the nf-core/sarek pipeline
Population-level MAF: Add population-based MAF to annotated variants using public databases such as MafDb.gnomAD and dbSNP
VCF files generated by the Nextflow nf-core/sarek pipeline
In the sarek pipeline, for example, VCF files generated by HaplotypeCaller can be found at:
results/VariantCalling/sampleID/HaplotypeCaller/HaplotypeCaller_sampleID.vcf
Annotated VCF files using Variant Effect Predictor (VEP) can be found at:
results/Annotation/sampleID/VEP/HaplotypeCaller_sampleID_VEP.ann.vcf
Alternatively, if annotated using snpEff:
results/Annotation/sampleID/snpEff/HaplotypeCaller_sampleID_snpEff.ann.vcf
Example of VCF file annotated using snpEff. Note the metadata header information is not shown.
Individual-level MAF: adding MAF information using bcftools
At the individual level, the expected allele frequency of an alternative (minor) allele is 1.0 (homozygous), 0.5 (heterozygous) or 0.0 (absent in the patient) when screening against a reference set of genetic variants.
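As a small worked illustration of this arithmetic, the following base R sketch derives the alternate-allele frequency of a single diploid sample from its genotype (GT) string. The helper name `gt_to_af` is hypothetical, and the sketch simply counts non-reference alleles, so it does not distinguish between different alternate alleles at multiallelic sites.

```r
# Hypothetical helper: alternate-allele frequency of one diploid sample,
# computed from a VCF genotype (GT) string such as "0/1" or "1|1".
gt_to_af <- function(gt) {
  alleles <- strsplit(gt, "[/|]")[[1]]   # split unphased (/) or phased (|) calls
  alleles <- alleles[alleles != "."]     # drop missing alleles
  if (length(alleles) == 0) return(NA_real_)
  mean(alleles != "0")                   # fraction of non-reference alleles
}

# Homozygous reference, heterozygous and homozygous alternate genotypes
genotypes <- c(hom_ref = "0/0", het = "0/1", hom_alt = "1/1")
sapply(genotypes, gt_to_af)
```

The three genotypes give 0.0, 0.5 and 1.0 respectively, matching the values listed above.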
The following command line adds MAF information to either annotated or non-annotated VCF files.
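One way to do this is with the bcftools `fill-tags` plugin, which computes INFO tags such as AF from the genotypes in the file. The sketch below is an assumption about the command, not necessarily the one originally used here. The input path follows the sarek layout shown earlier and the output file name is hypothetical. It is wrapped in R so that it can be run from the same RStudio session used later, and it prints the equivalent command line.

```r
# Assumed file names: the input follows the sarek layout shown earlier,
# the output name is hypothetical.
in_vcf  <- "results/Annotation/sampleID/snpEff/HaplotypeCaller_sampleID_snpEff.ann.vcf"
out_vcf <- "HaplotypeCaller_sampleID_snpEff.ann.AF.vcf"

# Equivalent shell command:
#   bcftools +fill-tags <in_vcf> -Ov -o <out_vcf> -- -t AF
# Arguments after "--" are passed to the plugin; "-t AF" requests only the AF tag.
args <- c("+fill-tags", in_vcf, "-Ov", "-o", out_vcf, "--", "-t", "AF")
cat("bcftools", args, "\n")

# Run the command only if bcftools is on the PATH and the input file exists.
if (nzchar(Sys.which("bcftools")) && file.exists(in_vcf)) {
  system2("bcftools", args)
}
```

The same command works on a non-annotated VCF by pointing `in_vcf` at the HaplotypeCaller output instead.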
For the example shown above, look for the MAF information added to the VCF annotated using snpEff.
Adding population-level MAF
We will explore two methods to achieve this: an R-based approach (see below), and the vcf2maf tool (https://github.com/mskcc/vcf2maf), which assigns allele frequencies from population-level reference MAF resources such as Maf.gnomAD or Maf.ExAC when variants are screened against a set of known variants in dbSNP.
Method 1: using R to collect population-level MAF information
Requirements:
Install RStudio https://www.rstudio.com/products/rstudio/download/
R version 4.1 (Mac: https://www.youtube.com/watch?v=Vy-lEkJB3cA ; Windows: https://www.youtube.com/watch?v=0jlMXPMoiOg )
Open an RStudio session and first install Bioconductor:
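A common way to set up Bioconductor is through the BiocManager package from CRAN. A minimal sketch:

```r
# Install the BiocManager package from CRAN if it is not already available.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
```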
Then install the MafDb.gnomAD package for the human GRCh38 genome assembly:
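The text does not name the exact gnomAD release, so the package name below is an assumption: `MafDb.gnomAD.r2.1.GRCh38` is one Bioconductor annotation package providing gnomAD frequencies for GRCh38. Substitute another release if that is the one intended.

```r
# Package name is an assumption: gnomAD r2.1 allele frequencies for GRCh38.
BiocManager::install("MafDb.gnomAD.r2.1.GRCh38")
```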
Next, install the SNPlocs.Hsapiens.dbSNP150.GRCh38 package. Note that the latest version, 151, is approx. 3.8 GB in size and can take a while to download. For this exercise, use version 150 (total size 2 GB).
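Using BiocManager, the installation of the version 150 package named above would look like this:

```r
# dbSNP build 150 SNP locations for GRCh38 (large download).
BiocManager::install("SNPlocs.Hsapiens.dbSNP150.GRCh38")
```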
Create a mafdb by loading the installed MafDb.gnomAD package:
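A sketch of this step, assuming the installed package is `MafDb.gnomAD.r2.1.GRCh38` (the exact release is not stated in the text). Loading the package exposes an annotation object of the same name, which is assigned to `mafdb` here:

```r
# Package name is an assumption; use the MafDb.gnomAD release you installed.
library(MafDb.gnomAD.r2.1.GRCh38)

mafdb <- MafDb.gnomAD.r2.1.GRCh38
mafdb               # prints a summary of the annotation object
populations(mafdb)  # lists the populations for which frequencies are stored
```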
Look up the MAF information for a known variant of interest, for example rs1129038:
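A minimal sketch of the lookup, assuming the `MafDb.gnomAD.r2.1.GRCh38` package (an assumed release) and the SNPlocs package named above. The rs identifier is first mapped to its genomic position with `snpsById()`, and that position is then passed to `gscores()`:

```r
library(MafDb.gnomAD.r2.1.GRCh38)          # assumed gnomAD release
library(SNPlocs.Hsapiens.dbSNP150.GRCh38)

mafdb <- MafDb.gnomAD.r2.1.GRCh38
snpdb <- SNPlocs.Hsapiens.dbSNP150.GRCh38

# Map the rs identifier to its GRCh38 position.
rng <- snpsById(snpdb, ids = "rs1129038")

# Use the same chromosome naming style in both objects before querying.
# If gscores() still reports a sequence mismatch, the sequence information
# of rng may need further harmonizing with that of mafdb.
seqlevelsStyle(rng) <- seqlevelsStyle(mafdb)[1]

# Retrieve the allele frequency stored for that position.
gscores(mafdb, rng)
```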
To look up the population MAF information for several variants of interest, create an R vector, for example:
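For instance, a character vector of rs identifiers could be built as follows. The identifiers are illustrative: rs1129038 is the variant mentioned above, and rs12913832 is added here only as a second example.

```r
# Illustrative identifiers; replace with your own variants of interest.
rs_ids <- c("rs1129038", "rs12913832")
rs_ids
```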
Then re-run the lookup step above (STEP5) with this vector, as follows:
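A sketch of the lookup for several variants at once, under the same assumptions as before (the `MafDb.gnomAD.r2.1.GRCh38` release is assumed). The illustrative `rs_ids` vector is defined here so that the snippet runs on its own:

```r
library(MafDb.gnomAD.r2.1.GRCh38)          # assumed gnomAD release
library(SNPlocs.Hsapiens.dbSNP150.GRCh38)

mafdb <- MafDb.gnomAD.r2.1.GRCh38
snpdb <- SNPlocs.Hsapiens.dbSNP150.GRCh38

# Illustrative identifiers; replace with your own variants of interest.
rs_ids <- c("rs1129038", "rs12913832")

# Map all identifiers to positions; identifiers missing from dbSNP 150 are dropped.
rng <- snpsById(snpdb, ids = rs_ids, ifnotfound = "drop")
seqlevelsStyle(rng) <- seqlevelsStyle(mafdb)[1]

# One row per variant, with the allele frequency in an extra column.
maf_scores <- gscores(mafdb, rng)
maf_scores
```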
Export the MAF information to a data frame, rename the column header “AF” (allele frequency) to “MAF”, and save it to a file:
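One possible way to do this is sketched below. The helper name `export_maf` and the output file name are hypothetical, and the toy data frame only stands in for a real `gscores()` result: its identifiers and AF values are placeholders, not real frequencies.

```r
# Hypothetical helper: convert the scores to a data frame, rename the
# "AF" column to "MAF" and write the result as a tab-separated file.
export_maf <- function(scores, file) {
  maf_df <- as.data.frame(scores)
  names(maf_df)[names(maf_df) == "AF"] <- "MAF"
  write.table(maf_df, file = file, sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(maf_df)
}

# Toy input with placeholder values, used only to show the call.
toy <- data.frame(RefSNP_id = c("rsA", "rsB"), AF = c(0.1, 0.2))
out_file <- file.path(tempdir(), "population_maf.tsv")
maf_df <- export_maf(toy, out_file)
names(maf_df)
```

With real data, the object returned by `gscores()` would be passed in place of `toy`, for example `export_maf(gscores(mafdb, rng), "population_maf.tsv")`.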