ONTvariants - epi2me WF-Human Variation pipeline

For this exercise we will use the epi2me-labs/wf-human-variation pipeline. Find information on the pipeline at https://labs.epi2me.io/workflows/wf-human-variation/

The pipeline is designed to analyse human genomic data. Specifically the workflow can perform the following:

diploid variant calling
structural variant calling
analysis of modified base calls
copy number variant calling
short tandem repeat (STR) expansion genotyping

Compute requirements

Recommended requirements:

CPUs = 32
Memory = 128GB

Minimum requirements:

CPUs = 16
Memory = 32GB

Approximate run time: Variable depending on whether it is targeted sequencing or whole genome sequencing, as well as coverage and the individual analyses requested. For instance, a 90X human sample run (options: --snp --sv --mod --str --cnv --phased --sex male) takes less than 8h with recommended resources.

NOTE: in contrast to the nf-core/sarek pipeline that we used in session 2, the epi2me-labs/wf-human-variation pipeline runs in ‘local’ mode (needs large amount of CPUs and RAM memory), while the nf-core/sarek pipeline will use a ‘pbspro’ mode, where the pipeline will submit individual jobs to the HPC cluster and define the CPUs and memory for each task individually.

epi2me-labs/wf-human-variation

Nextflow pipelines have the --help option to print the parameter options that are available to the user to analyse input data.

First we need to load java:

module load java

Now run the following command to see the pipeline options:

nextflow run epi2me-labs/wf-human-variation -profile singularity --help

N E X T F L O W  ~  version 23.12.0-edge
Launching `https://github.com/epi2me-labs/wf-human-variation` [amazing_fourier] DSL2 - revision: 5651930a05 [master]
WARN: Config setting `prov.formats` is not defined, no provenance reports will be produced

||||||||||   _____ ____ ___ ____  __  __ _____      _       _
||||||||||  | ____|  _ \_ _|___ \|  \/  | ____|    | | __ _| |__  ___
|||||       |  _| | |_) | |  __) | |\/| |  _| _____| |/ _` | '_ \/ __|
|||||       | |___|  __/| | / __/| |  | | |__|_____| | (_| | |_) \__ \
||||||||||  |_____|_|  |___|_____|_|  |_|_____|    |_|\__,_|_.__/|___/
||||||||||  wf-human-variation v2.2.0-g5651930
--------------------------------------------------------------------------------
Typical pipeline command:

  nextflow run epi2me-labs/wf-human-variation \ 
        --bam 'wf-human-variation-demo/demo.bam' \ 
        --basecaller_cfg 'dna_r10.4.1_e8.2_400bps_hac_prom' \ 
        --mod \ 
        --ref 'wf-human-variation-demo/demo.fasta' \ 
        --sample_name 'DEMO' \ 
        --snp \ 
        --sv

Workflow Options
  --sv                         [boolean] Call for structural variants.
  --snp                        [boolean] Call for small variants
  --cnv                        [boolean] Call for copy number variants.
  --str                        [boolean] Enable Straglr to genotype STR expansions.
  --mod                        [boolean] Enable output of modified calls to a bedMethyl file [requires input BAM with Ml and Mm tags]

Main options
  --sample_name                [string]  Sample name to be displayed in workflow outputs. [default: SAMPLE]
  --bam                        [string]  BAM or unaligned BAM (uBAM) files for the sample to use in the analysis.
  --ref                        [string]  Path to a reference FASTA file.
  --basecaller_cfg             [choice]  Name of the model to use for selecting a small variant calling model. [default: 
                                         dna_r10.4.1_e8.2_400bps_sup@v4.1.0] 
                                         * dna_r10.4.1_e8.2_260bps_fast@v4.1.0
                                         * dna_r10.4.1_e8.2_260bps_hac@v4.1.0
                                         * dna_r10.4.1_e8.2_260bps_sup@v4.1.0
                                         * dna_r10.4.1_e8.2_400bps_fast@v4.1.0
                                         * dna_r10.4.1_e8.2_400bps_fast@v4.2.0
                                         * dna_r10.4.1_e8.2_400bps_fast@v4.3.0
                                         * dna_r10.4.1_e8.2_400bps_hac@v4.1.0
                                         * dna_r10.4.1_e8.2_400bps_hac@v4.3.0
                                         * dna_r10.4.1_e8.2_400bps_sup@v4.1.0
                                         * dna_r10.4.1_e8.2_400bps_sup@v4.3.0
                                         * dna_r9.4.1_e8_fast@v3.4
                                         * dna_r9.4.1_e8_hac@v3.3
                                         * dna_r9.4.1_e8_sup@v3.3
                                         * dna_r9.4.1_e8_sup@v3.6
                                         * custom
                                         * dna_r10.4.1_e8.2_260bps_hac@v4.0.0
                                         * dna_r10.4.1_e8.2_260bps_sup@v4.0.0
                                         * dna_r10.4.1_e8.2_400bps_hac
                                         * dna_r10.4.1_e8.2_400bps_hac@v3.5.2
                                         * dna_r10.4.1_e8.2_400bps_hac@v4.0.0
                                         * dna_r10.4.1_e8.2_400bps_hac@v4.2.0
                                         * dna_r10.4.1_e8.2_400bps_hac_prom
                                         * dna_r10.4.1_e8.2_400bps_sup@v3.5.2
                                         * dna_r10.4.1_e8.2_400bps_sup@v4.0.0
                                         * dna_r10.4.1_e8.2_400bps_sup@v4.2.0
                                         * dna_r9.4.1_450bps_hac
                                         * dna_r9.4.1_450bps_hac_prom
  --bam_min_coverage           [number]  Minimum read coverage required to run analysis. [default: 20]
  --bed                        [string]  An optional BED file enumerating regions to process for variant calling.
  --annotation                 [boolean] SnpEff annotation. [default: true]
  --phased                     [boolean] Perform phasing.
  --include_all_ctgs           [boolean] Call for variants on all sequences in the reference, otherwise small and structural variants will only be called on 
                                         chr{1..22,X,Y,MT}. 
  --output_gene_summary        [boolean] If set to true, the workflow will generate gene-level coverage summaries.
  --out_dir                    [string]  Directory for output of all workflow results. [default: output]

Structural variant calling options
  --tr_bed                     [string]  Input BED file containing tandem repeat annotations for the reference genome.

Structural variant benchmarking options
  --sv_benchmark               [boolean] Benchmark called structural variants.

Copy number variant calling options
  --use_qdnaseq                [boolean] Use QDNAseq for CNV calling.
  --qdnaseq_bin_size           [choice]  Bin size for QDNAseq in kbp. [default: 500]
                                         * 1
                                         * 5
                                         * 10
                                         * 15
                                         * 30
                                         * 50
                                         * 100
                                         * 500
                                         * 1000

Modified base calling options
  --force_strand               [boolean] Require modkit to call strand-aware modifications.

Short tandem repeat expansion genotyping options
  --sex                        [choice]  Sex (XX or XY) to be passed to Straglr-genotype.
                                         * XY
                                         * XX

Advanced Options
  --depth_intervals            [boolean] Output a bedGraph file with entries for each genomic interval featuring homogeneous depth.
  --GVCF                       [boolean] Enable to output a gVCF file in addition to the VCF outputs (experimental).
  --downsample_coverage        [boolean] Downsample the coverage to along the genome.
  --downsample_coverage_target [number]  Average coverage or reads to use for the analyses. [default: 60]

Multiprocessing Options
  --threads                    [integer] Set max number of threads to use for more intense processes (limited by config executor cpus) [default: 4]
  --ubam_map_threads           [integer] Set max number of threads to use for aligning reads from uBAM (limited by config executor cpus) [default: 8]
  --ubam_sort_threads          [integer] Set max number of threads to use for sorting and indexing aligned reads from uBAM (limited by config executor cpus) 
                                         [default: 3] 
  --ubam_bam2fq_threads        [integer] Set max number of threads to use for uncompressing uBAM and generating FASTQ for alignment (limited by config executor 
                                         cpus) [default: 1] 
  --merge_threads              [integer] Set max number of threads to use for merging alignment files (limited by config executor cpus) [default: 4]
  --modkit_threads             [integer] Total number of threads to use in modkit modified base calling (limited by config executor cpus) [default: 4]

Miscellaneous Options
  --disable_ping               [boolean] Enable to prevent sending a workflow ping.

Other parameters
  --monochrome_logs            [boolean] null
  --validate_params            [boolean] null [default: true]
  --show_hidden_params         [boolean] null

!! Hiding 28 params, use --show_hidden_params to show them !!
--------------------------------------------------------------------------------
If you use epi2me-labs/wf-human-variation for your analysis please cite:

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

Running the pipeline

The epi2me-labs/wf-human-variation pipeline uses as input aligned BAM files. We will use the BAM file we generated in the “ONTvariants - mapping” exercise as the input for this pipeline.

Let’s change directory to the run3_variant_calling folder:

cd $HOME/workshop/ONTvariants/runs/run3_variant_calling

Copy the script for the exercise:

cp /work/traing/ONTvariants/scripts/launch_ONTvariants_epi2me_WF-HV.pbs .

Print the content of the script:

#!/bin/bash -l
#PBS -N run3_epi2me_WF-HV
#PBS -l select=1:ncpus=16:mem=32gb
#PBS -l walltime=48:00:00
#PBS -m abe

module load java
export NXF_OPTS='-Xms1g -Xmx4g'
cd $PBS_O_WORKDIR

nextflow run epi2me-labs/wf-human-variation \
    -r v2.2.0 \
    -profile singularity \
    --bam "/training/ONTvariants/runs/run2_mapping/SRR17138639_mapped_chr20.bam" \
    --basecaller_cfg "dna_r10.4.1_e8.2_400bps_hac_prom" \
    --mod \
    --ref "/training/ONTvariants/data/chr20.fasta" \
    --sample_name "SRR17138639" \
    --snp \
    --sv \
    --bam_min_coverage 15 \
    --out_dir "results"

Note:

Line 1: Defines that the script is a bash script.
Lines 2-5: Are commented out with “#” at the beginning and are ignored by bash, however, these PBS lines tell the scholar (PBS Pro) the name of the job (line 2), the number of CPUs and RAM memory to use (line 3), the time to run the script (line 4) and report if there are any errors (line 5).
Line 7: load java required to run nextflow pipelines.
Line 8: assign up to 4GB memory for the nextflow initial script to use.
Line 9: Tells the job to run on the current directory.
Lines 11-22: Parameters to run the epi2me-labs/wf-human-variation pipeline (refer above for details on each parameter)

Submit the QC job to the HPC cluster:

qsub launch_ONTvariants_mapping.pbs

Monitor the progress of the job:

qjobs

Once the pipeline has completed you will see the following set of output files in the ‘results’ folder:

.
├── execution
│   ├── report.html
│   ├── timeline.html
│   └── trace.txt
├── jbrowse.json
├── OPTIONAL_FILE
├── SRR17138639.flagstat.tsv
├── SRR17138639.mosdepth.global.dist.txt
├── SRR17138639.mosdepth.summary.txt
├── SRR17138639.readstats.tsv.gz
├── SRR17138639.regions.bed.gz
├── SRR17138639.stats.json
├── SRR17138639.thresholds.bed.gz
├── SRR17138639.wf-human-alignment-report.html
├── SRR17138639.wf-human-snp-report.html
├── SRR17138639.wf-human-sv-report.html
├── SRR17138639.wf_snp_clinvar.vcf
├── SRR17138639.wf_snp.vcf.gz
├── SRR17138639.wf_snp.vcf.gz.tbi
├── SRR17138639.wf_sv.vcf.gz
└── SRR17138639.wf_sv.vcf.gz.tbi

Let’s inspect the HTML reports for wf-human-alignment-report.html, wf-human-snp-report.htmland wf-human-sv-report.html.

NOTE: To proceed, you need to be on QUT’s WiFi network or signed via VPN.

To browse the working folder in the HPC type in the file finder:

Windows PC

\\hpc-fs\work\training\ONTvariants

Mac

smb://hpc-fs/work/training/ONTvariants

Now browse to the runs → run3_variant_calling folder → results folder and open the HTML reports.

├── SRR17138639.wf-human-alignment-report.html
├── SRR17138639.wf-human-snp-report.html
├── SRR17138639.wf-human-sv-report.html

Notes:

Transitions (Ti or Ts) vs transversions (Tv) mutations - typically a Whole Genome Sequencing (WGS) study finds a Ti/Tv ratio of 2.1, while exome studies detect Ti/Tv = 2.8

ClinVar = The pipeline reports mutations overlapping known Clinical variants of interest (see: wf-human-snp-report.html)
Structural variants : The dataset used in this workshop does not contain real SVs, rather it reports Insertions or Deletions in regions where there are “N” bases on chromosome 20. For example:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SRR17138639
chr20   71803   Sniffles2.DEL.10B2S0    TGAAAAGCTAAATTAAACTAATTAAGCTAAAG        N       39.0    PASS    PRECISE;SVTYPE=DEL;SVLEN=-32;END=7183
chr20   72147   Sniffles2.INS.7S0       N       AAAAGTCAAAAAATAACAGACACTGGTATACAGAAGAAAAGGAACACTTATACAC 40.0    PASS    PRECISE;SVTYPE=INS;SV
chr20   97733   Sniffles2.DEL.10B8S0    TAAGTCCCGCATGCATTAGCTATTTGTCTTAATGCTCTG N       39.0    PASS    PRECISE;SVTYPE=DEL;SVLEN=-39;END=9777
chr20   105784  Sniffles2.DEL.10BBS0    TGTTGGGCCTGGAGAGAGTGGGACCACCTTTGCCATGGGACTGGGGTGCATCTGTTCTGCAGGCCCTCCTACCTGTAGCCCCTCCGAAGGCCCCTGCCTAG
chr20   161017  Sniffles2.DEL.10C3S0    ATCCATTTCTTCTAGATTTTCTAGTTTATTTGTGTAGAGGTGTTTGTAGTATTCTCTGATGGTAGTTTGTATTTCTGTGGGATCGGTGGTGATATCCCCTT
chr20   173685  Sniffles2.INS.28S0      N       GGCAAGTACGGCACTGGGGGGCAGAACCCCCAACTTCTCCATGTCTCTACCCCTTCTCTTTTCTTGGGGAGACTGGCTTTTCCCAACCCCTTC
chr20   173712  Sniffles2.INS.27S0      N       TGTCTCTGCCCCTTCTCCACTTTTCTGGGGGCGAGAACCCCCAACCCCTTCTCCTTCACCCTTAGTGGCAATTACCGCTTTTCTGAGGGGCAA
chr20   174783  Sniffles2.INS.2BS0      N       GGAGCTTGCTACAAGCGCCAGAAATCTGGCCACCAGGCCAAGAATGTCCGCAGCCTGGGATTCCTCCTAAGCCGCGTCCCATCTGTGAAGGAC
chr20   175777  Sniffles2.INS.2DS0      N       GGATACTTTTTGACTTCGAAACCTGGTTTTGCCATCCTAATAAAACCATTATATAAACTCACAAAAAGGAAACCTAGCTGACCCCATAGATCC
chr20   176457  Sniffles2.INS.30S0      N       AATTGACTTTACTCACATGCCCCGGATCAGAAAACTAAAATACCTCTTAGTCTAGGTAGACACTTTCACTGGATAGGTAGGGCCTTTCCCACA
chr20   176457  Sniffles2.INS.2FS0      N       TGAGATGCTACAGGAGTGGTCCATTTGAACTTTTATATGGACACTTTCTTGCTTGGCCCCAACCTCATCCCAGACACCAGCCCTCTAGGTGAC
chr20   177062  Sniffles2.DEL.10CAS0    GCCCAACTACACACATCACTGAAACAATAGGAGCCTTCCAGCTACATATTACAGACAAGCCCTCTATCAATACTGGCAAACTTAAAAACATTAGCTGTAAT
chr20   178476  Sniffles2.DEL.10CDS0    GAAGTAACTGAAGAATCACCAAAGAAGTGAAAGTGGCCT N       59.0    PASS    PRECISE;SVTYPE=DEL;SVLEN=-39;END=1785
chr20   178495  Sniffles2.INS.37S0      N       AAAAGAATGAATATGCCCTGCCCCACCTTAACTGATGACATTCCACCACAAAGAAGTGTAAATGGCCGGTCATGCACCTTAACTGATGACATT
chr20   183223  Sniffles2.INS.38S0      N       ATCAAAAAGCCATTCAAATGGATTCACAGCTGAATTCTACCAGATGTATAAAGAACTGATACCAACTTATTGAAACTATTCCAAAATACGGAG
chr20   184725  Sniffles2.INS.3DS0      N       AAAGCATTGAGATGTTTATGTGTATGCATATCCAAAAAGCACAGCATAATCCTTTACATTGTCTATGATGCCAAGACCTTTGTTCACGTGTTT
chr20   185082  Sniffles2.INS.3BS0      N       AAGGAAGAAAACCAGGCTGGGCACAACGGCTCATGCCTCAAATCTCAATACATTGGCAAGCCAAGTAGAGGATCATTTGTTTCTCAGTTGTTC
chr20   190692  Sniffles2.DUP.2114S0    N       <DUP>   53.0    PASS    PRECISE;SVTYPE=DUP;SVLEN=25400936;END=25591628;SUPPORT=8;RNAMES=SRR17