ONTvariants - epi2me WF-Human Variation pipeline

For this exercise we will use the epi2me-labs/wf-human-variation pipeline. Find information on the pipeline at https://labs.epi2me.io/workflows/wf-human-variation/

The pipeline is designed to analyse human genomic data. Specifically the workflow can perform the following:

diploid variant calling
structural variant calling
analysis of modified base calls
copy number variant calling
short tandem repeat (STR) expansion genotyping

Compute requirements

Recommended requirements:

CPUs = 32
Memory = 128GB

Minimum requirements:

CPUs = 16
Memory = 32GB

Approximate run time: Variable depending on whether it is targeted sequencing or whole genome sequencing, as well as coverage and the individual analyses requested. For instance, a 90X human sample run (options: --snp --sv --mod --str --cnv --phased --sex male) takes less than 8h with recommended resources.

NOTE: in contrast to the nf-core/sarek pipeline that we used in session 2, the epi2me-labs/wf-human-variation pipeline runs in ‘local’ mode (needs large amount of CPUs and RAM memory), while the nf-core/sarek pipeline will use a ‘pbspro’ mode, where the pipeline will submit individual jobs to the HPC cluster and define the CPUs and memory for each task individually.

epi2me-labs/wf-human-variation

Nextflow pipelines have the --help option to print the parameter options that are available to the user to analyse input data.

First we need to load java:

module load java

Now run the following command to see the pipeline options:

nextflow run epi2me-labs/wf-human-variation -profile singularity --help

N E X T F L O W  ~  version 23.12.0-edge
Launching `https://github.com/epi2me-labs/wf-human-variation` [amazing_fourier] DSL2 - revision: 5651930a05 [master]
WARN: Config setting `prov.formats` is not defined, no provenance reports will be produced

||||||||||   _____ ____ ___ ____  __  __ _____      _       _
||||||||||  | ____|  _ \_ _|___ \|  \/  | ____|    | | __ _| |__  ___
|||||       |  _| | |_) | |  __) | |\/| |  _| _____| |/ _` | '_ \/ __|
|||||       | |___|  __/| | / __/| |  | | |__|_____| | (_| | |_) \__ \
||||||||||  |_____|_|  |___|_____|_|  |_|_____|    |_|\__,_|_.__/|___/
||||||||||  wf-human-variation v2.2.0-g5651930
--------------------------------------------------------------------------------
Typical pipeline command:

  nextflow run epi2me-labs/wf-human-variation \ 
        --bam 'wf-human-variation-demo/demo.bam' \ 
        --basecaller_cfg 'dna_r10.4.1_e8.2_400bps_hac_prom' \ 
        --mod \ 
        --ref 'wf-human-variation-demo/demo.fasta' \ 
        --sample_name 'DEMO' \ 
        --snp \ 
        --sv

Workflow Options
  --sv                         [boolean] Call for structural variants.
  --snp                        [boolean] Call for small variants
  --cnv                        [boolean] Call for copy number variants.
  --str                        [boolean] Enable Straglr to genotype STR expansions.
  --mod                        [boolean] Enable output of modified calls to a bedMethyl file [requires input BAM with Ml and Mm tags]

Main options
  --sample_name                [string]  Sample name to be displayed in workflow outputs. [default: SAMPLE]
  --bam                        [string]  BAM or unaligned BAM (uBAM) files for the sample to use in the analysis.
  --ref                        [string]  Path to a reference FASTA file.
  --basecaller_cfg             [choice]  Name of the model to use for selecting a small variant calling model. [default: 
                                         dna_r10.4.1_e8.2_400bps_sup@v4.1.0] 
                                         * dna_r10.4.1_e8.2_260bps_fast@v4.1.0
                                         * dna_r10.4.1_e8.2_260bps_hac@v4.1.0
                                         * dna_r10.4.1_e8.2_260bps_sup@v4.1.0
                                         * dna_r10.4.1_e8.2_400bps_fast@v4.1.0
                                         * dna_r10.4.1_e8.2_400bps_fast@v4.2.0
                                         * dna_r10.4.1_e8.2_400bps_fast@v4.3.0
                                         * dna_r10.4.1_e8.2_400bps_hac@v4.1.0
                                         * dna_r10.4.1_e8.2_400bps_hac@v4.3.0
                                         * dna_r10.4.1_e8.2_400bps_sup@v4.1.0
                                         * dna_r10.4.1_e8.2_400bps_sup@v4.3.0
                                         * dna_r9.4.1_e8_fast@v3.4
                                         * dna_r9.4.1_e8_hac@v3.3
                                         * dna_r9.4.1_e8_sup@v3.3
                                         * dna_r9.4.1_e8_sup@v3.6
                                         * custom
                                         * dna_r10.4.1_e8.2_260bps_hac@v4.0.0
                                         * dna_r10.4.1_e8.2_260bps_sup@v4.0.0
                                         * dna_r10.4.1_e8.2_400bps_hac
                                         * dna_r10.4.1_e8.2_400bps_hac@v3.5.2
                                         * dna_r10.4.1_e8.2_400bps_hac@v4.0.0
                                         * dna_r10.4.1_e8.2_400bps_hac@v4.2.0
                                         * dna_r10.4.1_e8.2_400bps_hac_prom
                                         * dna_r10.4.1_e8.2_400bps_sup@v3.5.2
                                         * dna_r10.4.1_e8.2_400bps_sup@v4.0.0
                                         * dna_r10.4.1_e8.2_400bps_sup@v4.2.0
                                         * dna_r9.4.1_450bps_hac
                                         * dna_r9.4.1_450bps_hac_prom
  --bam_min_coverage           [number]  Minimum read coverage required to run analysis. [default: 20]
  --bed                        [string]  An optional BED file enumerating regions to process for variant calling.
  --annotation                 [boolean] SnpEff annotation. [default: true]
  --phased                     [boolean] Perform phasing.
  --include_all_ctgs           [boolean] Call for variants on all sequences in the reference, otherwise small and structural variants will only be called on 
                                         chr{1..22,X,Y,MT}. 
  --output_gene_summary        [boolean] If set to true, the workflow will generate gene-level coverage summaries.
  --out_dir                    [string]  Directory for output of all workflow results. [default: output]

Structural variant calling options
  --tr_bed                     [string]  Input BED file containing tandem repeat annotations for the reference genome.

Structural variant benchmarking options
  --sv_benchmark               [boolean] Benchmark called structural variants.

Copy number variant calling options
  --use_qdnaseq                [boolean] Use QDNAseq for CNV calling.
  --qdnaseq_bin_size           [choice]  Bin size for QDNAseq in kbp. [default: 500]
                                         * 1
                                         * 5
                                         * 10
                                         * 15
                                         * 30
                                         * 50
                                         * 100
                                         * 500
                                         * 1000

Modified base calling options
  --force_strand               [boolean] Require modkit to call strand-aware modifications.

Short tandem repeat expansion genotyping options
  --sex                        [choice]  Sex (XX or XY) to be passed to Straglr-genotype.
                                         * XY
                                         * XX

Advanced Options
  --depth_intervals            [boolean] Output a bedGraph file with entries for each genomic interval featuring homogeneous depth.
  --GVCF                       [boolean] Enable to output a gVCF file in addition to the VCF outputs (experimental).
  --downsample_coverage        [boolean] Downsample the coverage to along the genome.
  --downsample_coverage_target [number]  Average coverage or reads to use for the analyses. [default: 60]

Multiprocessing Options
  --threads                    [integer] Set max number of threads to use for more intense processes (limited by config executor cpus) [default: 4]
  --ubam_map_threads           [integer] Set max number of threads to use for aligning reads from uBAM (limited by config executor cpus) [default: 8]
  --ubam_sort_threads          [integer] Set max number of threads to use for sorting and indexing aligned reads from uBAM (limited by config executor cpus) 
                                         [default: 3] 
  --ubam_bam2fq_threads        [integer] Set max number of threads to use for uncompressing uBAM and generating FASTQ for alignment (limited by config executor 
                                         cpus) [default: 1] 
  --merge_threads              [integer] Set max number of threads to use for merging alignment files (limited by config executor cpus) [default: 4]
  --modkit_threads             [integer] Total number of threads to use in modkit modified base calling (limited by config executor cpus) [default: 4]

Miscellaneous Options
  --disable_ping               [boolean] Enable to prevent sending a workflow ping.

Other parameters
  --monochrome_logs            [boolean] null
  --validate_params            [boolean] null [default: true]
  --show_hidden_params         [boolean] null

!! Hiding 28 params, use --show_hidden_params to show them !!
--------------------------------------------------------------------------------
If you use epi2me-labs/wf-human-variation for your analysis please cite:

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

Running the pipeline

The epi2me-labs/wf-human-variation pipeline uses as input aligned BAM files. We will use the BAM file we generated in the “ONTvariants - mapping” exercise as the input for this pipeline.

Let’s change directory to the run3_variant_calling folder:

cd $HOME/workshop/ONTvariants/runs/run3_variant_calling

Copy the script for the exercise:

cp /work/traing/ONTvariants/scripts/launch_epi2me-labs_wf-human-variation.pbs .

Print the content of the script:

#!/bin/bash -l
#PBS -N run3_epi2me_WF-HV
#PBS -l select=1:ncpus=16:mem=32gb
#PBS -l walltime=48:00:00
#PBS -m abe

module load java

export NXF_OPTS='-Xms1g -Xmx4g'
cd $PBS_O_WORKDIR

nextflow run epi2me-labs/wf-human-variation \
    -r v2.2.0 \
    -profile singularity \
    --bam "/training/ONTvariants/runs/run2_mapping/SRR17138639_mapped_chr20.bam" \
    --basecaller_cfg "dna_r10.4.1_e8.2_400bps_hac_prom" \
    --mod \
    --ref "/training/ONTvariants/data/chr20.fasta" \
    --sample_name "SRR17138639" \
    --snp \
    --sv \
    --bam_min_coverage 15 \
    --out_dir "results"