EBV genome integration data analysis

Aims

  • Implement a bioinformatics pipeline for the detection of EBV-integration sites in the human genome

Generate ONT simulated data

squigulator is a tool for simulating nanopore raw signal data. It is under active development, so its interface and default parameters may change. Read more here: https://github.com/hasindu2008/squigulator

Install a precompiled copy:

VERSION=0.3.0
wget https://github.com/hasindu2008/squigulator/releases/download/v${VERSION}/squigulator-v${VERSION}-x86_64-linux-binaries.tar.gz
tar xf squigulator-v${VERSION}-x86_64-linux-binaries.tar.gz && cd squigulator-v${VERSION}
./squigulator --help

Location: /work/GRC_collaborations/EBV/tools/squigulator-v0.3.0

PHASE 1: No mutations are introduced into the reference genomes prior to simulation of ONT data.

Genomes:

  • GRCh38.p14

  • EBV_ASM240226v1

  • GRCh38.p14+EBV_ASM240226v1 (EBV integrated into the human genome)
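The three references differ greatly in size, so the same fold coverage yields very different amounts of simulated data. The following is a minimal sketch for estimating that amount from a FASTA file; the helper names `genome_size` and `expected_bases` are hypothetical and not part of squigulator.

```shell
#!/bin/bash
# Sketch: total sequence length of a FASTA file (header lines are skipped).
# genome_size is a hypothetical helper name.
genome_size() {
    awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n }' "$1"
}

# Sketch: approximate number of bases simulated at a given fold coverage
# (genome length multiplied by coverage; default coverage is 30).
expected_bases() {
    local fasta=$1 coverage=${2:-30}
    echo $(( $(genome_size "$fasta") * coverage ))
}
```

For example, `expected_bases /path/to/genome.fna 30` prints the approximate total number of bases expected for that reference, where the path is a placeholder.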

GRCh38.p14 ONT simulated data:

Location:

/work/GRC_collaborations/EBV/analysis/1.squigulator_simulated_data_NoVariants/run1_GRCh38.p14_human_genome

Squigulator simulation:

#!/bin/bash -l
#PBS -N squigulator_GRCh38.p14
#PBS -l walltime=48:00:00
#PBS -l mem=64gb
#PBS -l ncpus=32

#use current working directory
cd $PBS_O_WORKDIR

#################################################
## user defined variables
#################################################
SQUIGULATOR='/work/GRC_collaborations/EBV/tools/squigulator-v0.3.0/squigulator'
SAMPLEID='GRCh38.p14'
GENOME='/work/GRC_collaborations/EBV/Genomes/GCF_000001405.40_GRCh38.p14_genomic.fna'
COVERAGE=30
PROFILE='dna-r10-prom'
#################################################

#STEP1: Create simulated reads at 30X genome coverage
#example code:
#squigulator hg38noAlt.fa -x dna-r10-prom -o reads.blow5 -f 30
#we use the user defined variables above to modify the example code:
$SQUIGULATOR $GENOME -x $PROFILE \
  -o ${SAMPLEID}_ONT_${PROFILE}_reads.blow5 \
  -f $COVERAGE \
  -t 32 \
  -q ${SAMPLEID}_ONT_${PROFILE}_reads.fasta \
  -c ${SAMPLEID}_ONT_${PROFILE}_reads_aln.paf \
  -a ${SAMPLEID}_ONT_${PROFILE}_reads_aln.sam
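The GRCh38.p14 script hard-codes a single genome. As a minimal sketch, the same call could be wrapped in a function so that it can be reused for the other references. The function name `squigulator_cmd` and the genome path in the example are hypothetical placeholders, not paths from this project. The function only prints the command, so it can be inspected before anything is run.

```shell
#!/bin/bash
# Sketch only: prints the squigulator command for one sample instead of running it.
# squigulator_cmd is a hypothetical helper; the flags mirror the GRCh38.p14 script.
squigulator_cmd() {
    local sampleid=$1 genome=$2
    local profile=${3:-dna-r10-prom} coverage=${4:-30} threads=${5:-32}
    local prefix="${sampleid}_ONT_${profile}_reads"
    printf '%s %s -x %s -o %s.blow5 -f %s -t %s -q %s.fasta -c %s_aln.paf -a %s_aln.sam\n' \
        "${SQUIGULATOR:-squigulator}" "$genome" "$profile" "$prefix" \
        "$coverage" "$threads" "$prefix" "$prefix" "$prefix"
}

# Example with a placeholder genome path (assumption: replace with the real FASTA).
squigulator_cmd EBV_ASM240226v1 /path/to/EBV_ASM240226v1.fna
```

Setting the `SQUIGULATOR` variable to the full path of the binary, as in the PBS script, makes the printed command use that path; otherwise it falls back to `squigulator` on the PATH.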

EBV_ASM240226v1 ONT simulated data:

Location:

Squigulator simulation:

GRCh38.p14+EBV_ASM240226v1 ONT simulated data:

Locations:

Squigulator simulation:

Outputs include:

  • BLOW5: simulated ONT raw signal data, generated with the dna-r10-prom profile at 30X genome coverage (*reads.blow5)

  • FASTA: the simulated reads as sequences, without errors (*reads.fasta)

  • PAF: alignments of the simulated reads in PAF format (*reads_aln.paf)

  • SAM: alignments of the simulated reads in SAM format (*reads_aln.sam)
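After a run finishes, it can be useful to confirm that all four outputs were written and to count the simulated reads. The following is a minimal sketch; the helper names `count_fasta_reads` and `check_outputs` are hypothetical, and the file suffixes follow the naming used in the simulation script.

```shell
#!/bin/bash
# Sketch: count simulated reads in a FASTA file (one '>' header line per read).
count_fasta_reads() {
    grep -c '^>' "$1"
}

# Sketch: report which of the four expected outputs exist and are non-empty
# for a given file prefix. The return status is the number of missing files.
check_outputs() {
    local prefix=$1 suffix missing=0
    for suffix in reads.blow5 reads.fasta reads_aln.paf reads_aln.sam; do
        if [ -s "${prefix}_${suffix}" ]; then
            echo "OK       ${prefix}_${suffix}"
        else
            echo "MISSING  ${prefix}_${suffix}"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}
```

For the GRCh38.p14 run, the prefix would be `GRCh38.p14_ONT_dna-r10-prom`, so `check_outputs GRCh38.p14_ONT_dna-r10-prom` run from the output directory lists each file as OK or MISSING.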

Converting BLOW5 to FAST5 data

First, let’s install ‘slow5tools’ from GitHub:

Optionally, copy the ‘slow5tools’ executable to your home bin:

Example data:

slow5tools parameters:

Convert slow5 to fast5 - options:

Running slow5tools:
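The original command is not reproduced here. As a minimal sketch, assuming slow5tools is available on the PATH, the conversion could be wrapped as below using the `s2f` subcommand with `-d` for the output directory. The function name `blow5_to_fast5` is hypothetical, and the input and output names are placeholders.

```shell
#!/bin/bash
# Sketch: convert a BLOW5 file to FAST5 with 'slow5tools s2f'.
# blow5_to_fast5 is a hypothetical helper; paths passed in are placeholders.
blow5_to_fast5() {
    local blow5=$1 outdir=$2
    # Fail early if the tool is not installed or not on the PATH.
    if ! command -v slow5tools >/dev/null 2>&1; then
        echo "slow5tools not found on PATH" >&2
        return 127
    fi
    # Fail early if the input file does not exist.
    if [ ! -f "$blow5" ]; then
        echo "input not found: $blow5" >&2
        return 1
    fi
    # s2f writes the FAST5 output into the directory given with -d.
    slow5tools s2f "$blow5" -d "$outdir"
}
```

A call would look like `blow5_to_fast5 reads.blow5 fast5_out`, where both names are placeholders for the actual simulated data and output directory.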