Aims
Implement a bioinformatics pipeline for the detection of EBV integration sites in the human genome
Generate ONT simulated data
squigulator is a tool for simulating nanopore raw signal data. It is under active development, so its interface and default parameters may change between versions. Read more here: https://github.com/hasindu2008/squigulator
Install a precompiled copy:
VERSION=0.3.0
wget https://github.com/hasindu2008/squigulator/releases/download/v${VERSION}/squigulator-v${VERSION}-x86_64-linux-binaries.tar.gz
tar xf squigulator-v${VERSION}-x86_64-linux-binaries.tar.gz && cd squigulator-v${VERSION}
./squigulator --help
Location: /work/GRC_collaborations/EBV/tools/squigulator-v0.3.0
PHASE 1: No mutations were introduced into the reference genomes prior to simulating the ONT data.
Genomes:
GRCh38.p14
EBV_ASM240226v1
GRCh38.p14+EBV_ASM240226v1 (EBV integrated into the human genome)
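Before simulating reads, it can help to confirm which sequences each reference FASTA actually contains, in particular that the EBV sequence is present in the combined genome. The sketch below is not part of the original workflow: `fasta_lengths` is a hypothetical helper name, and it relies only on awk to print each sequence name with its length in bases.

```bash
#!/bin/bash
# Hypothetical helper (not part of the original workflow):
# print "<sequence name><TAB><length in bases>" for every record in a FASTA file.
fasta_lengths() {
    awk '
        /^>/ {
            # A new header: report the previous record, then start a new one.
            if (seen) print name "\t" len
            name = substr($1, 2)   # first word of the header, without ">"
            len = 0
            seen = 1
            next
        }
        { len += length($0) }      # sequence line: add its bases to the total
        END { if (seen) print name "\t" len }
    ' "$1"
}

# Example usage (path taken from the job scripts in this document):
# fasta_lengths /work/GRC_collaborations/EBV/Genomes/GCF_002402265.1_ASM240226v1_genomic.fna
```

Run on the combined reference, the output should list the human sequences plus whichever sequence carries the EBV insert. How that file was built is not described here, so treat the check as informational only.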
GRCh38.p14 ONT simulated data:
Location:
/work/GRC_collaborations/EBV/analysis/1.squigulator_simulated_data_NoVariants/run1_GRCh38.p14_human_genome
Squigulator simulation:
#!/bin/bash -l
#PBS -N squigulator_GRCh38.p14
#PBS -l walltime=48:00:00
#PBS -l mem=64gb
#PBS -l ncpus=32
#use current working directory
cd $PBS_O_WORKDIR
#################################################
## user defined variables
#################################################
SQUIGULATOR='/work/GRC_collaborations/EBV/tools/squigulator-v0.3.0/squigulator'
SAMPLEID='GRCh38.p14'
GENOME='/work/GRC_collaborations/EBV/Genomes/GCF_000001405.40_GRCh38.p14_genomic.fna'
COVERAGE=30
PROFILE='dna-r10-prom'
#################################################
#STEP1: Create simulated reads at 30X genome coverage
#example code:
#squigulator hg38noAlt.fa -x dna-r10-prom -o reads.blow5 -f 30
#we use the user defined variables above to modify the example code:
$SQUIGULATOR $GENOME -x $PROFILE \
-o ${SAMPLEID}_ONT_${PROFILE}_reads.blow5 \
-f $COVERAGE \
-t 32 \
-q ${SAMPLEID}_ONT_${PROFILE}_reads.fasta \
-c ${SAMPLEID}_ONT_${PROFILE}_reads_aln.paf \
-a ${SAMPLEID}_ONT_${PROFILE}_reads_aln.sam
EBV_ASM240226v1 ONT simulated data:
Location:
/work/GRC_collaborations/EBV/analysis/1.squigulator_simulated_data_NoVariants/run2_ASM240226v1_viral_genome
Squigulator simulation:
#!/bin/bash -l
#PBS -N squigulator_EBV
#PBS -l walltime=24:00:00
#PBS -l mem=32gb
#PBS -l ncpus=16
#use current working directory
cd $PBS_O_WORKDIR
#################################################
## user defined variables
#################################################
SQUIGULATOR='/work/GRC_collaborations/EBV/tools/squigulator-v0.3.0/squigulator'
SAMPLEID='EBV_ASM240226v1'
GENOME='/work/GRC_collaborations/EBV/Genomes/GCF_002402265.1_ASM240226v1_genomic.fna'
COVERAGE=30
PROFILE='dna-r10-prom'
#################################################
#STEP1: Create simulated reads at 30X genome coverage
#example code:
#squigulator hg38noAlt.fa -x dna-r10-prom -o reads.blow5 -f 30
#we use the user defined variables above to modify the example code:
$SQUIGULATOR $GENOME -x $PROFILE \
-o ${SAMPLEID}_ONT_${PROFILE}_reads.blow5 \
-f $COVERAGE \
-t 16 \
-q ${SAMPLEID}_ONT_${PROFILE}_reads.fasta \
-c ${SAMPLEID}_ONT_${PROFILE}_reads_aln.paf \
-a ${SAMPLEID}_ONT_${PROFILE}_reads_aln.sam
GRCh38.p14+EBV_ASM240226v1 ONT simulated data:
Location:
/work/GRC_collaborations/EBV/analysis/1.squigulator_simulated_data_NoVariants/run3_custom_genome_human+virus
Squigulator simulation:
#!/bin/bash -l
#PBS -N squigulator_GRCh38.p14_EBV
#PBS -l walltime=48:00:00
#PBS -l mem=64gb
#PBS -l ncpus=32
#use current working directory
cd $PBS_O_WORKDIR
#################################################
## user defined variables
#################################################
SQUIGULATOR='/work/GRC_collaborations/EBV/tools/squigulator-v0.3.0/squigulator'
SAMPLEID='GRCh38.p14_EBV_custom_genome'
GENOME='/work/GRC_collaborations/EBV/Genomes/custom_genome_one_GCF.fna'
COVERAGE=30
PROFILE='dna-r10-prom'
#################################################
#STEP1: Create simulated reads at 30X genome coverage
#example code:
#squigulator hg38noAlt.fa -x dna-r10-prom -o reads.blow5 -f 30
#we use the user defined variables above to modify the example code:
$SQUIGULATOR $GENOME -x $PROFILE \
-o ${SAMPLEID}_ONT_${PROFILE}_reads.blow5 \
-f $COVERAGE \
-t 32 \
-q ${SAMPLEID}_ONT_${PROFILE}_reads.fasta \
-c ${SAMPLEID}_ONT_${PROFILE}_reads_aln.paf \
-a ${SAMPLEID}_ONT_${PROFILE}_reads_aln.sam
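The three job scripts above differ only in SAMPLEID, GENOME, the thread count and the PBS resources. As a minimal sketch, assuming the same squigulator flags used in those scripts, the command could be generated from one function. `build_squigulator_cmd` is a hypothetical name, and the function only prints the command (a dry run) rather than executing it.

```bash
#!/bin/bash
# Hypothetical helper: print the squigulator command used in the job scripts
# for one sample, so that all runs share a single definition.
# Arguments: <squigulator_path> <genome_fasta> <sample_id> <profile> <coverage> <threads>
build_squigulator_cmd() {
    local squigulator="$1" genome="$2" sampleid="$3"
    local profile="$4" coverage="$5" threads="$6"
    # Same output naming scheme as the scripts: <SAMPLEID>_ONT_<PROFILE>_reads*
    local prefix="${sampleid}_ONT_${profile}_reads"
    printf '%s %s -x %s -o %s.blow5 -f %s -t %s -q %s.fasta -c %s_aln.paf -a %s_aln.sam\n' \
        "$squigulator" "$genome" "$profile" "$prefix" "$coverage" "$threads" \
        "$prefix" "$prefix" "$prefix"
}

# Dry run: print the command for the EBV genome without executing it.
build_squigulator_cmd squigulator GCF_002402265.1_ASM240226v1_genomic.fna \
    EBV_ASM240226v1 dna-r10-prom 30 16
```

Printing the command first makes it easy to check the output names before submitting a job. The sketch does not quote its arguments in the printed command, so it assumes paths without spaces.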
Outputs include:
BLOW5: simulated ONT raw signal data generated with the dna-r10-prom profile at 30X genome coverage (*reads.blow5)
FASTA: the simulated reads, written with no errors (*reads.fasta)
PAF: alignment of the simulated reads in PAF format (*reads_aln.paf)
SAM: alignment of the simulated reads in SAM format (*reads_aln.sam)
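A quick sanity check on the outputs is to count the simulated reads and estimate the fold coverage they represent, which should be close to the requested 30X. The sketch below works only from the error-free FASTA output. The helper names `count_fasta_reads` and `estimate_coverage` are hypothetical, and the genome size in bases must be supplied by the user.

```bash
#!/bin/bash
# Hypothetical helpers for a quick post-run check; names are illustrative only.

# Number of reads in a FASTA file (one header line per read).
count_fasta_reads() {
    grep -c '^>' "$1"
}

# Approximate fold coverage: total read bases divided by the genome size.
# Arguments: <reads_fasta> <genome_size_in_bases>
estimate_coverage() {
    awk -v gsize="$2" '
        !/^>/ { total += length($0) }               # sum bases on sequence lines
        END   { printf "%.2f\n", total / gsize }    # print with two decimals
    ' "$1"
}

# Example usage (file name follows the naming scheme of the job scripts;
# replace GENOME_SIZE with the length of the reference in bases):
# count_fasta_reads EBV_ASM240226v1_ONT_dna-r10-prom_reads.fasta
# estimate_coverage EBV_ASM240226v1_ONT_dna-r10-prom_reads.fasta GENOME_SIZE
```

The estimate is approximate, since it ignores how evenly the reads are spread over the reference.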
Converting BLOW5 to FAST5 data
First, let’s install ‘slow5tools’ from GitHub:
VERSION=v1.1.0
wget "https://github.com/hasindu2008/slow5tools/releases/download/$VERSION/slow5tools-$VERSION-x86_64-linux-binaries.tar.gz" && tar xvf slow5tools-$VERSION-x86_64-linux-binaries.tar.gz && cd slow5tools-$VERSION/
./slow5tools
Optionally, copy the ‘slow5tools’ executable to your home bin:
cp slow5tools $HOME/bin/
Example data:
/work/GRC_collaborations/EBV/analysis/1.squigulator_simulated_data_NoVariants/coverage_5x/run2_ASM240226v1_viral_genome/EBV_ASM240226v1_ONT_dna-r10-prom_reads_5x.blow5
slow5tools parameters:
./slow5tools --help
Usage: ./slow5tools [OPTIONS] [COMMAND] [ARG]
Tools for using slow5 files.
OPTIONS:
    -h, --help       Display this message and exit.
    -v, --verbose    Verbosity level.
    -V, --version    Output version information and exit.
    --cite           Prints the citation.
COMMANDS:
    f2s or fast5toslow5    convert fast5 file(s) to SLOW5/BLOW5
    s2f or slow5tofast5    convert SLOW5/BLOW5 file(s) to fast5
    merge                  merge SLOW5/BLOW5 files
    split                  split SLOW5/BLOW5 files
    index                  create a SLOW5/BLOW5 index file
    get                    display the read entry for each specified read id
    view                   view the contents of a SLOW5/BLOW5 file or convert between different SLOW5/BLOW5 formats and compressions
    stats                  prints statistics of a SLOW5/BLOW5 file to the stdout
    cat                    quickly concatenate SLOW5/BLOW5 files of same type (same header, extension, compression)]
    quickcheck             quickly checks if a SLOW5/BLOW5 file is intact
    skim                   skims through requested components in a SLOW5/BLOW5 file
ARGS:
    Try './slow5tools [COMMAND] --help' for more information.
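The `s2f` command in the listing above is the one that converts BLOW5 to FAST5. As a hedged sketch, the function below prints (but does not run) a conversion command for one BLOW5 file. `build_s2f_cmd` and the output directory name `fast5_out` are hypothetical. The `-d` output-directory option is assumed from the slow5tools documentation and should be confirmed with `./slow5tools s2f --help` for the installed version.

```bash
#!/bin/bash
# Hypothetical helper: print (do not run) a BLOW5 -> FAST5 conversion command.
# Assumption: "s2f" accepts "-d <output_dir>"; confirm with "./slow5tools s2f --help".
# Arguments: <slow5tools_path> <input_blow5> <output_dir>
build_s2f_cmd() {
    local slow5tools="$1" blow5="$2" outdir="$3"
    printf '%s s2f %s -d %s\n' "$slow5tools" "$blow5" "$outdir"
}

# Dry run for the example data file listed in this section.
build_s2f_cmd ./slow5tools EBV_ASM240226v1_ONT_dna-r10-prom_reads_5x.blow5 fast5_out
```

Running `./slow5tools quickcheck` on the input file first (listed under COMMANDS above) is a cheap way to catch a truncated BLOW5 file before conversion.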