Aims
Implement a bioinformatics pipeline for the detection of EBV-integration sites in the human genome
Generate ONT simulated data
squigulator is a tool for simulating nanopore raw signal data. It is under active development, so its interface and default parameters may change. Read more here: https://github.com/hasindu2008/squigulator
Install a precompiled copy:
Code Block |
---|
VERSION=0.3.0
wget https://github.com/hasindu2008/squigulator/releases/download/v${VERSION}/squigulator-v${VERSION}-x86_64-linux-binaries.tar.gz
tar xf squigulator-v${VERSION}-x86_64-linux-binaries.tar.gz && cd squigulator-v${VERSION}
./squigulator --help |
Location: /work/GRC_collaborations/EBV/tools/squigulator-v0.3.0
PHASE 1: No mutations are introduced into the reference genomes prior to simulating ONT data.
Genomes:
GRCh38.p14
EBV_ASM240226v1
GRCh38.p14+EBV_ASM240226v1 (EBV integrated into the human genome)
GRCh38.p14 ONT simulated data:
Location:
Code Block |
---|
/work/GRC_collaborations/EBV/analysis/1.squigulator_simulated_data_NoVariants/run1_GRCh38.p14_human_genome |
Squigulator simulation:
Code Block |
---|
#!/bin/bash -l
#PBS -N squigulator_GRCh38.p14
#PBS -l walltime=48:00:00
#PBS -l mem=64gb
#PBS -l ncpus=32
#use current working directory
cd $PBS_O_WORKDIR
#################################################
## user defined variables
#################################################
SQUIGULATOR='/work/GRC_collaborations/EBV/tools/squigulator-v0.3.0/squigulator'
SAMPLEID='GRCh38.p14'
GENOME='/work/GRC_collaborations/EBV/Genomes/GCF_000001405.40_GRCh38.p14_genomic.fna'
COVERAGE=30
PROFILE='dna-r10-prom'
#################################################
#STEP1: Create simulated reads at 30X genome coverage
#example code:
#squigulator hg38noAlt.fa -x dna-r10-prom -o reads.blow5 -f 30
#we use the user defined variables above to modify the example code:
$SQUIGULATOR $GENOME -x $PROFILE \
-o ${SAMPLEID}_ONT_${PROFILE}_reads.blow5 \
-f $COVERAGE \
-t 32 \
-q ${SAMPLEID}_ONT_${PROFILE}_reads.fasta \
-c ${SAMPLEID}_ONT_${PROFILE}_reads_aln.paf \
-a ${SAMPLEID}_ONT_${PROFILE}_reads_aln.sam |
EBV_ASM240226v1 ONT simulated data:
Location:
Code Block |
---|
/work/GRC_collaborations/EBV/analysis/1.squigulator_simulated_data_NoVariants/run2_ASM240226v1_viral_genome |
Squigulator simulation:
Code Block |
---|
#!/bin/bash -l
#PBS -N squigulator_EBV
#PBS -l walltime=24:00:00
#PBS -l mem=32gb
#PBS -l ncpus=16
#use current working directory
cd $PBS_O_WORKDIR
#################################################
## user defined variables
#################################################
SQUIGULATOR='/work/GRC_collaborations/EBV/tools/squigulator-v0.3.0/squigulator'
SAMPLEID='EBV_ASM240226v1'
GENOME='/work/GRC_collaborations/EBV/Genomes/GCF_002402265.1_ASM240226v1_genomic.fna'
COVERAGE=30
PROFILE='dna-r10-prom'
#################################################
#STEP1: Create simulated reads at 30X genome coverage
#example code:
#squigulator hg38noAlt.fa -x dna-r10-prom -o reads.blow5 -f 30
#we use the user defined variables above to modify the example code:
$SQUIGULATOR $GENOME -x $PROFILE \
-o ${SAMPLEID}_ONT_${PROFILE}_reads.blow5 \
-f $COVERAGE \
-t 16 \
-q ${SAMPLEID}_ONT_${PROFILE}_reads.fasta \
-c ${SAMPLEID}_ONT_${PROFILE}_reads_aln.paf \
-a ${SAMPLEID}_ONT_${PROFILE}_reads_aln.sam |
GRCh38.p14+EBV_ASM240226v1 ONT simulated data:
Location:
Code Block |
---|
/work/GRC_collaborations/EBV/analysis/1.squigulator_simulated_data_NoVariants/run3_custom_genome_human+virus |
Squigulator simulation:
Code Block |
---|
#!/bin/bash -l
#PBS -N squigulator_GRCh38.p14_EBV
#PBS -l walltime=48:00:00
#PBS -l mem=64gb
#PBS -l ncpus=32
#use current working directory
cd $PBS_O_WORKDIR
#################################################
## user defined variables
#################################################
SQUIGULATOR='/work/GRC_collaborations/EBV/tools/squigulator-v0.3.0/squigulator'
SAMPLEID='GRCh38.p14_EBV_custom_genome'
GENOME='/work/GRC_collaborations/EBV/Genomes/custom_genome_one_GCF.fna'
COVERAGE=30
PROFILE='dna-r10-prom'
#################################################
#STEP1: Create simulated reads at 30X genome coverage
#example code:
#squigulator hg38noAlt.fa -x dna-r10-prom -o reads.blow5 -f 30
#we use the user defined variables above to modify the example code:
$SQUIGULATOR $GENOME -x $PROFILE \
-o ${SAMPLEID}_ONT_${PROFILE}_reads.blow5 \
-f $COVERAGE \
-t 32 \
-q ${SAMPLEID}_ONT_${PROFILE}_reads.fasta \
-c ${SAMPLEID}_ONT_${PROFILE}_reads_aln.paf \
-a ${SAMPLEID}_ONT_${PROFILE}_reads_aln.sam |
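The three PBS scripts above differ only in SAMPLEID, GENOME, and the thread count, and every output name is derived from SAMPLEID and PROFILE. A minimal sketch of that pattern as a single function follows. The function name `squigulator_cmd` and its argument order are hypothetical and not part of squigulator. It only prints the command (a dry run), so the output can be reviewed before it is pasted into a job script.

```shell
#!/bin/bash
# Hypothetical helper (not part of squigulator): prints the squigulator
# command used in the PBS scripts above for one sample.
# Arguments: SAMPLEID GENOME [COVERAGE] [PROFILE] [THREADS]
squigulator_cmd() {
    local sampleid=$1 genome=$2
    local coverage=${3:-30} profile=${4:-dna-r10-prom} threads=${5:-32}
    # All output files share this prefix, as in the scripts above
    local prefix="${sampleid}_ONT_${profile}_reads"
    # ${SQUIGULATOR:-squigulator} falls back to a binary found on PATH
    printf '%s %s -x %s -o %s.blow5 -f %s -t %s -q %s.fasta -c %s_aln.paf -a %s_aln.sam\n' \
        "${SQUIGULATOR:-squigulator}" "$genome" "$profile" \
        "$prefix" "$coverage" "$threads" "$prefix" "$prefix" "$prefix"
}

# Dry run: print the command for the EBV genome with 16 threads
squigulator_cmd EBV_ASM240226v1 GCF_002402265.1_ASM240226v1_genomic.fna 30 dna-r10-prom 16
```

Printing rather than executing keeps the helper safe to run on a login node; the printed line can be checked against the scripts above before submission.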
Outputs include:
BLOW5: simulated ONT data using the dna-r10-prom profile at 30X genome coverage (*reads.blow5)
FASTA: simulated reads with no errors (*reads.fasta)
PAF: alignment of the simulated reads (*reads_aln.paf)
SAM: alignment of the simulated reads (*reads_aln.sam)
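A quick consistency check on these outputs can catch a truncated run. The sketch below compares the number of FASTA records with the number of PAF records. The function name `check_read_counts` is hypothetical, and the check assumes squigulator writes exactly one PAF line per simulated read, which should be confirmed against a small test run.

```shell
#!/bin/bash
# Hypothetical helper: compare FASTA and PAF record counts for one run.
# Assumption: one PAF line is written per simulated read.
check_read_counts() {
    local fasta=$1 paf=$2
    local n_fasta n_paf
    n_fasta=$(grep -c '^>' "$fasta")   # one header line per read
    n_paf=$(grep -c '' "$paf")         # total number of lines in the PAF
    echo "FASTA reads: $n_fasta, PAF records: $n_paf"
    # Exit status is 0 only when both counts agree
    [ "$n_fasta" -eq "$n_paf" ]
}

# Example call (file names follow the naming scheme of the scripts above):
# check_read_counts EBV_ASM240226v1_ONT_dna-r10-prom_reads.fasta \
#                   EBV_ASM240226v1_ONT_dna-r10-prom_reads_aln.paf
```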
Converting BLOW5 to FAST5 data
First, let’s install ‘slow5tools’ from GitHub:
Code Block |
---|
VERSION=v1.1.0
wget "https://github.com/hasindu2008/slow5tools/releases/download/$VERSION/slow5tools-$VERSION-x86_64-linux-binaries.tar.gz" && tar xvf slow5tools-$VERSION-x86_64-linux-binaries.tar.gz && cd slow5tools-$VERSION/
./slow5tools |
Optionally, copy the ‘slow5tools’ executable to your home bin:
Code Block |
---|
cp slow5tools $HOME/bin/ |
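Copying the executable only helps if `$HOME/bin` is on your `PATH`. A small sketch for checking this follows; the function name `in_path` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if directory $1 is listed in $PATH
in_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

if ! in_path "$HOME/bin"; then
    # Print a hint instead of editing the shell configuration automatically
    echo "Add this line to ~/.bashrc: export PATH=\"\$HOME/bin:\$PATH\""
fi
```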
Example data:
Code Block |
---|
/work/GRC_collaborations/EBV/analysis/1.squigulator_simulated_data_NoVariants/coverage_5x/run2_ASM240226v1_viral_genome/EBV_ASM240226v1_ONT_dna-r10-prom_reads_5x.blow5 |
slow5tools parameters:
Code Block |
---|
./slow5tools --help
Usage: ./slow5tools [OPTIONS] [COMMAND] [ARG]
Tools for using slow5 files.
OPTIONS:
-h, --help Display this message and exit.
-v, --verbose Verbosity level.
-V, --version Output version information and exit.
--cite Prints the citation.
COMMANDS:
f2s or fast5toslow5 convert fast5 file(s) to SLOW5/BLOW5
s2f or slow5tofast5 convert SLOW5/BLOW5 file(s) to fast5
merge merge SLOW5/BLOW5 files
split split SLOW5/BLOW5 files
index create a SLOW5/BLOW5 index file
get display the read entry for each specified read id
view view the contents of a SLOW5/BLOW5 file or convert between different SLOW5/BLOW5 formats and compressions
stats prints statistics of a SLOW5/BLOW5 file to the stdout
cat quickly concatenate SLOW5/BLOW5 files of same type (same header, extension, compression)]
quickcheck quickly checks if a SLOW5/BLOW5 file is intact
skim skims through requested components in a SLOW5/BLOW5 file
ARGS: Try './slow5tools [COMMAND] --help' for more information. |
Convert slow5 to fast5 - options:
Code Block |
---|
Convert SLOW5/BLOW5 files to FAST5 format.
Usage: ./slow5tools s2f [OPTIONS] -d [OUT_DIR] [SLOW5_FILE/DIR] ...
OPTIONS:
-d, --out-dir DIR output to directory
-o, --output FILE output to FILE [stdout]
-p, --iop INT number of I/O processes [8]
-h, --help display this message and exit |
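According to the usage text above, `s2f` writes either to a single file (`-o`) or to a directory (`-d`). The sketch below picks the option from the input type: `-d` when the input is a directory of SLOW5/BLOW5 files, `-o` when it is a single file. The function name `s2f_args` and the `_fast5` output-directory suffix are hypothetical choices, not slow5tools conventions.

```shell
#!/bin/bash
# Hypothetical helper: print the s2f output option and input argument
# for a single BLOW5 file or a directory of SLOW5/BLOW5 files.
s2f_args() {
    local input=$1
    if [ -d "$input" ]; then
        # Directory input: write the FAST5 files into <input>_fast5
        printf -- '-d %s_fast5 %s\n' "${input%/}" "$input"
    else
        # Single file: replace the .blow5 extension with .fast5
        printf -- '-o %s.fast5 %s\n' "${input%.blow5}" "$input"
    fi
}

# Single-file example; prints the arguments only
s2f_args EBV_ASM240226v1_ONT_dna-r10-prom_reads_5x.blow5
```

The printed arguments can be appended to `slow5tools s2f -p 8`. Paths containing spaces would need an array-based variant, because this version relies on word splitting.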
Running slow5tools:
Code Block |
---|
#!/bin/bash -l
#PBS -N Convert_EBV
#PBS -l walltime=24:00:00
#PBS -l mem=16gb
#PBS -l ncpus=8
#PBS -m abe
#PBS -o convert_blow5_to_fast5_output.log
#PBS -e convert_blow5_to_fast5_error.log
#Positive control
###Simulated data: convert BLOW5 to FAST5 for wf-basecalling
#### Note: Adjust the path directories
# Change to the directory where the job was submitted
cd $PBS_O_WORKDIR
# Load the necessary modules
module load slow5tools
# Define the input file, output file, and tool path as variables
BLOW5_FILE='GRCh38.p14_ONT_dna-r10-prom_reads_5x.blow5'
FAST5_FILE='GRCh38.p14_ONT_dna-r10-prom_reads_5x.fast5'
SLOW5TOOLS='/work/GRC_collaborations/EBV/tools/slow5tools-v1.1.0/slow5tools'
# Ensure the output directory exists
# When multiple SLOW5/BLOW5 files are available, use the -d output_dir option rather than -o converted.fast5 (see the commented mkdir line below)
#mkdir -p ${OUT_DIR}
# Run the conversion
#slow5tools tofast5 -o ${FAST5_FILE} ${BLOW5_FILE}
$SLOW5TOOLS s2f -p 8 -o ${FAST5_FILE} ${BLOW5_FILE} |
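Since a 24-hour conversion job can fail late on a damaged input, it may help to verify the BLOW5 file first. The sketch below wraps the conversion with the `quickcheck` command listed in the slow5tools help above. The function name `convert_blow5` is hypothetical, and the sketch assumes `quickcheck` exits with a non-zero status when the file is not intact.

```shell
#!/bin/bash
# Hypothetical wrapper: verify a BLOW5 file, then convert it to FAST5.
# SLOW5TOOLS defaults to a slow5tools binary found on PATH.
convert_blow5() {
    local blow5=$1
    local fast5=${blow5%.blow5}.fast5       # output name derived from input
    local tool=${SLOW5TOOLS:-slow5tools}
    if [ ! -s "$blow5" ]; then
        echo "ERROR: $blow5 is missing or empty" >&2
        return 1
    fi
    # Assumption: quickcheck returns non-zero for a damaged file
    if ! "$tool" quickcheck "$blow5"; then
        echo "ERROR: $blow5 failed quickcheck" >&2
        return 1
    fi
    "$tool" s2f -p 8 -o "$fast5" "$blow5"
}

# Example call inside the PBS job:
# convert_blow5 GRCh38.p14_ONT_dna-r10-prom_reads_5x.blow5
```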