DKE121 genome assembly

Aim

Assemble a consensus genome for the dengue virus strain DKE121 from public amplicon sequencing data.

Background

A new, highly divergent strain of dengue serotype 4, isolated in 2007, has finally been published, so we have decided to include it in the analysis as well. It should be relatively easy to accomplish. However, the authors uploaded the raw read data to the SRA instead of a consensus sequence. It is all in one file of paired Illumina reads. I attempted to modify the ConsGenome script, but only really removed anything that said READ2 in the hope that it would run; it probably needs a bit more finesse than that. The fastq.gz file is in the folder /work/phylo/NGS/ConsGenome/DKE121_attempt/. Basically, all I need is a consensus sequence generated from their raw reads; then I can replicate everything else that I have already done. The project itself is at https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR14530608 if you need to look at it for any reason.

Public data

Details of the project can be found at https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR14530608

Run          Spots  Bases     Size    GC content  Published   Access Type
SRR14530608  1.0M   512.6Mbp  298.3M  47.6%       2021-05-15  public
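As a rough sanity check, the run statistics above give a nominal sequencing depth. The sketch below divides the reported 512.6 Mbp by an assumed genome length of about 10,700 bases. That length is a typical dengue virus genome size, not a value taken from this project, and `nominal_depth` is a hypothetical helper name. Amplicon coverage is uneven, so the real per-position depth will vary widely around this figure.

```shell
#!/bin/sh
# Nominal depth = total sequenced bases / genome length.
# $1 = total bases in the run, $2 = genome length in bases (assumed value).
nominal_depth() {
    awk -v bases="$1" -v glen="$2" 'BEGIN { printf "%.0f\n", bases / glen }'
}

# 512.6 Mbp from the run table; 10700 is an assumed dengue genome length.
nominal_depth 512600000 10700   # prints 47907
```

A nominal depth in the tens of thousands suggests that read count is unlikely to be the limiting factor for building a consensus.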

Fetch public data using the SRA Toolkit

prefetch SRR14530608

Split the downloaded SRA file into FASTQ files

fastq-dump --split-files SRR14530608
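With `--split-files`, the paired reads should end up in `SRR14530608_1.fastq` and `SRR14530608_2.fastq`. Before trimming, it can be worth confirming that both files hold the same number of reads. The following is a minimal sketch of such a check; `count_reads` and `check_pairs` are hypothetical helper names, and the code assumes plain four-line FASTQ records with no line wrapping.

```shell
#!/bin/sh
# Count the reads in a FASTQ file (assumes 4 lines per record).
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Print both counts and succeed only when the mate files match.
check_pairs() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "R1=${n1} R2=${n2}"
    [ "$n1" -eq "$n2" ]
}

# Example call, run only when the split files are present.
if [ -f SRR14530608_1.fastq ] && [ -f SRR14530608_2.fastq ]; then
    check_pairs SRR14530608_1.fastq SRR14530608_2.fastq ||
        echo "Mate files differ in read count" >&2
fi
```

A mismatch here would point to an incomplete download or a split problem, which is cheaper to catch now than after the pipeline has run.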

Remove adaptor sequences using eResearch nextflow/trimgalore

#!/bin/bash -l
#PBS -N nftrimgalore
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=48:00:00

# Use the current directory to run the workflow
cd "$PBS_O_WORKDIR"
module load java
# Export the JVM memory settings so that nextflow can see them
export NXF_OPTS='-Xms1g -Xmx4g'

# Run the nextflow trimgalore tool to remove adaptor sequences
nextflow run /home/barrero/nextflow/mt18005/trimgalore/main.nf --indexfile index.csv

Initial QC results

  • Overall, there is a substantial nucleotide composition bias at the 5′ end of the reads, owing to the amplicon approach used to generate the sample sequences.

  • A number of reads have quality scores below the minimum recommended Q20 (99% base-call accuracy).

  • The over-represented sequences include abundant poly-G (polyguanosine) sequences.
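The last two observations can be put into numbers directly from a FASTQ file. The sketch below is not part of the QC tooling used here; `qc_summary` is a hypothetical helper that assumes Phred+33 quality encoding and four-line records. It reports the total number of reads, the number whose mean quality is below a threshold (default Q20), and the number containing a run of G bases of at least a given length (default 20, an arbitrary choice).

```shell
#!/bin/sh
# Usage: qc_summary file.fastq [min_mean_q] [min_polyG_run]
# Prints: total_reads low_quality_reads polyG_reads
qc_summary() {
    awk -v minq="${2:-20}" -v glen="${3:-20}" '
        BEGIN {
            # Lookup table from printable character to ASCII code.
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
            # Build a string of glen consecutive G bases to search for.
            for (i = 0; i < glen; i++) polyg = polyg "G"
        }
        # Sequence line: flag reads that contain the poly-G run.
        NR % 4 == 2 { if (index($0, polyg) > 0) npolyg++ }
        # Quality line: compute the mean Phred score (Phred+33 assumed).
        NR % 4 == 0 {
            total++
            sum = 0
            n = length($0)
            for (i = 1; i <= n; i++) sum += ord[substr($0, i, 1)] - 33
            if (n > 0 && sum / n < minq) lowq++
        }
        END { printf "%d %d %d\n", total, lowq, npolyg }
    ' "$1"
}

# Example call, run only when the read file is present.
if [ -f SRR14530608_1.fastq ]; then
    qc_summary SRR14530608_1.fastq 20 20
fi
```

Comparing these counts before and after trimming gives a quick indication of how much of the low-quality and poly-G content the trimming step removed.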

 

QC reports: READ1, READ2, MultiQC

ConsGenome pipeline

After initial QC, the FASTQ files were run through a modified version of the ConsGenome pipeline to assemble a consensus sequence and to build a reference-guided consensus genome:

Associated config file:

 
