DKE121 genome assembly
Aim
Assemble the genome of the dengue DKE121 strain using public amplicon sequencing data.
Background
A new, highly divergent strain of dengue serotype 4, isolated in 2007, has finally been published, so we have decided to include it in the analysis as well. This should be relatively easy to accomplish. However, the authors uploaded the raw read data to NCBI instead of a consensus sequence, and it is all in one file of paired Illumina reads. I have attempted to modify the ConsGenome script, but so far I have only removed everything that referred to READ2, hoping to see whether it would run. It probably needs a bit more finesse than that. The fastq.gz file is in the folder /work/phylo/NGS/ConsGenome/DKE121_attempt/. Basically, all I need is a consensus sequence generated from their raw reads; then I can replicate everything else that I have already done. The project itself is at https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR14530608 if you need to look at it for any reason.
Public data
Details of the project can be found at https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR14530608
Run | Spots | Bases | Size | GC content | Published | Access Type |
---|---|---|---|---|---|---|
SRR14530608 | 1.0M | 512.6Mbp | 298.3M | 47.6% | 2021-05-15 | public |
Fetch the public data using the SRA Toolkit
prefetch SRR14530608
Split the downloaded SRA file into FASTQ files (one per mate)
fastq-dump --split-files SRR14530608
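Before trimming, it may be worth confirming that the split produced two mate files with the same number of reads. The following is a minimal sketch, not part of the original workflow; it assumes the default `fastq-dump --split-files` output names (`SRR14530608_1.fastq` and `SRR14530608_2.fastq`) and unwrapped FASTQ records of exactly four lines each. The helper names `count_reads` and `check_pairs` are hypothetical.

```shell
#!/bin/bash
# Count reads in a FASTQ file, assuming 4 lines per record (no wrapped sequences).
count_reads() {
    local n
    n=$(wc -l < "$1")
    echo $(( n / 4 ))
}

# Compare the read counts of the two mate files.
# Prints a summary and returns non-zero when the counts differ.
check_pairs() {
    local r1 r2
    r1=$(count_reads "$1")
    r2=$(count_reads "$2")
    if [ "$r1" -eq "$r2" ]; then
        echo "OK: $r1 read pairs"
    else
        echo "MISMATCH: $r1 vs $r2 reads" >&2
        return 1
    fi
}

# Run the check only if the expected output files are present (assumed names).
if [ -f SRR14530608_1.fastq ] && [ -f SRR14530608_2.fastq ]; then
    check_pairs SRR14530608_1.fastq SRR14530608_2.fastq
fi
```

A mismatch here would suggest an incomplete download or split, which is easier to catch at this point than after trimming.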
Remove adaptor sequences using the eResearch nextflow/trimgalore workflow
#!/bin/bash -l
#PBS -N nftrimgalore
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=48:00:00
#Use the current directory to run the workflow
cd "$PBS_O_WORKDIR"
module load java
#Export the JVM options so that the nextflow launcher picks them up
export NXF_OPTS='-Xms1g -Xmx4g'
#Run the nextflow trimgalore tool to remove adaptor sequences
nextflow run /home/barrero/nextflow/mt18005/trimgalore/main.nf --indexfile index.csv
Initial QC results
Overall, there is a substantial 5′-end bias in nucleotide composition, owing to the amplicon approach used to generate the sample sequences.
A number of sequences have quality scores below the minimum recommended Q20 (99% base-call accuracy).
Over-represented sequences include abundant poly-G (polyguanosine) sequences.
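To put a rough number on the poly-G observation, one could count how many reads contain a long run of G's. This is only a sketch and not how the QC report above was generated; the function name `polyg_reads` is hypothetical, and the default threshold of 20 consecutive G's is an illustrative assumption rather than a value taken from the QC output. It assumes an uncompressed FASTQ file with four lines per record.

```shell
#!/bin/bash
# Print "<reads with a G run of at least min_run> <total reads>" for a FASTQ file.
# Usage: polyg_reads reads.fastq [min_run]   (min_run defaults to 20, an assumed value)
polyg_reads() {
    local fastq min_run
    fastq="$1"
    min_run="${2:-20}"
    awk -v n="$min_run" '
        # Build a string of n consecutive G characters to search for.
        BEGIN { run = ""; for (i = 0; i < n; i++) run = run "G" }
        # Line 2 of every 4-line record is the sequence.
        NR % 4 == 2 { total++; if (index($0, run) > 0) hits++ }
        END { printf "%d %d\n", hits, total }
    ' "$fastq"
}
```

For example, `polyg_reads SRR14530608_1.fastq` would report the number of affected reads next to the total for READ1, and the same call on the second mate file gives the READ2 figure.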
READ1
READ2
MultiQC
ConsGenome pipeline
After initial QC of the FASTQ files, they were subjected to a modified version of the ConsGenome pipeline to assemble a consensus sequence and to build a reference-guided consensus genome:
Associated config file: