/
DKE 121 genome assembly

DKE 121 genome assembly

Aim

Assemble the genome for the Dengue DKE121 strain using public amplicon seq data.

Background

A new highly divergent strain of dengue serotype 4 that was isolated in 2007 finally got published, so we have decided to include it in the analysis as well. Should be relatively easy to accomplish. However, they uploaded the raw read data to GenBank instead of a consensus sequence. It is all in one file of paired Illumina reads. I have attempted to modify the ConsGenome script, but only really eliminated anything that said READ2 and was hoping to see if it would run. But it probably needs a bit more finesse than that. Ha. The fastq.qz file is in the folder /work/phylo/NGS/ConsGenome/DKE121_attempt/. Basically, all I need is a consensus sequence generated from their raw reads. Then I can replicate everything else that I have already done. The project itself is at https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR14530608  if you need to look at it for any reason.

Public data

Details of project can be found at https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR14530608

Run

Spots

Bases

Size

GC content

Published

Access Type

Run

Spots

Bases

Size

GC content

Published

Access Type

SRR14530608

1.0M

512.6Mbp

298.3M

47.6%

2021-05-15

public

Fetch public data using sra toolkit

prefetch SRR14530608

Split downloaded sra file into FASTQ

fastq-dump --split-files SRR14530608

remove adaptor sequences using eresearch nextflow/trimgalore

#!/bin/bash -l #PBS -N nftrimgalore #PBS -l select=1:ncpus=2:mem=4gb #PBS -l walltime=48:00:00 #Use the current directory to run the workflow cd $PBS_O_WORKDIR module load java NXF_OPTS='-Xms1g -Xmx4g' #run the nextflow trimgalore tool to remove adaptor sequences nextflow run /home/barrero/nextflow/mt18005/trimgalore/main.nf --indexfile index.csv

Initial QC results

  • Overall there is a substantial 5’end bias of nucleotide composition owing to the amplicon approach used to generate the sample sequences.

  • There is a number of sequences with quality scores below minimal recommended Q20 (99.0% accuracy)

  • Over-represented sequences show abundant poly Guanosine sequences

 

READ1

READ2

MultiQC

ConsGenome pipeline

Upon initial QC of FASTQ files, these where subjected to a modified version of the ConsGenome pipeliene to assembly a consensus sequence and to build a reference guided consensus genome:

#!/bin/bash -l #PBS -N ConsGenome #PBS -l walltime=24:00:00 #PBS -l mem=20gb #PBS -l ncpus=8 #PBS -m bae #PBS -M email@host #PBS -j oe #eResearch Office, Queensland University of Technology (QUT) #Script : ConsGenome: A Virus Genome Assembly, Variant Calling and building a Consensus Genome workflow #Created: 25/05/2021 #Last modified: 25/05/2021 ## Usage: qsub launch_QC_Assembly_Mapping.pbs cd $PBS_O_WORKDIR conda activate ConsGenome ## USER DEFINED INFORMATION ## Provide a sample name and name of raw data files prior running the workflow config=config.txt . $config printf 'SAMPLEID is %s\n' "$SAMPLEID" printf 'READ1 is %s\n' "$READ1" printf 'READ2 is %s\n' "$READ2" printf 'REF is %s\n' "$REF" printf 'REFNAME is %s\n' "$REFNAME" ######################################## ## Automated assignment of QCed reads ## ######################################## R1=$(echo "$READ1" | cut -f 1 -d '.') R2=$(echo "$READ2" | cut -f 1 -d '.') QCREAD1=$(echo "${R1}_val_1.fq") QCREAD2=$(echo "${R2}_val_2.fq") ######################################## ### WORKFLOW KMER=31_43_57 ################################### ### PROCESS 1: QUALITY CONTROL ### ################################### #make output folder mkdir -p work mkdir -p work/trimgalore mkdir -p results # Remove adaptors and poor quality bases/reads using trimgalore #trim_galore --length 120 --paired -q 30 -a CTGTCTCTTATACACATCT --fastqc --clip_R1 10 --clip_R2 10 --three_prime_clip_R1 2 --three_prime_clip_R2 2 -o ./work/trimgalore ${READ1} ${READ2} trim_galore --length 120 --paired -q 30 -a CTGTCTCTTATACACATCT --fastqc -o ./work/trimgalore ${READ1} ${READ2} #uncompress file of QCed reads gzip -d ./work/trimgalore/*.gz ########################################## ### PROCESS 2: DE NOVO GENOME ASSEMBLY ### ########################################## #metaspades.py -o metaspades_${SAMPLEID}_${KMER} -k 31,43,57 --only-assembler -t 8 -1 ./trimgalore/${QCREAD1} -2 ./trimgalore/${QCREAD2} mkdir -p work/trimgalore/100K #fetch subset of 100K sequences head -400000 ./work/trimgalore/${QCREAD1} > ./work/trimgalore/100K/${QCREAD1} head -400000 ./work/trimgalore/${QCREAD2} > ./work/trimgalore/100K/${QCREAD2} #mkdir -p work/trimgalore/2M #fetch subset of 2M sequences #head -8000000 ./work/trimgalore/${QCREAD1} > ./work/trimgalore/2M/${QCREAD1} #head -8000000 ./work/trimgalore/${QCREAD2} > ./work/trimgalore/2M/${QCREAD2} mkdir -p work/trimgalore/5M #fetch subset of 5M sequences head -20000000 ./work/trimgalore/${QCREAD1} > ./work/trimgalore/5M/${QCREAD1} head -20000000 ./work/trimgalore/${QCREAD2} > ./work/trimgalore/5M/${QCREAD2} #assembly metaspades.py -o ./work/metaspades_${SAMPLEID}_${KMER}_100K -k 31,43,57 -t 8 -1 ./work/trimgalore/100K/${QCREAD1} -2 ./work/trimgalore/100K/${QCREAD2} metaspades.py -o ./work/metaspades_${SAMPLEID}_${KMER}_ALL -k 31,43,57 -t 8 -1 ./work/trimgalore/${QCREAD1} -2 ./work/trimgalore/${QCREAD2} ########################################### ### PROCESS 3: COMPARISON TO REF GENOME ### ########################################### #make new working directory mkdir -p work/blastn_${SAMPLEID}_${KMER} #copy final scaffolds to new file renamed with the sample ID cp ./work/metaspades_${SAMPLEID}_${KMER}_100K/scaffolds.fasta ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_scaffolds.fasta cp ./work/metaspades_${SAMPLEID}_${KMER}_ALL/scaffolds.fasta ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_scaffolds_ALL.fasta #copy to 'results' (de novo assembled contigs) cp ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_scaffolds.fasta ./results cp ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_scaffolds_ALL.fasta ./results #BLASTN #blastn -query blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_scaffolds.fasta -subject /work/phylo/NGS/nextflow/genome/bowtie-index/NC_001475.2_DENV3ReferenceSequence.fasta -evalue 1e-20 -outfmt 6 -out blastn_${SAMPLEID}_vs_DENV3.txt blastn -query ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_scaffolds.fasta -subject ${REF} -evalue 1e-20 -outfmt 6 -out ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_blastn_${REFNAME}_vs_scaffolds.txt blastn -query ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_scaffolds_ALL.fasta -subject ${REF} -evalue 1e-20 -outfmt 6 -out ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_blastn_${REFNAME}_vs_scaffolds_ALL.txt #copy to 'results' (blastN results of comparing Ref Genome against assembled scaffolds) cp ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_blastn_${REFNAME}_vs_scaffolds.txt ./results cp ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_blastn_${REFNAME}_vs_scaffolds_ALL.txt ./results ################################################ ### PROCESS 4: Mapping onto reference genome ### ################################################ mkdir -p work/mapping #align using BWA MEM, use 8 threads, allow soft clipping to increase mapping, outoput sam aln file bwa mem -t 8 -Y ${REF} ./work/trimgalore/${QCREAD1} ./work/trimgalore/${QCREAD2} | awk '$3!="*"' > ./work/mapping/${SAMPLEID}.sam # samtools: sort .sam file and convert to .bam file samtools view -bS ./work/mapping/${SAMPLEID}.sam | samtools sort - -o ./work/mapping/${SAMPLEID}.bam # Get consensus fastq file #samtools mpileup -uf /work/phylo/NGS/nextflow/genome/bowtie-index/NC_001475.2_DENV3ReferenceSequence.fasta ${SAMPLEID}.bam | bcftools call -c | vcfutils.pl vcf2fq > ${SAMPLEID}_cns.fastq samtools mpileup -uf ${REF} ./work/mapping/${SAMPLEID}.bam | bcftools call -c | vcfutils.pl vcf2fq > ./work/mapping/${SAMPLEID}_cns.fastq # Convert .fastq to .fasta and set bases of quality lower than 20 to N seqtk seq -aQ64 -q10 -n N ./work/mapping/${SAMPLEID}_cns.fastq > ./work/mapping/${SAMPLEID}_bwa_mapped_consensus.fasta #copy to 'results' (consensus sequence based on bwa mapping, variant calling and filtering) cp ./work/mapping/${SAMPLEID}_bwa_mapped_consensus.fasta ./results #blastN new consensus against assembled scaffolds blastn -query ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_scaffolds.fasta -subject ./work/mapping/${SAMPLEID}_bwa_mapped_consensus.fasta -evalue 1e-20 -outfmt 6 -out ./work/mapping/${SAMPLEID}_blastn_consensus_vs_scaffolds.txt #copy to 'results' (blastN of consensus vs assembled sacffolds) cp ./work/mapping/${SAMPLEID}_blastn_consensus_vs_scaffolds.txt ./results ################################## ### PROCESS 5: Variant calling ### ################################## mkdir -p work/variantCalling #variant calling bcftools mpileup -Ou -f ${REF} ./work/mapping/${SAMPLEID}.bam | bcftools call -mv -Oz -o ./work/variantCalling/${SAMPLEID}_calls.vcf.gz bcftools index ./work/variantCalling/${SAMPLEID}_calls.vcf.gz -o ./work/variantCalling/${SAMPLEID}_calls.vcf.gz.csi # normalize indels bcftools norm -f ${REF} ./work/variantCalling/${SAMPLEID}_calls.vcf.gz -Ob -o ./work/variantCalling/${SAMPLEID}_calls.norm.bcf # filter adjacent indels within 5bp bcftools filter --IndelGap 5 ./work/variantCalling/${SAMPLEID}_calls.norm.bcf -Ob -o ./work/variantCalling/${SAMPLEID}_calls.norm.flt-indels.bcf #convert bcf to vcf bcftools view -Oz -o ./work/variantCalling/${SAMPLEID}_calls.norm.flt-indels.vcf ./work/variantCalling/${SAMPLEID}_calls.norm.flt-indels.bcf #bcftools view -Oz -o ${SAMPLEID}_calls.norm.flt-LowQual.vcf ${SAMPLEID}_calls.norm.flt-LowQual.bcf #convert bcf to vcf bcftools view -Oz -o ./work/variantCalling/${SAMPLEID}_sequence_variants.vcf ./work/variantCalling/${SAMPLEID}_calls.norm.flt-indels.bcf #copy to 'results' (sequence variants) cp ./work/variantCalling/${SAMPLEID}_sequence_variants.vcf ./results #record tools used conda env export --from-history > ./results/environment.yml echo "completed"

Associated config file:

## USER DEFINED INFORMATION #sample ID or sample name SAMPLEID=DKE_121 #Raw fastq files (can be compresed: *.fastq.gz or not: *.fq or *fastq) READ1=SRR14530608_1.fastq READ2=SRR14530608_2.fastq #Ref Genome #provide full path to fasta file. Example: REF=/work/phylo/NGS/ConsGenome/genome/NC_002640.1_DENV4ReferenceSequence.fasta #Specify Ref Genome name REFNAME=DENV4ReferenceSequence

 

Related content