DKE 121 genome assembly
Aim
Assemble the genome for the Dengue DKE121 strain using public amplicon seq data.
Background
A new highly divergent strain of dengue serotype 4 that was isolated in 2007 finally got published, so we have decided to include it in the analysis as well. Should be relatively easy to accomplish. However, they uploaded the raw read data to GenBank instead of a consensus sequence. It is all in one file of paired Illumina reads. I have attempted to modify the ConsGenome script, but only really eliminated anything that said READ2 and was hoping to see if it would run. But it probably needs a bit more finesse than that. Ha. The fastq.qz file is in the folder /work/phylo/NGS/ConsGenome/DKE121_attempt/. Basically, all I need is a consensus sequence generated from their raw reads. Then I can replicate everything else that I have already done. The project itself is at https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR14530608 if you need to look at it for any reason.
Public data
Details of project can be found at https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR14530608
Run | Spots | Bases | Size | GC content | Published | Access Type |
---|---|---|---|---|---|---|
SRR14530608 | 1.0M | 512.6Mbp | 298.3M | 47.6% | 2021-05-15 | public |
Fetch public data using sra toolkit
prefetch SRR14530608
Split downloaded sra file into FASTQ
fastq-dump --split-files SRR14530608
remove adaptor sequences using eresearch nextflow/trimgalore
#!/bin/bash -l
#PBS -N nftrimgalore
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=48:00:00
#Use the current directory to run the workflow
cd $PBS_O_WORKDIR
module load java
NXF_OPTS='-Xms1g -Xmx4g'
#run the nextflow trimgalore tool to remove adaptor sequences
nextflow run /home/barrero/nextflow/mt18005/trimgalore/main.nf --indexfile index.csv
Initial QC results
Overall there is a substantial 5’end bias of nucleotide composition owing to the amplicon approach used to generate the sample sequences.
There is a number of sequences with quality scores below minimal recommended Q20 (99.0% accuracy)
Over-represented sequences show abundant poly Guanosine sequences
READ1
READ2
MultiQC
ConsGenome pipeline
Upon initial QC of FASTQ files, these where subjected to a modified version of the ConsGenome pipeliene to assembly a consensus sequence and to build a reference guided consensus genome:
#!/bin/bash -l
#PBS -N ConsGenome
#PBS -l walltime=24:00:00
#PBS -l mem=20gb
#PBS -l ncpus=8
#PBS -m bae
#PBS -M email@host
#PBS -j oe
#eResearch Office, Queensland University of Technology (QUT)
#Script : ConsGenome: A Virus Genome Assembly, Variant Calling and building a Consensus Genome workflow
#Created: 25/05/2021
#Last modified: 25/05/2021
## Usage: qsub launch_QC_Assembly_Mapping.pbs
cd $PBS_O_WORKDIR
conda activate ConsGenome
## USER DEFINED INFORMATION
## Provide a sample name and name of raw data files prior running the workflow
config=config.txt
. $config
printf 'SAMPLEID is %s\n' "$SAMPLEID"
printf 'READ1 is %s\n' "$READ1"
printf 'READ2 is %s\n' "$READ2"
printf 'REF is %s\n' "$REF"
printf 'REFNAME is %s\n' "$REFNAME"
########################################
## Automated assignment of QCed reads ##
########################################
R1=$(echo "$READ1" | cut -f 1 -d '.')
R2=$(echo "$READ2" | cut -f 1 -d '.')
QCREAD1=$(echo "${R1}_val_1.fq")
QCREAD2=$(echo "${R2}_val_2.fq")
########################################
### WORKFLOW
KMER=31_43_57
###################################
### PROCESS 1: QUALITY CONTROL ###
###################################
#make output folder
mkdir -p work
mkdir -p work/trimgalore
mkdir -p results
# Remove adaptors and poor quality bases/reads using trimgalore
#trim_galore --length 120 --paired -q 30 -a CTGTCTCTTATACACATCT --fastqc --clip_R1 10 --clip_R2 10 --three_prime_clip_R1 2 --three_prime_clip_R2 2 -o ./work/trimgalore ${READ1} ${READ2}
trim_galore --length 120 --paired -q 30 -a CTGTCTCTTATACACATCT --fastqc -o ./work/trimgalore ${READ1} ${READ2}
#uncompress file of QCed reads
gzip -d ./work/trimgalore/*.gz
##########################################
### PROCESS 2: DE NOVO GENOME ASSEMBLY ###
##########################################
#metaspades.py -o metaspades_${SAMPLEID}_${KMER} -k 31,43,57 --only-assembler -t 8 -1 ./trimgalore/${QCREAD1} -2 ./trimgalore/${QCREAD2}
mkdir -p work/trimgalore/100K
#fetch subset of 100K sequences
head -400000 ./work/trimgalore/${QCREAD1} > ./work/trimgalore/100K/${QCREAD1}
head -400000 ./work/trimgalore/${QCREAD2} > ./work/trimgalore/100K/${QCREAD2}
#mkdir -p work/trimgalore/2M
#fetch subset of 2M sequences
#head -8000000 ./work/trimgalore/${QCREAD1} > ./work/trimgalore/2M/${QCREAD1}
#head -8000000 ./work/trimgalore/${QCREAD2} > ./work/trimgalore/2M/${QCREAD2}
mkdir -p work/trimgalore/5M
#fetch subset of 5M sequences
head -20000000 ./work/trimgalore/${QCREAD1} > ./work/trimgalore/5M/${QCREAD1}
head -20000000 ./work/trimgalore/${QCREAD2} > ./work/trimgalore/5M/${QCREAD2}
#assembly
metaspades.py -o ./work/metaspades_${SAMPLEID}_${KMER}_100K -k 31,43,57 -t 8 -1 ./work/trimgalore/100K/${QCREAD1} -2 ./work/trimgalore/100K/${QCREAD2}
metaspades.py -o ./work/metaspades_${SAMPLEID}_${KMER}_ALL -k 31,43,57 -t 8 -1 ./work/trimgalore/${QCREAD1} -2 ./work/trimgalore/${QCREAD2}
###########################################
### PROCESS 3: COMPARISON TO REF GENOME ###
###########################################
#make new working directory
mkdir -p work/blastn_${SAMPLEID}_${KMER}
#copy final scaffolds to new file renamed with the sample ID
cp ./work/metaspades_${SAMPLEID}_${KMER}_100K/scaffolds.fasta ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_scaffolds.fasta
cp ./work/metaspades_${SAMPLEID}_${KMER}_ALL/scaffolds.fasta ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_scaffolds_ALL.fasta
#copy to 'results' (de novo assembled contigs)
cp ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_scaffolds.fasta ./results
cp ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_scaffolds_ALL.fasta ./results
#BLASTN
#blastn -query blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_scaffolds.fasta -subject /work/phylo/NGS/nextflow/genome/bowtie-index/NC_001475.2_DENV3ReferenceSequence.fasta -evalue 1e-20 -outfmt 6 -out blastn_${SAMPLEID}_vs_DENV3.txt
blastn -query ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_scaffolds.fasta -subject ${REF} -evalue 1e-20 -outfmt 6 -out ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_blastn_${REFNAME}_vs_scaffolds.txt
blastn -query ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_scaffolds_ALL.fasta -subject ${REF} -evalue 1e-20 -outfmt 6 -out ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_blastn_${REFNAME}_vs_scaffolds_ALL.txt
#copy to 'results' (blastN results of comparing Ref Genome against assembled scaffolds)
cp ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_blastn_${REFNAME}_vs_scaffolds.txt ./results
cp ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_blastn_${REFNAME}_vs_scaffolds_ALL.txt ./results
################################################
### PROCESS 4: Mapping onto reference genome ###
################################################
mkdir -p work/mapping
#align using BWA MEM, use 8 threads, allow soft clipping to increase mapping, outoput sam aln file
bwa mem -t 8 -Y ${REF} ./work/trimgalore/${QCREAD1} ./work/trimgalore/${QCREAD2} | awk '$3!="*"' > ./work/mapping/${SAMPLEID}.sam
# samtools: sort .sam file and convert to .bam file
samtools view -bS ./work/mapping/${SAMPLEID}.sam | samtools sort - -o ./work/mapping/${SAMPLEID}.bam
# Get consensus fastq file
#samtools mpileup -uf /work/phylo/NGS/nextflow/genome/bowtie-index/NC_001475.2_DENV3ReferenceSequence.fasta ${SAMPLEID}.bam | bcftools call -c | vcfutils.pl vcf2fq > ${SAMPLEID}_cns.fastq
samtools mpileup -uf ${REF} ./work/mapping/${SAMPLEID}.bam | bcftools call -c | vcfutils.pl vcf2fq > ./work/mapping/${SAMPLEID}_cns.fastq
# Convert .fastq to .fasta and set bases of quality lower than 20 to N
seqtk seq -aQ64 -q10 -n N ./work/mapping/${SAMPLEID}_cns.fastq > ./work/mapping/${SAMPLEID}_bwa_mapped_consensus.fasta
#copy to 'results' (consensus sequence based on bwa mapping, variant calling and filtering)
cp ./work/mapping/${SAMPLEID}_bwa_mapped_consensus.fasta ./results
#blastN new consensus against assembled scaffolds
blastn -query ./work/blastn_${SAMPLEID}_${KMER}/${SAMPLEID}_scaffolds.fasta -subject ./work/mapping/${SAMPLEID}_bwa_mapped_consensus.fasta -evalue 1e-20 -outfmt 6 -out ./work/mapping/${SAMPLEID}_blastn_consensus_vs_scaffolds.txt
#copy to 'results' (blastN of consensus vs assembled sacffolds)
cp ./work/mapping/${SAMPLEID}_blastn_consensus_vs_scaffolds.txt ./results
##################################
### PROCESS 5: Variant calling ###
##################################
mkdir -p work/variantCalling
#variant calling
bcftools mpileup -Ou -f ${REF} ./work/mapping/${SAMPLEID}.bam | bcftools call -mv -Oz -o ./work/variantCalling/${SAMPLEID}_calls.vcf.gz
bcftools index ./work/variantCalling/${SAMPLEID}_calls.vcf.gz -o ./work/variantCalling/${SAMPLEID}_calls.vcf.gz.csi
# normalize indels
bcftools norm -f ${REF} ./work/variantCalling/${SAMPLEID}_calls.vcf.gz -Ob -o ./work/variantCalling/${SAMPLEID}_calls.norm.bcf
# filter adjacent indels within 5bp
bcftools filter --IndelGap 5 ./work/variantCalling/${SAMPLEID}_calls.norm.bcf -Ob -o ./work/variantCalling/${SAMPLEID}_calls.norm.flt-indels.bcf
#convert bcf to vcf
bcftools view -Oz -o ./work/variantCalling/${SAMPLEID}_calls.norm.flt-indels.vcf ./work/variantCalling/${SAMPLEID}_calls.norm.flt-indels.bcf
#bcftools view -Oz -o ${SAMPLEID}_calls.norm.flt-LowQual.vcf ${SAMPLEID}_calls.norm.flt-LowQual.bcf
#convert bcf to vcf
bcftools view -Oz -o ./work/variantCalling/${SAMPLEID}_sequence_variants.vcf ./work/variantCalling/${SAMPLEID}_calls.norm.flt-indels.bcf
#copy to 'results' (sequence variants)
cp ./work/variantCalling/${SAMPLEID}_sequence_variants.vcf ./results
#record tools used
conda env export --from-history > ./results/environment.yml
echo "completed"
Associated config file:
## USER DEFINED INFORMATION
#sample ID or sample name
SAMPLEID=DKE_121
#Raw fastq files (can be compresed: *.fastq.gz or not: *.fq or *fastq)
READ1=SRR14530608_1.fastq
READ2=SRR14530608_2.fastq
#Ref Genome
#provide full path to fasta file. Example:
REF=/work/phylo/NGS/ConsGenome/genome/NC_002640.1_DENV4ReferenceSequence.fasta
#Specify Ref Genome name
REFNAME=DENV4ReferenceSequence