General information

The viral surveillance and diagnosis (VSD) bioinformatics toolkit is a pipeline based on the scientific workflow manager Nextflow.

It is designed to support phytosanitary diagnostics of virus and viroid pathogens in quarantine facilities. It takes small RNA-Seq samples as input.

Installation

Install Nextflow

The VSD workflow requires Nextflow to be installed in your account on the HPC. Find details here on how to install and test Nextflow, prepare a nextflow.config file and run a PBS Pro submission script for Nextflow pipelines.

Clone the git repo using:

Code Block
git clone file:///work/pipelines/eresearch/vsd -b vsd-1.0

Install conda3 or miniconda3

https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html

...

Code Block
name: vsd-1.0
channels:
  - bioconda
  - conda-forge
  - defaults
  - r
  - anaconda
dependencies:
  - blast=2.11.0
  - cap3=10.2011
  - spades=3.14.0
  - emboss=6.6.0
  - openjdk=8.0.152
  - fastp=0.20.1
  - biopython=1.76
  - numpy=1.16.5
  - matplotlib=2.2.3
  - velvet=1.2.10
  - bowtie=1.3.0
  - samtools=1.12
  - picard==2.25.6
  - bedtools
  - bcftools
  - pandas
  - fastqc=0.11.9
  - cutadapt=3.5
  - umi_tools=1.1.2

Install a local NCBI blast directory (NT and NR)

Find detailed information on how to download these databases at https://www.ncbi.nlm.nih.gov/books/NBK569850/

Make sure the taxdb.btd and the taxdb.bti files are also present in the directory.
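A quick way to confirm that both taxonomy files are in place is a small shell check. This is only a sketch: the function name check_taxdb is hypothetical and is not part of the VSD pipeline or of BLAST+.

```shell
# Hypothetical helper (not part of the pipeline): report whether the two
# BLAST taxonomy files named above exist and are non-empty in a directory.
check_taxdb() {
    local db_dir="$1"
    local f
    local missing=0
    for f in taxdb.btd taxdb.bti; do
        if [ ! -s "${db_dir}/${f}" ]; then
            echo "Missing or empty: ${db_dir}/${f}" >&2
            missing=1
        fi
    done
    # Exit status 0 when both files are present, 1 otherwise.
    return "$missing"
}
```

For example, `check_taxdb blastDB/30112021` returns a non-zero exit status if either file is missing.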

Create a folder, named with the download date, in which to store your NCBI databases. For instance:

Code Block
mkdir -p blastDB/30112021
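If you prefer not to type the date by hand, the folder name can be derived from the current date. The sketch below assumes the example name above follows a DDMMYYYY pattern; the function name make_blastdb_dir and the default base folder are illustrative only.

```shell
# Hypothetical helper: create <base_dir>/<DDMMYYYY> for today's date and
# print the path that was created. The DDMMYYYY layout is an assumption
# based on the example folder name 30112021.
make_blastdb_dir() {
    local base_dir="${1:-blastDB}"
    local stamp
    stamp="$(date +%d%m%Y)"
    mkdir -p "${base_dir}/${stamp}" && echo "${base_dir}/${stamp}"
}
```

Calling `make_blastdb_dir blastDB` prints the new folder path, which you can then `cd` into before submitting the download script.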

Run the following PBS script in the newly created folder. Use the update_blastdb.pl script from the blast+ version you will use with your pipeline.

Code Block
#!/bin/bash -l
#PBS -N blastdb_download
#PBS -l walltime=24:00:00
#PBS -l mem=60gb
#PBS -l ncpus=2

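cd "$PBS_O_WORKDIR"
# Download and decompress the NT and NR databases
perl update_blastdb.pl --decompress nt
perl update_blastdb.pl --decompress nr
# Download the taxonomy database and extract taxdb.btd and taxdb.bti
perl update_blastdb.pl taxdb
tar -xzf taxdb.tar.gz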

The VSD workflow

The VSD workflow will perform the following steps by default:

  • Retain reads of a given length (e.g. 21-22 or 24 nt long) from the fastq file(s) listed in the index.csv file (readprocessing)

  • De novo assembly using kmer 15 and coverage 3 (velvet)

  • Collapse contigs into scaffolds (min length 20) (cap3)

  • Run megablast homology search against NCBI NT database (megablast_nt_velvet)

  • Summarise megablast results and restrict to virus and viroid matches (BlastTools_megablast_velvet)

  • Derive coverage statistics, consensus sequence and VCF matching to top blast hits (filter_n_cov)
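To illustrate what the read size selection step does, here is a minimal awk sketch that keeps only the reads whose length falls within a given range. It is not the pipeline's implementation (the output folder names suggest the pipeline uses cutadapt for this step); the function name filter_fastq_by_length is hypothetical, and the sketch assumes an uncompressed fastq with exactly four lines per record.

```shell
# Hypothetical illustration of read size selection.
# Usage: filter_fastq_by_length MIN MAX < in.fastq > out.fastq
# Assumes an uncompressed FASTQ with exactly four lines per record.
filter_fastq_by_length() {
    awk -v min="$1" -v max="$2" '
        NR % 4 == 1 { header = $0 }   # read identifier line
        NR % 4 == 2 { seq = $0 }      # sequence line
        NR % 4 == 3 { plus = $0 }     # separator line
        NR % 4 == 0 {                 # quality line completes the record
            n = length(seq)
            if (n >= min + 0 && n <= max + 0) {
                print header
                print seq
                print plus
                print $0
            }
        }
    '
}
```

For instance, `filter_fastq_by_length 21 22 < sample.fastq > sample_21-22nt.fastq` would retain only 21 and 22 nt reads (the file names here are placeholders).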

...

Code Block
params {
  blastlocaldb = true
  spades = true
  contamination_detection = true
}

Preparing the data

Preparing an index.csv file

You need to create a TAB delimited text file that will be the input for the workflow. By default the pipeline will look for a file called "index.csv" in the base directory, but you can specify any file name using --indexfile [filename] in the nextflow run command. This text file requires the following columns, which need to be included as a header: sampleid, samplepath, minlen, maxlen:
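As a sketch of what such a file could look like, the helper below writes a two-sample index file and a second helper checks its layout. Everything here is illustrative: the function names, the sample paths and the size ranges are placeholders, and the check assumes the four columns are separated by TAB characters as described above.

```shell
# Hypothetical helper: write an example TAB delimited index file.
# The sample paths are placeholders, not real locations.
write_example_index() {
    local out="${1:-index.csv}"
    {
        printf 'sampleid\tsamplepath\tminlen\tmaxlen\n'
        printf 'MT020\t/path/to/MT020.fastq\t21\t22\n'
        printf 'MT029\t/path/to/MT029.fastq\t24\t24\n'
    } > "$out"
}

# Hypothetical helper: check that the header matches the required columns
# and that every row has four fields with numeric minlen <= maxlen.
check_index() {
    awk -F '\t' '
        NR == 1 {
            if ($0 != "sampleid\tsamplepath\tminlen\tmaxlen") {
                print "Unexpected header: " $0
                bad = 1
            }
            next
        }
        NF != 4 || $3 !~ /^[0-9]+$/ || $4 !~ /^[0-9]+$/ || $3 + 0 > $4 + 0 {
            print "Line " NR " looks malformed: " $0
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Running `check_index index.csv` before submitting the job gives an early warning if the file was saved with the wrong delimiter.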

...

Code Block
blastn_local_db = '/work/hia_mt18005/databases/PVirDB/PVriDB_ver2021_11_09/PVirDB_ver20211109.fasta'

Running the pipeline

Finally, you need to create a PBS script that includes your nextflow run command. An example PBS script is included in the base directory and will run the pipeline with the default steps:

...

Code Block
qsub nextflow_example.pbs

Monitoring the run

You can check the status of your jobs with the command:

Code Block
qstat -u $USER

...

Finally, if you have configured the connection to Nextflow Tower (NFTower), you can log on and check your run there.

Outputs

Under the results folder, the pipeline will populate outputs under separate folders for each step. These will be stored in subfolders for each sample:

...

→ sample 3

etc…

The folders are structured as follows (example output file names are given after each description):

  • 01_read_size_selection: cutadapt log file and fastq file containing only the reads that match the size specified in the index.csv file. Examples: MT020_21-22nt_cutadapt.log, MT020_21-22nt.fastq

  • 02_velvet: velvet results and the fasta file containing the velvet assembled contigs. Example: MT020_velvet_assembly_21-22nt.fasta

  • 02a_spades: present only if spades is additionally run

  • 03_cap3: fasta file of the scaffolds produced by CAP3 as well as the singletons. Example: MT020_velvet_cap3_21-22nt_rename.fasta

  • 04_blastn: all blastn results, filtered results limited to the top 5 virus and viroid hit matches, and their taxonomy. Examples: MT020_velvet_21-22nt_megablast_vs_NT.bls, MT020_velvet_21-22nt_megablast_vs_NT_top5Hits.txt, MT020_velvet_21-22nt_megablast_vs_NT_top5Hits_virus_viroids_final.txt, MT020_velvet_21-22nt_megablast_vs_NT_top5Hits_virus_viroids_seq_ids_taxonomy.txt

  • 05_blastoutputs: BlastTools.jar summary output, which clusters all the contigs matching a specific hit. Example: summary_MT029_velvet_21-22nt_megablast_vs_NT_top5Hits_virus_viroids_final.txt

  • 06_blastp: blastp outputs. Examples: MT020_velvet_21-22nt_getorf.min50aa.fasta, MT020_velvet_21-22nt_getorf.min50aa_blastp_vs_NR_out_virus_viroid.txt

  • 07_filternstats: filtered blast summary with various coverage statistics for each virus and viroid hit, and the associated consensus fasta file and vcf file. Examples: MT020_21-22nt_top_scoring_targets_with_cov_stats.txt, MT020_21-22nt_MK929590_Peach_latent_mosaic_viroid.consensus.fasta, MT020_21-22nt_MK929590_Peach_latent_mosaic_viroid_sequence_variants.vcf.gz

  • 08_report summary: summary of results for all samples included in the index.csv file, including a cross-contamination prediction. Example: run_top_scoring_targets_with_cov_stats_with_cont_flag_21-22nt_0.01.txt
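Because the outputs are spread over several step and sample folders, it can be handy to collect the per-sample coverage tables in one listing. The sketch below searches by file name only, so it makes no assumption about how the step and sample folders are nested; the function name list_cov_stats and the default folder name "results" are illustrative.

```shell
# Hypothetical helper: list every per-sample coverage statistics table
# found anywhere under a results folder, sorted by path. The file name
# pattern is taken from the example output name listed above.
list_cov_stats() {
    local results_dir="${1:-results}"
    find "$results_dir" -type f \
        -name '*_top_scoring_targets_with_cov_stats.txt' | sort
}
```

The run-level report file is not matched by this pattern because its name carries the additional _with_cont_flag suffix.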

Future potential additional features:

  • Include a deduplication step for fastq files that have UMIs incorporated

  • Incorporate the initial fastq file filtering steps from sRNAqc as an option

  • Work on the final summary report

  • Add coverage statistics and cross contamination flag logic to local db blast results

  • Incorporate VirusDetect in the pipeline and derive a summary of results from both pipelines

  • Automatically perform the 21-22 nt and 24 nt analyses by default