Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Table of Contents

Introduction

eresearchqut/VirReport is a bioinformatics pipeline based upon the scientific workflow manager Nextflow. It was designed to help phytosanitary diagnostics of viruses and viroid pathogens in quarantine facilities. It takes small RNA-Seq fastq files as input. These can either be in raw format (currently only samples specifically prepared with the QIAGEN QIAseq miRNA library kit can be processed this way) or quality-filtered.

...

https://carpentries-incubator.github.io/workflows-nextflow/

Pipeline summary

...

The VirReport workflow will perform the following steps by default:

...

  • Searches against local NCBI NT and NR databases:

    • Retain top 5 megablast hits and restrict results to virus and viroid matches. Summarise results by grouping all the de novo contigs matching to the same viral hit and deriving the cumulative blast coverage and percent ID for each viral hit (BLASTN_NT_CAP3)

    • Align reads to top hit, derive coverage statistics, consensus sequence and VCF matching to top blast hits (COVSTATS_NT)

    • Run blastx homolgy search on contigs >= 90 bp long for which no match was obtained in the megablast search. Summarise the blastx results and restrict to virus and viroid matches (BLASTX)

  • A quality filtering step on raw fastq files (currently the workflow only processes samples prepared using QIAGEN QIAseq miRNA library kit). After performing quality filtering (FASTQC_RAW, ADAPTER_AND_QUAL_TRIMMING, QC_POST_QUAL_TRIMMING, DERIVE_USABLE_READS). the pipeline will also derive a qc report (QCREPORT). An RNA souce profile can be included as part of this step (RNA_SOURCE_PROFILE, RNA_SOURCE_PROFILE_REPORT)

  • VirusDetect version 1.8 can also be run in parallel. A summary of the top virus/viroid blastn hits will be separately output (VIRUS_DETECT, VIRUS_IDENTIFY, VIRUS_DETECT_BLASTN_SUMMARY, VIRUS_DETECT_BLASTN_SUMMARY_FILTERED)

Pipeline prerequisites

Installing Java and Nextflow

You can follow the steps outlined in the Nexflow documentation to install Java and Nextflow on your local machine or server:https://www.nextflow.io/docs/latest/getstarted.html

This link specifically describes the steps to take to load Java and install Nextflow on our local HPC at QUT (Lyra): Nextflow

Installing a suitable environment management system

To run the VirReport pipeline, you will need to install a suitable environment management system such as Docker, Singularity or Conda to suit your environment.

We use conda or miniconda on our HPC.

Installing VirReport

The open-source VirReport code is available at https://github.com/eresearchqut/VirReport

...

Code Block
git clone https://github.com/eresearchqut/VirReport.git

Running the pipeline

You can either invoke the pipeline by pointing to the location of main.nf in the version of VirReport you cloned, for example:

Code Block
nextflow run ~/code/github/clean/VirReportpath_to_local_VirReport_copy/main.nf

or run directly eresearchqut/VirReport

...

Code Block
nextflow run eresearchqut/VirReport -profile {docker, singularity or conda}

On our HPC you can either , specify singularity or conda as the profile.

Depending on the one selected, cached Cached environment will be built in your home directory under either the cached singularity or conda directory. This step will take some time the first time you run the pipeline.

Testing the pipeline on minimal test dataset:

Running these test datasets requires 2 cpus and 8 Gb mem and should take less than 5 mins to complete.

...

This first command will test your installation using a single quality filtered fastq file (called test.fastq.gz) derived from a sample infected with citrus exocortis viroid and will run VirReport using a mock ncbi database (we recommend to use singularity on our local Lyra server at QUT):

Code Block
nextflow -c conf/test.config run eresearchqut/VirReport -profile test,{docker, singularity or conda}

...

 -latest -r main

...

This second command will test your installation using a pair of raw fastq files (called test_pair_1.fastq.gz and test_pair_2.fastq.gz) derived from a sample infected with citrus tristeza virus and will run VirReport using a mock viral database (we recommend to use singularity on our local Lyra server at QUT):

Code Block
nextflow -c conf/test2.config run eresearchqut/VirReport -profile test2,{docker, singularity or conda} -latest -r main

...

If both of these tests finish successfully, this means that the pipeline was set up properly.

...

You are now all set to analyse your own samples.

Running the pipeline with your own data

Provide an index.csv file

Create a TAB delimited text file that will be the input for the workflow to run. By default the pipeline will look for a file called “index.csv” in the base directory but you can specify any file name using the --indexfile [filename] in the nextflow run command. This text file requires the following columns (which needs to be included as a header): sampleid,samplepath

...

Code Block
params {
  contamination_detection = true
}

Provide a database
  • By default, the pipeline is set to run homology blast searches against a local plant virus/viroid database (this is set in the nextflow.config file with parameter --virreport_viral_db = true. You will need to provide this database to run the pipeline. You can either provide your own or use a curated database provided at https://github.com/maelyg/PVirDB.git . Ensure you use NCBI BLAST+ makeblastdb to create the database. For instance, to set up this database, you would take the following steps:

    Code Block
    git clone https://github.com/maelyg/PVirDB.git
    cd PVirDB
    gunzip PVirDB_v1.fasta.gz
    makeblastdb -in PVirDB_v1.fasta -parse_seqids -dbtype nucl

    Then specify the full path to the database files including the prefix in the nextflow.config file. For example:

    Code Block
    params {
      blast_local_db_path = '/path_to_viral_DB/viral_DB_name'
    }

...

  • If you also want to run homology searches against public NCBI databases, you need to set the parameter virreport_ncbi in the nextflow.config file to true:

    Code Block
    params {
      virreport_ncbi = true
    }

    or add it in your nextflow command:

    Code Block
    nextflow run eresearchqut/VirReport -profile {docker, singularity or conda} --virreport_ncbi

    Download these locally, following the detailed steps available at https://www.ncbi.nlm.nih.gov/books/NBK569850/ . Create a folder where you will store your NCBI databases. It is good practice to include the date of download. For instance:

    Code Block
    mkdir blastDB/30112021

    You will need to use the update_blastdb.pl script from the blast+ version used with the pipeline.
    For example:

    Code Block
    perl update_blastdb.pl --decompress nt [*]
    perl update_blastdb.pl --decompress nr [*]
    perl update_blastdb.pl taxdb
    tar -xzf taxdb.tar.gz

    Make sure the taxdb.btd and the taxdb.bti files are present in the same directory as your blast databases.
    Specify the path of your local NCBI blast nt and nr directories in the nextflow.config file.
    For instance:

    Code Block
    params {
      blast_db_dir = '/work/hia_mt18005_db/blastDB/20220408'
    }
Fastq files

You can either provide raw or pre-filtered fastq files to the pipeline.

...

Code Block
sampleid,samplepath
CT103,/path_to_fastq_files_directory/CT_103_S10_L001_R1_001.fastq.gz
CT103,/path_to_fastq_files_directory/CT_103_S10_L002_R1_001.fastq.gz
Deduplicate reads using unique molecular identifiers and mapping coordinates

If you used the QIAGEN QIAseq miRNA library kit for cDNA library preparation, your reads will include unique molecular identifiers (UMIs) at their 5' end. These are automatically extracted from reads and incorporated at the end of the read name when the pipeline is run using the --qualityfilter parameter.

If you want to deduplicate reads from the BAM files that are derived at the alignment step, you will need to specify the --dedup parameter. UMI extraction and deduplication are performed with the tool umitools.

Summarising detections and flagging potential contamination events

...

Running homology searches against viral database PVirDB
Code Block
nextflow run eresearchqut/VirReport -profile singularity -resume --indexfile index.csv \
                                    --merge_lane --qualityfilter --rna_source_profile \
                                    --bowtie_db_dir /path_to_bowtie_indices \
                                    --virreport_viral_db --blast_viral_db_path /path_to_local_viral_database
Running homology searches against NCBI NT/NR database
Code Block
nextflow run eresearchqut/VirReport -profile singularity -resume --indexfile index.csv \
                                    --merge_lane --qualityfilter --rna_source_profile \
                                    --bowtie_db_dir /path_to_bowtie_indices \
                                    --virreport_ncbi --blast_viral_db_path /path_to_ncbi_databases
Deduplicate reads using unique molecular identifiers and mapping coordinates

If you used the QIAGEN QIAseq miRNA library kit for cDNA library preparation, your reads will include unique molecular identifiers (UMIs) at their 5' end. These are automatically extracted from reads and incorporated at the end of the read name when the pipeline is run using the --qualityfilter parameter.

If you want to deduplicate reads from the BAM files that are derived at the alignment step, you will need to specify the --dedup parameter. UMI extraction and deduplication are performed with the tool umitools.

Summarising detections and flagging potential contamination events

If you want to derive a summary of detections for all the samples included in the index file, specify the --contamination_detection_viral_db or the --contamination_detection_ncbi option. This will create a summary text file under the Summary tab with a column called contamination_flag

Running the Velvet pipeline

VirusDetect version 1.8 can also be run in parallel.

See http://virusdetect.feilab.net/cgi-bin/virusdetect/index.cgi for details about this separate pipeline.

Example of PBS script to run on an HPC with torque batch system

Make sure to either specify the full path to your index.csv file in the PBS script or place a copy of the index.csv file in the folder you will run the PBS script in.

The PBS script example below (VirReport_nextflow.sh) will run on raw fastq files that will need to be merged and then quality filtered.

We are asking to derive an RNA source profile for the samples during the quality filtering step.

Homology searches will be run against NCBI and the PVirDB. The pipeline will also run VirusDetect in parallel.

Finally we will want the reads to be de-duplicated after mapping.

Code Block
#!/bin/bash -l
#PBS -N VirReport
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=05:00:00


cd $PBS_O_WORKDIR
module load java
NXF_OPTS='-Xms1g -Xmx4g'

nextflow run eresearchqut/VirReport -profile singularity -resume --indexfile index.csv \
                                    --merge_lane --qualityfilter --rna_source_profile \
                                    --bowtie_db_dir /path_to_bowtie_indices \
                                    --dedup \
                                    --virreport_ncbi --blast_viral_db_path /path_to_ncbi_databases --contamination_detection \
                                    --virreport_viral_db --blast_viral_db_path /path_to_local_viral_database --contamination_detection_viral_db \
                                    --virusdetect --virusdetect_db_path /path_to_virusdetect_database

Submit your job using the qsub command:

Code Block
qsub VirReport_nextflow.sh

Outputs

Nextflow folder structure:

-Work folder where the intermediate results are kept so you can easily resume execution from the last successfully executed step. This can be deleted once analysis is finalised

-Results folder where all the generated data files to be kept are saved

VirReport results folder structure

Under the Results folder, the folders are structured as follows:

Code Block
results/
├── 00_quality_filtering
│   └── sample_name
│   │   ├── sample_name_18-25nt_cutadapt.log
│   │   ├── sample_name_fastqc.html
│   │   ├── sample_name_fastqc.zip
│   │   ├── sample_name_21-22nt_cutadapt.log
│   │   ├── sample_name_21-22nt.fastq.gz
│   │   ├── sample_name_24nt_cutadapt.log
│   │   ├── sample_name_blacklist_filter.log
│   │   ├── sample_name_fastp.html
│   │   ├── sample_name_fastp.json
│   │   ├── sample_name_qual_filtering_cutadapt.log
│   │   ├── sample_name_quality_trimmed_fastqc.html
│   │   ├── sample_name_quality_trimmed_fastqc.zip
│   │   ├── sample_name_quality_trimmed.fastq.gz
│   │   ├── sample_name_read_length_dist.pdf
│   │   ├── sample_name_read_length_dist.txt
│   │   ├── sample_name_truseq_adapter_cutadapt.log
│   │   └── sample_name_umi_tools.log
│   └── qc_report
│       ├── read_origin_counts.txt
│       ├── read_origin_detailed_pc.txt
│       ├── read_origin_pc_summary.txt
│       ├── read_origin_pc_summary.txt
│       ├── run_qc_report.txt
│       └── run_read_size_distribution.pdf
├── 01_VirReport
│   └── sample_name
│   │   └── alignments
│   │   │   └──NT
│   │   │   │   ├── sample_name_21-22nt_all_targets_with_scores.txt
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_bowtie_log.txt
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.consensus.fasta
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.dedup.bam
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.dedup.bam.bai
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.fa
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.fa.fai
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm.bcf
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm.bcf.csi
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm_flt_indels.bcf
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm_flt_indels.bcf.csi
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_picard_metrics.txt
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_sequence_variants.vcf.gz
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_sequence_variants.vcf.gz.csi
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_umi_tools.log
│   │   │   │   ├── sample_name_21-22nt_top_scoring_targets.txt
│   │   │   │   └── sample_name_21-22nt_top_scoring_targets_with_cov_stats.txt
│   │   │   └──viral_db
│   │   │   │   ├── sample_name_21-22nt_all_targets_with_scores.txt
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_bowtie_log.txt
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.consensus.fasta
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.dedup.bam
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.dedup.bam.bai
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.fa
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.fa.fai
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm.bcf
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm.bcf.csi
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm_flt_indels.bcf
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm_flt_indels.bcf.csi
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_picard_metrics.txt
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_sequence_variants.vcf.gz
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_sequence_variants.vcf.gz.csi
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_umi_tools.log
│   │   │   │   └── sample_name_21-22nt_top_scoring_targets_with_cov_stats_viraldb.txt
│   │   └── assembly
│   │   │   ├── sample_name_cap3_21-22nt.fasta
│   │   │   ├── sample_name_spades_assembly_21-22nt.fasta
│   │   │   ├── sample_name_spades_log
│   │   │   ├── sample_name_velvet_assembly_21-22nt.fasta
│   │   │   └── sample_name_velvet_log
│   │   └──blastn
│   │   │   └── NT
│   │   │   │   ├── sample_name_cap3_21-22nt_blastn_vs_NT.bls
│   │   │   │   ├── sample_name_cap3_21-22nt_blastn_vs_NT_top5Hits.txt
│   │   │   │   ├── sample_name_cap3_21-22nt_blastn_vs_NT_top5Hits_virus_viroids_final.txt
│   │   │   │   ├── sample_name_cap3_21-22nt_blastn_vs_NT_top5Hits_virus_viroids_seq_ids_taxonomy.txt
│   │   │   │   └── summary_sample_name_cap3_21-22nt_blastn_vs_NT_top5Hits_virus_viroids_final.txt
│   │   │   └── viral_db
│   │   │       ├── sample_name_cap3_21-22nt_blastn_vs_viral_db.bls
│   │   │       ├── sample_name_cap3_21-22nt_megablast_vs_viral_db.bls
│   │   │       ├── summary_sample_name_cap3_21-22nt_blastn_vs_viral_db.bls_filtered.txt
│   │   │       ├── summary_sample_name_cap3_21-22nt_blastn_vs_viral_db.bls_viruses_viroids_ICTV.txt
│   │   │       ├── summary_sample_name_cap3_21-22nt_megablast_vs_viral_db.bls_filtered.txt
│   │   │       └── summary_sample_name_cap3_21-22nt_megablast_vs_viral_db.bls_viruses_viroids_ICTV.txt
│   │   ├── blastx
│   │   │   └── NT
│   │   │       ├── sample_name_cap3_21-22nt_blastx_vs_NT.bls
│   │   │       ├── sample_name_cap3_21-22nt_blastx_vs_NT_top5Hits.txt
│   │   │       ├── sample_name_cap3_21-22nt_blastx_vs_NT_top5Hits_virus_viroids_final.txt
│   │   │       └── summary_sample_name_cap3_21-22nt_blastx_vs_NT_top5Hits_virus_viroids_final.txt
│   │   └── tblastn
│   │       └── viral_db
│   │           ├── sample_name_cap3_21-22nt_getorf.all.fasta
│   │           ├── sample_name_cap3_21-22nt_getorf.all_tblastn_vs_viral_db_out.bls
│   │           └── sample_name_cap3_21-22nt_getorf.all_tblastn_vs_viral_db_top5Hits_virus_viroids_final.txt
│   └── Summary
│       ├── run_top_scoring_targets_with_cov_stats_with_cont_flag_FPKM_0.01_21-22nt.txt
│       └── run_top_scoring_targets_with_cov_stats_with_cont_flag_FPKM_0.01_21-22nt_viral_db.txt
└── 02_VirusDetect
    └── sample_name
    │   ├── blastn.reference.fa
    │   ├── blastn_references
    │   ├── blastx.reference.fa
    │   ├── blastx_references
    │   ├── contig_sequences.blastn.fa
    │   ├── contig_sequences.blastx.fa
    │   ├── contig_sequences.fa
    │   ├── contig_sequences.undetermined.fa
    │   ├── sample_name_21-22nt.blastn.html
    │   ├── sample_name_21-22nt.blastn.sam
    │   ├── sample_name_21-22nt.blastn_spp.txt
    │   ├── sample_name_21-22nt.blastn.summary.filtered.txt
    │   ├── sample_name_21-22nt.blastn.summary.txt
    │   ├── sample_name_21-22nt.blastn.txt
    │   ├── sample_name_21-22nt.blastx.html
    │   ├── sample_name_21-22nt.blastx.sam
    │   ├── sample_name_21-22nt.blastx.summary.txt
    │   └── sample_name_21-22nt.blastx.txt
    └── Summary
        ├── run_summary_top_scoring_targets_virusdetect_21-22nt_filtered.txt
        └── run_summary_virusdetect_21-22nt.txt
-Under the 00_quality_filtering folder:

◦ a folder is created for each sample which contains zipped quality filtered fastq files, associated QC files and logs

◦ under the QC_report folder, read size distribution pdf file and read RNA source pdf file are created. The folder also includes a run_qc_report text file

...

Image AddedImage Added

Image Added

-Under the 01_VirReport folder:

For each sample:

  • assembly: results associated with de novo assembly

  • blastn: megablast results (NCBI NT or viral database PVirDB)

  • blastx: blastx results against NR

  • tblastn: tblastn results against viral database PVirDB

  • alignments: alignment against top reference hit and associated statistic derivation

  • Summary

...

Definitions of terms used in summary report: 

  • sacc  Accession number of best homology match recovered

  • av-pident  Average per cent identity of all de novo assembled contigs to the same top reference hit

  • Mean read depth  The mean coverage in bases to the genome/sequence of the best homology match

  • Dedup read count  Read counts after PCR duplicates sharing UMIs are collapsed

  • Dup %  Duplication rate detected using UMIs

  • FPKM:  Fragments Per Kilobase of transcript, per Million mapped reads is a normalised unit of

  •   transcript expression. It scales by transcript length to compensate for the fact that most

  •   RNA-seq protocols will generate more sequencing reads from longer RNA molecules

  •   [deduplicated read count x 10^3 x 10^6]/[total quality filtered reads x genome length]

  • % bases 5X  The fraction of bases that attained at least 5X sequence coverage

  • % bases 10X  The fraction of bases that attained at least 10X sequence coverage

  • Contamination flag  Assumption: If a pest is present at high titer in a given sample and detection of reads matching to this pathogen in other samples occur at a significantly lower abundance, there is a risk that this lower signal is due to contamination (e.g. index hopping from high-titer sample). We first calculate the maximum FPKM value recorded for each virus and viroid identified on a run. If for a given virus, the FPKM value reported for a sample represented less than a percentage of this maximum FPKM value, it is then flagged as a contamination event. We apply 0.1% threshold value as default. This is just indicative and method cannot discriminate between false positives and viruses present at very low titer in a plant. It is then recommended to compare the sequences obtained, check the SNPs and validate through independent method