...

eresearchqut/VirReport is a bioinformatics pipeline based on the scientific workflow manager Nextflow. It has been designed to support phytosanitary diagnostics of virus and viroid pathogens in quarantine facilities. It takes small RNA-Seq fastq files as input. These can be either raw (currently only samples specifically prepared with the QIAGEN QIAseq miRNA library kit can be processed this way) or quality-filtered.

For target identification, VirReport uses a hybrid de novo assembly approach to build contigs that are then annotated using blast homology searches against a virus database and/or a local copy of the NCBI nr and nt databases.

...

  • Retain reads of a given length (21-22 nt long by default) from fastq file(s) provided in the index.csv file (READPROCESSING)

  • De novo assembly using both Velvet and SPAdes. The contigs obtained are collapsed into scaffolds using cap3. By default, only contigs > 40 bp will be retained (DENOVO_ASSEMBLY)

  • Run megablast homology search against either a local virus database or NCBI NT/NR databases:

  • Searches against a local virus database:

    • Run megablast homology searches on the de novo assembly against a local virus and viroid database. Homology searches using the blastn algorithm are also run in parallel for comparison with the megablast algorithm (BLASTN_VIRAL_DB_CAP3)

    • Retain top megablast hit and restrict results to virus and viroid matches. Summarise results by grouping all the de novo contigs matching the same viral hit and deriving the cumulative blast coverage and percent identity for each viral hit (FILTER_BLASTN_VIRAL_DB_CAP3)

    • Align reads to top hit, derive coverage statistics, consensus sequence and VCF matching to top blast hit (FILTER_BLAST_NT_VIRAL_DB_CAP3, COVSTATS_VIRAL_DB)

    • Run tblastn homology search on predicted ORFs >= 90 bp derived using getORF (TBLASTN_VIRAL_DB)
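As a rough illustration of the read retention step (READPROCESSING) listed above, the sketch below keeps only FASTQ records whose read length falls within a given range. It is not the pipeline's own implementation: the function name `filter_fastq_by_length` is hypothetical, and it assumes uncompressed, 4-line FASTQ records on standard input.

```shell
#!/bin/sh
# Hypothetical helper (not part of VirReport): keep FASTQ records whose
# read length is within [min, max].
# Assumes 4-line FASTQ records (no wrapped sequences) on standard input.
filter_fastq_by_length() {
    awk -v min="$1" -v max="$2" '
        NR % 4 == 1 { header = $0 }   # record header line (@...)
        NR % 4 == 2 { seq = $0 }      # sequence line
        NR % 4 == 3 { plus = $0 }     # separator line (+)
        NR % 4 == 0 {                 # quality line: record is complete
            n = length(seq)
            if (n >= min && n <= max) {
                print header
                print seq
                print plus
                print $0
            }
        }
    '
}
```

For a gzipped input, this could be used as `gunzip -c sample.fastq.gz | filter_fastq_by_length 21 22 | gzip > sample_21-22nt.fastq.gz`, where the file names are placeholders.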

...

  • Searches against local NCBI NT and NR databases:

    • Retain top 5 megablast hits and restrict results to virus and viroid matches. Summarise results by grouping all the de novo contigs matching to the same viral hit and deriving the cumulative blast coverage and percent ID for each viral hit (BLASTN_NT_CAP3)

    • Align reads to top hit, derive coverage statistics, consensus sequence and VCF matching to top blast hits (COVSTATS_NT)

    • Run blastx homology search on contigs >= 90 bp long for which no match was obtained in the megablast search. Summarise the blastx results and restrict to virus and viroid matches (BLASTX)

  • The pipeline can perform additional optional steps, which include:

  • A quality filtering step on raw fastq files (currently the workflow only processes samples prepared using the QIAGEN QIAseq miRNA library kit). After performing quality filtering (FASTQC_RAW, ADAPTER_AND_QUAL_TRIMMING, QUAL_TRIMING_AND_QC, DERIVE_USABLE_READS), the pipeline will also derive a QC report (QCREPORT). An RNA source profile can also be included as part of the quality filtering step (RNA_SOURCE_PROFILE, RNA_SOURCE_PROFILE_REPORT)

  • VirusDetect version 1.8 can also be run in parallel. A summary of the top virus/viroid blastn hits will be separately output (VIRUS_DETECT, VIRUS_IDENTIFY, VIRUS_DETECT_BLASTN_SUMMARY, VIRUS_DETECT_BLASTN_SUMMARY_FILTERED)
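The summarising steps in the list above (keeping the top hit per contig, then grouping contigs by viral hit) can be sketched with awk on BLAST tabular output. This is only an illustration under stated assumptions, not the pipeline's code: the function name `summarise_top_hits` is hypothetical, the input is assumed to be tab-separated `-outfmt 6` output with the default column order (qseqid, sseqid, pident, length, ...), and the first row for each contig is assumed to be its best hit. The "cumulative" value here is a plain sum of alignment lengths that does not correct for overlapping contigs, and the mean identity is unweighted, so VirReport's own figures may be derived differently.

```shell
#!/bin/sh
# Hypothetical helper (not part of VirReport): summarise BLAST tabular output.
# Input : tab-separated -outfmt 6 rows with default columns on standard input.
# Output: hit <TAB> number of contigs <TAB> summed alignment length <TAB> mean % identity
summarise_top_hits() {
    awk -F '\t' '
        !($1 in seen) {              # first row per contig is taken as its top hit
            seen[$1] = 1
            contigs[$2]++            # contigs assigned to this hit
            aln_len[$2] += $4        # summed alignment length (overlaps not removed)
            pident_sum[$2] += $3     # used for an unweighted mean identity
        }
        END {
            for (hit in contigs)
                printf "%s\t%d\t%d\t%.2f\n", hit, contigs[hit], aln_len[hit], pident_sum[hit] / contigs[hit]
        }
    ' | sort
}
```

A call such as `summarise_top_hits < contigs_vs_db.tsv` (placeholder file name) prints one line per hit, sorted by hit name.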

...

To run the VirReport pipeline, you will need to install a suitable environment management system such as Docker, Singularity or Conda to suit your environment.

We recommend using Singularity.

3C. Installing VirReport

...

Code Block
nextflow run eresearchqut/VirReport -profile {docker, singularity or conda}

We recommend using Singularity as the profile.

The cached environment will be built in your home directory under the cached singularity directory. This step will take some time the first time you run the pipeline.

...

  • By default, the pipeline is set to run homology blast searches against a local plant virus/viroid database (this is set in the nextflow.config file with the parameter virreport_viral_db = true). You will need to provide this database to run the pipeline. A curated database is provided at https://github.com/maelyg/PVirDB.git. Ensure you use NCBI BLAST+ makeblastdb to create the database. For instance, to set up this database, you would take the following steps:

    Code Block
    git clone https://github.com/maelyg/PVirDB.git
    cd PVirDB
    gunzip PVirDB_v1.fasta.gz
    makeblastdb -in PVirDB_v1.fasta -parse_seqids -dbtype nucl

    Then specify the full path to the database files including the prefix in the nextflow.config file. For example:

    Code Block
    params {
      blast_local_db_path = '/path_to_viral_DB/viral_DB_name'
    }
  • If you also want to run homology searches against public NCBI databases, you need to set the parameter virreport_ncbi in the nextflow.config file to true:

    Code Block
    params {
      virreport_ncbi = true
    }

    or add it in your nextflow command:

    Code Block
    nextflow run eresearchqut/VirReport -profile {docker, singularity or conda} --virreport_ncbi

    Download these locally, following the detailed steps available at https://www.ncbi.nlm.nih.gov/books/NBK569850/ . Create a folder where you will store your NCBI databases. It is good practice to include the date of download. For instance:

    Code Block
    mkdir -p blastDB/30112021

    You will need to use the update_blastdb.pl script from the blast+ version used with the pipeline.
    For example:

    Code Block
    perl update_blastdb.pl --decompress nt
    perl update_blastdb.pl --decompress nr
    perl update_blastdb.pl taxdb
    tar -xzf taxdb.tar.gz

    Make sure the taxdb.btd and the taxdb.bti files are present in the same directory as your blast databases.
    Specify the path of your local NCBI blast nt and nr directories in the nextflow.config file.
    For instance:

    Code Block
    params {
      blast_db_dir = '/work/hia_mt18005_db/blastDB/20220408'
    }
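Before launching the pipeline, it can be worth checking that the database files described above are where the configuration says they are. The sketch below is one way to do that; the function names `has_nucl_db` and `has_taxdb` are hypothetical and not part of VirReport. It relies on nucleotide BLAST databases having a `.nsq` file (or `.00.nsq` for the first volume of a multi-volume database such as nt) and on the taxdb.btd and taxdb.bti files mentioned above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of VirReport) for sanity-checking BLAST
# databases before a run.

# Succeeds if a nucleotide BLAST database with the given prefix appears to
# exist, either single-volume (prefix.nsq) or multi-volume (prefix.00.nsq).
has_nucl_db() {
    [ -f "$1.nsq" ] || [ -f "$1.00.nsq" ]
}

# Succeeds if both taxonomy files are present in the given database directory.
has_taxdb() {
    [ -f "$1/taxdb.btd" ] && [ -f "$1/taxdb.bti" ]
}
```

For example, `has_nucl_db /path_to_viral_DB/viral_DB_name || echo "viral database not found"` and `has_taxdb /path_to_ncbi_databases || echo "taxdb files missing"`, using the same placeholder paths as in the configuration examples.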

...

  • By default the pipeline expects a single quality-filtered fastq file per sample.

  • If you want to provide raw fastq files, samples have to be specifically prepared with the QIAGEN QIAseq miRNA library kit. To run the initial quality filtering step on your raw fastq files, set the --qualityfilter parameter to true in the nextflow.config file and specify the path to the directory that holds the required bowtie indices (using the --bowtie_db_dir parameter). These indices are used to: 1) filter non-informative reads (using the blacklist bowtie indices for the DERIVE_USABLE_READS process) and 2) optionally derive the origin of the filtered reads obtained (RNA_SOURCE_PROFILE process).

    The required fasta files are available at https://github.com/maelyg/bowtie_indices.git and bowtie indices can be built from these using the command:

    Code Block
    git clone https://github.com/maelyg/bowtie_indices.git
    #move into the cloned repository (assumes the fasta files sit at its top level)
    cd bowtie_indices
    gunzip blacklist_v2.fasta.gz
    #you might need to activate your environment cached in either your conda or singularity environment in order to run bowtie
    #for example
    conda activate /path_to_cached_environment/virreport-77d02f3abe1d8ba5f8dfdff194142de9
    #then run the bowtie command
    bowtie-build -f blacklist_v2.fasta blacklist

    The directory in which the bowtie indices are located will need to be specified in the nextflow.config file:

    Code Block
    params {
      bowtie_db_dir = '/path_to_bowtie_idx_directory'
    }

    If you want to derive an RNA source profile of your fastq files, you will need to specify:

    Code Block
    params {
      rna_source_profile = true
    }

    You will also need to build the other indices from the fasta files included in https://github.com/maelyg/bowtie_indices.git (i.e. rRNA, plant_tRNA, plant_noncoding, plant_pt_mt_other_genes, artefacts, plant_miRNA, virus).
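Building the remaining indices one by one is repetitive, so a small loop can help. The sketch below only prints the bowtie-build commands so they can be reviewed before being run. The function name `print_bowtie_build_cmds` is hypothetical, and the `<index>.fasta` file naming is an assumption that may not match the files in the repository (they may be gzipped or versioned, as blacklist_v2.fasta.gz is), so adjust the names to what you actually find there.

```shell
#!/bin/sh
# Hypothetical helper (not part of VirReport): print, rather than run, one
# bowtie-build command per index needed for the RNA source profile.
# Assumes each fasta file is named <index>.fasta, which may not match the
# actual file names in the bowtie_indices repository.
print_bowtie_build_cmds() {
    for name in rRNA plant_tRNA plant_noncoding plant_pt_mt_other_genes artefacts plant_miRNA virus; do
        printf 'bowtie-build -f %s.fasta %s\n' "$name" "$name"
    done
}
```

Once the printed commands look right, they can be executed with `print_bowtie_build_cmds | sh` from the directory holding the fasta files, with the environment providing bowtie activated.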

    The quality filtering step will create the 00_quality_filtering folder under the results folder:

    Code Block
    results/
    ├── 00_quality_filtering
        ├── sample_name
        │   ├── sample_name_18-25nt_cutadapt.log
        │   ├── sample_name_fastqc.html
        │   ├── sample_name_fastqc.zip
        │   ├── sample_name_21-22nt_cutadapt.log
        │   ├── sample_name_21-22nt.fastq.gz
        │   ├── sample_name_24nt_cutadapt.log
        │   ├── sample_name_blacklist_filter.log
        │   ├── sample_name_fastp.html
        │   ├── sample_name_fastp.json
        │   ├── sample_name_qual_filtering_cutadapt.log
        │   ├── sample_name_quality_trimmed_fastqc.html
        │   ├── sample_name_quality_trimmed_fastqc.zip
        │   ├── sample_name_quality_trimmed.fastq.gz
        │   ├── sample_name_read_length_dist.pdf
        │   ├── sample_name_read_length_dist.txt
        │   ├── sample_name_truseq_adapter_cutadapt.log
        │   └── sample_name_umi_tools.log
        └── qc_report
            ├── read_origin_counts.txt
            ├── read_origin_detailed_pc.txt
            ├── read_origin_pc_summary.txt
            ├── run_qc_report.txt
            └── run_read_size_distribution.pdf

    If your sequencing run was split across multiple lanes, you might have several raw fastq files per sample. You can feed these directly to the pipeline and specify the --merge_lane parameter. The fastq files will be collapsed into one fastq file before performing downstream analysis. The sample name used will be the sampleid provided in the index.csv file. In the example below 2 fastq files were generated for 1 sample named CT103:

...
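For readers who want to see what collapsing lanes amounts to, the sketch below concatenates per-lane gzipped fastq files into a single file. It is an illustration only, not what the pipeline runs internally: the function name `merge_lanes` and the file names in the usage line are hypothetical. It works because a concatenation of gzip files is itself a valid gzip file.

```shell
#!/bin/sh
# Hypothetical helper (not part of VirReport): concatenate per-lane gzipped
# fastq files into one gzipped fastq file.
# Usage: merge_lanes OUTPUT.fastq.gz LANE1.fastq.gz LANE2.fastq.gz ...
merge_lanes() {
    merged=$1
    shift
    # gzip streams can be concatenated directly without decompressing
    cat "$@" > "$merged"
}
```

For instance, `merge_lanes CT103.fastq.gz CT103_L001.fastq.gz CT103_L002.fastq.gz`, where the lane file names are placeholders.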

Code Block
nextflow run eresearchqut/VirReport -profile singularity -resume --indexfile index.csv \
                                    --merge_lane --qualityfilter --rna_source_profile \
                                    --bowtie_db_dir /path_to_bowtie_indices \
                                    --virreport_ncbi --blast_db_dir /path_to_ncbi_databases

Deduplicate reads using unique molecular identifiers and mapping coordinates

...

If you want to derive a summary of detections for all the samples included in the index file, specify the --detection_reporting_viral_db or the --detection_reporting_nt option. This will create a summary text file under the Summary tab with a column called contamination_flag.

...

If you run VirReport in diagnostics mode (--diagno), the pipeline will also add an evidence category (i.e. KNOWN, KNOWN_FRAGMENT or CANDIDATE_NOVEL) to each detection based on av-pident and % bases 10X.

...

Finally, if sample information is provided (--sampleinfo --sampleinfo_path /path/to/sampleinfo.txt and --samplesheet_path /path/to/Sample_Sheet.csv), this will be added to the final summary.

...

Code Block
#!/bin/bash -l
#PBS -N VirReport
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=05:00:00


cd $PBS_O_WORKDIR
module load java
# export the variable so that the nextflow process launched below can see it
export NXF_OPTS='-Xms1g -Xmx4g'

nextflow run eresearchqut/VirReport -profile singularity -resume --indexfile index.csv \
                                    --merge_lane --qualityfilter --rna_source_profile \
                                    --bowtie_db_dir /path_to_bowtie_indices \
                                    --dedup \
                                    --virreport_viral_db --blast_viral_db_path /path_to_local_viral_database \
                                    --detection_reporting_viral_db

Example 2:

In the PBS job below, blastn (using the megablast algorithm) and blastx homology searches will be run against NCBI NT and NR respectively.

Code Block
#!/bin/bash -l
#PBS -N VirReport
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=05:00:00


cd $PBS_O_WORKDIR
module load java
# export the variable so that the nextflow process launched below can see it
export NXF_OPTS='-Xms1g -Xmx4g'

nextflow run eresearchqut/VirReport -profile singularity -resume --indexfile index.csv \
                                    --merge_lane --qualityfilter --rna_source_profile \
                                    --bowtie_db_dir /path_to_bowtie_indices \
                                    --dedup \
                                    --virreport_ncbi --blast_db_dir /path_to_ncbi_databases \
                                    --detection_reporting_nt

Example 3:

In the PBS job below, homology searches will be run against NCBI and the PVirDB. The pipeline will also run VirusDetect in parallel.

Code Block
#!/bin/bash -l
#PBS -N VirReport
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=05:00:00


cd $PBS_O_WORKDIR
module load java
# export the variable so that the nextflow process launched below can see it
export NXF_OPTS='-Xms1g -Xmx4g'

nextflow run eresearchqut/VirReport -profile singularity -resume --indexfile index.csv \
                                    --merge_lane --qualityfilter --rna_source_profile \
                                    --bowtie_db_dir /path_to_bowtie_indices \
                                    --dedup \
                                    --virreport_ncbi --blast_db_dir /path_to_ncbi_databases --detection_reporting_nt \
                                    --virreport_viral_db --blast_viral_db_path /path_to_local_viral_database --detection_reporting_viral_db \
                                    --virusdetect --virusdetect_db_path

...