
1. Introduction

eresearchqut/VirReport is a bioinformatics pipeline built on the scientific workflow manager Nextflow. It was designed to support phytosanitary diagnostics of virus and viroid pathogens in quarantine facilities. It takes small RNA-Seq fastq files as input. These can be either raw (currently only samples prepared with the QIAGEN QIAseq miRNA library kit can be processed this way) or already quality-filtered.

...

https://carpentries-incubator.github.io/workflows-nextflow/

2. Pipeline summary

...

The VirReport workflow will perform the following steps by default:

...

  • Searches against local NCBI NT and NR databases:

    • Retain top 5 megablast hits and restrict results to virus and viroid matches. Summarise results by grouping all the de novo contigs matching to the same viral hit and deriving the cumulative blast coverage and percent ID for each viral hit (BLASTN_NT_CAP3)

    • Align reads to top hit, derive coverage statistics, consensus sequence and VCF matching to top blast hits (COVSTATS_NT)

    • Run a blastx homology search on contigs >= 90 bp long for which no match was obtained in the megablast search. Summarise the blastx results and restrict them to virus and viroid matches (BLASTX)

  • A quality filtering step on raw fastq files (currently the workflow only processes samples prepared using the QIAGEN QIAseq miRNA library kit). After performing quality filtering (FASTQC_RAW, ADAPTER_AND_QUAL_TRIMMING, QC_POST_QUAL_TRIMMING, DERIVE_USABLE_READS), the pipeline will also derive a QC report (QCREPORT). An RNA source profile can be included as part of this step (RNA_SOURCE_PROFILE, RNA_SOURCE_PROFILE_REPORT)

  • VirusDetect version 1.8 can also be run in parallel. A summary of the top virus/viroid blastn hits will be separately output (VIRUS_DETECT, VIRUS_IDENTIFY, VIRUS_DETECT_BLASTN_SUMMARY, VIRUS_DETECT_BLASTN_SUMMARY_FILTERED)
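
The 90 bp length cut-off applied before the blastx search can be illustrated with a short shell sketch. This is not the code VirReport itself runs; `filter_contigs_by_length` is a hypothetical helper that only shows what selecting contigs of at least a given length from a FASTA file could look like.

```shell
#!/bin/sh
# filter_contigs_by_length is a hypothetical helper for illustration only;
# it is not part of VirReport.
# Usage: filter_contigs_by_length MIN_LENGTH < contigs.fasta
# Prints every FASTA record whose sequence is at least MIN_LENGTH bases long.
# Multi-line sequences are joined and written back on a single line.
filter_contigs_by_length() {
    awk -v min="$1" '
        # Print the record collected so far if it is long enough.
        function emit() {
            if (header != "" && length(seq) >= min + 0) {
                print header
                print seq
            }
        }
        # A header line closes the previous record and starts a new one.
        /^>/ { emit(); header = $0; seq = ""; next }
        # Any other line is sequence; append it to the current record.
        { seq = seq $0 }
        END { emit() }
    '
}
```

For example, `filter_contigs_by_length 90 < contigs.fasta > contigs_min90.fasta` would keep only contigs of 90 bp or more (both file names are placeholders).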

3. Pipeline prerequisites

3A. Installing Java and Nextflow

You can follow the steps outlined in the Nextflow documentation to install Java and Nextflow on your local machine or server: https://www.nextflow.io/docs/latest/getstarted.html

This link specifically describes the steps to take to load Java and install Nextflow on our local HPC at QUT (Lyra): Nextflow
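
After installation, it can help to confirm that both tools are visible on your PATH before going further. The sketch below assumes a POSIX shell; `check_tool` is a hypothetical helper and not part of Nextflow or VirReport.

```shell
#!/bin/sh
# check_tool is a hypothetical helper, not part of Nextflow or VirReport.
# It reports whether a command can be found on the current PATH and
# returns a non-zero status when it cannot.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
        return 1
    fi
}

# Check the two prerequisites; "|| true" keeps the script going so that
# both results are always printed.
check_tool java || true
check_tool nextflow || true
```

If both are found, `java -version` and `nextflow -version` print the installed versions.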

3B. Installing a suitable environment management system

To run the VirReport pipeline, you will need to install an environment management system such as Docker, Singularity or Conda, whichever suits your computing environment.

We use Conda or Miniconda on our HPC.

3C. Installing VirReport

The open-source VirReport code is available at https://github.com/eresearchqut/VirReport

...

Code Block
git clone https://github.com/eresearchqut/VirReport.git

4. Running the pipeline

You can either invoke the pipeline by pointing to the location of main.nf in the version of VirReport you cloned, for example:

...

A cached environment will be built in your home directory, under the cached singularity directory. Building it takes some time the first time you run the pipeline.

4A. Testing the pipeline on a minimal test dataset:

Running these test datasets requires 2 CPUs and 8 GB of memory and should take less than 5 minutes to complete.

...

You are now all set to analyse your own samples.

4B. Running the pipeline with your own data

Provide an index.csv file

...

Code Block
qsub VirReport_nextflow.sh

5. Outputs

5A. Nextflow folder structure:

-Work folder, where the intermediate results are kept so that you can easily resume execution from the last successfully executed step. This folder can be deleted once the analysis is finalised.

-Results folder, where all the generated data files to be kept are saved.
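
Because deleting the work folder removes the ability to resume a run, a cautious clean-up only deletes it once results are present. The sketch below assumes the two folders are named `work` and `results` inside a single run directory; `remove_work_dir` is a hypothetical helper, not part of VirReport or Nextflow.

```shell
#!/bin/sh
# remove_work_dir is a hypothetical helper, not part of VirReport or Nextflow.
# Assumption: RUN_DIR contains the folders "work" and "results".
# Usage: remove_work_dir RUN_DIR
# Deletes RUN_DIR/work only if RUN_DIR/results exists and is not empty.
remove_work_dir() {
    run_dir=$1
    if [ -d "$run_dir/results" ] && [ -n "$(ls -A "$run_dir/results")" ]; then
        rm -rf "$run_dir/work"
        echo "removed $run_dir/work"
    else
        echo "results missing or empty under $run_dir; keeping work folder" >&2
        return 1
    fi
}
```

Nextflow also provides a `nextflow clean` command for removing cached work files from earlier runs, which may be preferable to deleting the folder by hand.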

5B. VirReport results folder structure

Under the Results folder, the folders are structured as follows:

Code Block
results/
├── 00_quality_filtering
│   └── sample_name
│   │   ├── sample_name_18-25nt_cutadapt.log
│   │   ├── sample_name_fastqc.html
│   │   ├── sample_name_fastqc.zip
│   │   ├── sample_name_21-22nt_cutadapt.log
│   │   ├── sample_name_21-22nt.fastq.gz
│   │   ├── sample_name_24nt_cutadapt.log
│   │   ├── sample_name_blacklist_filter.log
│   │   ├── sample_name_fastp.html
│   │   ├── sample_name_fastp.json
│   │   ├── sample_name_qual_filtering_cutadapt.log
│   │   ├── sample_name_quality_trimmed_fastqc.html
│   │   ├── sample_name_quality_trimmed_fastqc.zip
│   │   ├── sample_name_quality_trimmed.fastq.gz
│   │   ├── sample_name_read_length_dist.pdf
│   │   ├── sample_name_read_length_dist.txt
│   │   ├── sample_name_truseq_adapter_cutadapt.log
│   │   └── sample_name_umi_tools.log
│   └── qc_report
│       ├── read_origin_counts.txt
│       ├── read_origin_detailed_pc.txt
│       ├── read_origin_pc_summary.txt
│       ├── run_qc_report.txt
│       └── run_read_size_distribution.pdf
├── 01_VirReport
│   └── sample_name
│   │   └── alignments
│   │   │   └── NT
│   │   │   │   ├── sample_name_21-22nt_all_targets_with_scores.txt
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_bowtie_log.txt
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.consensus.fasta
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.dedup.bam
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.dedup.bam.bai
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.fa
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.fa.fai
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm.bcf
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm.bcf.csi
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm_flt_indels.bcf
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm_flt_indels.bcf.csi
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_picard_metrics.txt
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_sequence_variants.vcf.gz
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_sequence_variants.vcf.gz.csi
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_umi_tools.log
│   │   │   │   ├── sample_name_21-22nt_top_scoring_targets.txt
│   │   │   │   └── sample_name_21-22nt_top_scoring_targets_with_cov_stats.txt
│   │   │   └── viral_db
│   │   │   │   ├── sample_name_21-22nt_all_targets_with_scores.txt
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_bowtie_log.txt
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.consensus.fasta
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.dedup.bam
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.dedup.bam.bai
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.fa
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.fa.fai
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm.bcf
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm.bcf.csi
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm_flt_indels.bcf
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm_flt_indels.bcf.csi
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_picard_metrics.txt
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_sequence_variants.vcf.gz
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_sequence_variants.vcf.gz.csi
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_umi_tools.log
│   │   │   │   └── sample_name_21-22nt_top_scoring_targets_with_cov_stats_viraldb.txt
│   │   └── assembly
│   │   │   ├── sample_name_cap3_21-22nt.fasta
│   │   │   ├── sample_name_spades_assembly_21-22nt.fasta
│   │   │   ├── sample_name_spades_log
│   │   │   ├── sample_name_velvet_assembly_21-22nt.fasta
│   │   │   └── sample_name_velvet_log
│   │   └── blastn
│   │   │   └── NT
│   │   │   │   ├── sample_name_cap3_21-22nt_blastn_vs_NT.bls
│   │   │   │   ├── sample_name_cap3_21-22nt_blastn_vs_NT_top5Hits.txt
│   │   │   │   ├── sample_name_cap3_21-22nt_blastn_vs_NT_top5Hits_virus_viroids_final.txt
│   │   │   │   ├── sample_name_cap3_21-22nt_blastn_vs_NT_top5Hits_virus_viroids_seq_ids_taxonomy.txt
│   │   │   │   └── summary_sample_name_cap3_21-22nt_blastn_vs_NT_top5Hits_virus_viroids_final.txt
│   │   │   └── viral_db
│   │   │       ├── sample_name_cap3_21-22nt_blastn_vs_viral_db.bls
│   │   │       ├── sample_name_cap3_21-22nt_megablast_vs_viral_db.bls
│   │   │       ├── summary_sample_name_cap3_21-22nt_blastn_vs_viral_db.bls_filtered.txt
│   │   │       ├── summary_sample_name_cap3_21-22nt_blastn_vs_viral_db.bls_viruses_viroids_ICTV.txt
│   │   │       ├── summary_sample_name_cap3_21-22nt_megablast_vs_viral_db.bls_filtered.txt
│   │   │       └── summary_sample_name_cap3_21-22nt_megablast_vs_viral_db.bls_viruses_viroids_ICTV.txt
│   │   ├── blastx
│   │   │   └── NT
│   │   │       ├── sample_name_cap3_21-22nt_blastx_vs_NT.bls
│   │   │       ├── sample_name_cap3_21-22nt_blastx_vs_NT_top5Hits.txt
│   │   │       ├── sample_name_cap3_21-22nt_blastx_vs_NT_top5Hits_virus_viroids_final.txt
│   │   │       └── summary_sample_name_cap3_21-22nt_blastx_vs_NT_top5Hits_virus_viroids_final.txt
│   │   └── tblastn
│   │       └── viral_db
│   │           ├── sample_name_cap3_21-22nt_getorf.all.fasta
│   │           ├── sample_name_cap3_21-22nt_getorf.all_tblastn_vs_viral_db_out.bls
│   │           └── sample_name_cap3_21-22nt_getorf.all_tblastn_vs_viral_db_top5Hits_virus_viroids_final.txt
│   └── Summary
│       ├── run_top_scoring_targets_with_cov_stats_with_cont_flag_FPKM_0.01_21-22nt.txt
│       └── run_top_scoring_targets_with_cov_stats_with_cont_flag_FPKM_0.01_21-22nt_viral_db.txt
└── 02_VirusDetect
    └── sample_name
    │   ├── blastn.reference.fa
    │   ├── blastn_references
    │   ├── blastx.reference.fa
    │   ├── blastx_references
    │   ├── contig_sequences.blastn.fa
    │   ├── contig_sequences.blastx.fa
    │   ├── contig_sequences.fa
    │   ├── contig_sequences.undetermined.fa
    │   ├── sample_name_21-22nt.blastn.html
    │   ├── sample_name_21-22nt.blastn.sam
    │   ├── sample_name_21-22nt.blastn_spp.txt
    │   ├── sample_name_21-22nt.blastn.summary.filtered.txt
    │   ├── sample_name_21-22nt.blastn.summary.txt
    │   ├── sample_name_21-22nt.blastn.txt
    │   ├── sample_name_21-22nt.blastx.html
    │   ├── sample_name_21-22nt.blastx.sam
    │   ├── sample_name_21-22nt.blastx.summary.txt
    │   └── sample_name_21-22nt.blastx.txt
    └── Summary
        ├── run_summary_top_scoring_targets_virusdetect_21-22nt_filtered.txt
        └── run_summary_virusdetect_21-22nt.txt
-Under the 00_quality_filtering folder:

◦ a folder is created for each sample which contains zipped quality filtered fastq files, associated QC files and logs

◦ under the qc_report folder, a read size distribution PDF file and a read RNA source PDF file are created. The folder also includes a run_qc_report text file
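
Given this layout, the samples that were processed can be listed by walking the 00_quality_filtering folder and skipping qc_report. The sketch below assumes the folder names shown in the tree above; `list_samples` is a hypothetical helper, not part of VirReport.

```shell
#!/bin/sh
# list_samples is a hypothetical helper, not part of VirReport.
# Usage: list_samples RESULTS_DIR
# Prints one sample name per line, taken from the sub-folders of
# RESULTS_DIR/00_quality_filtering, leaving out the qc_report folder.
list_samples() {
    for dir in "$1"/00_quality_filtering/*/; do
        # Skip the unexpanded pattern when no sub-folder exists.
        [ -d "$dir" ] || continue
        name=$(basename "$dir")
        if [ "$name" != "qc_report" ]; then
            echo "$name"
        fi
    done
}
```

For example, `list_samples results` prints the name of each sample folder found under results/00_quality_filtering.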

...


-Under the 01_VirReport folder:

...