nf-eresearch/VirReport - Diagnosis of plant viruses and viroids using small RNA-seq data

Aims:

Implement an end-to-end bioinformatics workflow that is reproducible, robust, scalable and agnostic to the compute infrastructure
Leverage the host plant antiviral response pathway to increase the sensitivity and specificity of pathogen detection
Prevent or minimise the reporting of cross-sample contamination caused by index hopping events (false-positive detections)

Pre-requisites

conda3 or miniconda3 installed (see Installing on Linux in the conda documentation)
Basic Unix command-line knowledge (for example: Learning Resources: the Linux Command Line; The Unix Shell: Summary and Setup)
Familiarity with a Unix text editor (for example, Vi/Vim or Nano):
- Vim (VIM Guide | Computational Biology Core; Editors (Vim))
- Nano (Basic tutorial for Nano users; https://www.howtogeek.com/howto/42980/the-beginners-guide-to-nano-the-linux-command-line-text-editor/)
An HPC account on QUT’s lyra. Apply for a new HPC account if you do not have one.

Nextflow installed (see the Nextflow quick start guide)

Database

A custom virus database is provided. Please do not distribute it to third parties. Location:

/work/img/databases/

Creating a local BLAST database from a nucleotide FASTA file (here test.fasta):

makeblastdb -in test.fasta -parse_seqids -dbtype nucl
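Because `-parse_seqids` indexes each sequence by its identifier, `makeblastdb` stops with an error when the FASTA file contains the same identifier twice. The following is a minimal sketch of a pre-check, assuming the identifier is the first whitespace-delimited word of each header line; `duplicate_ids` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every FASTA sequence ID that occurs more than once.
duplicate_ids() {
    # Keep header lines, drop the leading '>', keep only the first word (the ID),
    # then report IDs that appear on more than one header.
    grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | sort | uniq -d
}
```

Running `duplicate_ids test.fasta` prints nothing when every identifier is unique, in which case the database can be built as shown.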

Method

We use two Nextflow pipelines to process the virome data. First, we run trimgalore to filter out poor-quality reads and bases and to remove adapter sequences. Then we run VirReport to assess the presence of viruses and viroids.

1) Quality Control of Raw Files

First, generate an ‘index.csv’ file that lists each sample ID and the path to its raw data file:

sampleId,read1
CB,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/CB_H52LJDRX2_TCATGCGT_L001_R1.fastq.gz
CM,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/CM_H52LJDRX2_CTGCATCA_L001_R1.fastq.gz
CP,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/CP_H52LJDRX2_TCAGACTT_L001_R1.fastq.gz
TB1,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/TB1_H52LJDRX2_TCACTACG_L001_R1.fastq.gz
TBG,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/TBG_H52LJDRX2_CTTCACGA_L001_R1.fastq.gz
TM,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/TM_H52LJDRX2_CGTTCTGC_L001_R1.fastq.gz
TP,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/TP_H52LJDRX2_AAGTTATC_L001_R1.fastq.gz
TPS,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/TPS_H52LJDRX2_CTTCTTAA_L001_R1.fastq.gz
TR1,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/TR1_H52LJDRX2_TCAGTGAG_L001_R1.fastq.gz
TR2,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/TR2_H52LJDRX2_TGACCGCG_L001_R1.fastq.gz
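For runs with many samples, the index file can be generated rather than typed by hand. The following is a minimal sketch, assuming every read file ends in `_R1.fastq.gz` and that the sample ID is the part of the file name before the first underscore, as in the listing above. `make_index` is a hypothetical helper name and is not part of either pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write an index file (sampleId,read1) listing every
# *_R1.fastq.gz file found directly inside a raw-data directory.
make_index() {
    local raw_dir="$1" out="${2:-index.csv}"
    local fq name
    echo "sampleId,read1" > "$out"
    for fq in "$raw_dir"/*_R1.fastq.gz; do
        [ -e "$fq" ] || continue      # the glob matched nothing
        name=$(basename "$fq")
        # Assumption: sample ID = text before the first underscore
        echo "${name%%_*},$fq" >> "$out"
    done
}
```

For the data set above this would be called as `make_index /work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2 index.csv`. It is worth opening the result to confirm that the derived sample IDs are the ones you expect.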

Create a PBS Pro submission script:
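As a rough sketch only, the snippet below writes a minimal PBS Pro script to a file. The script name `launch_qc.pbs`, the resource requests, the `module load` line, the `singularity` profile, and in particular the pipeline placeholder and its `--input` parameter are all assumptions, since the QC pipeline is not named here. Replace them with the values documented for your pipeline and cluster.

```shell
# Write a hypothetical PBS Pro submission script for the QC step.
cat > launch_qc.pbs <<'EOF'
#!/bin/bash -l
#PBS -N trimgalore_qc
#PBS -l select=1:ncpus=2:mem=4gb
#PBS -l walltime=24:00:00

# Run from the directory the job was submitted from
cd "$PBS_O_WORKDIR"

# Assumption: Java is provided as an environment module on your cluster
module load java

# Placeholder: set this to the QC pipeline you use; --input is a
# hypothetical parameter name, check the pipeline's own documentation
QC_PIPELINE="<your_qc_pipeline>"
nextflow run "$QC_PIPELINE" -profile singularity --input index.csv -resume
EOF
```

The `#PBS` lines request one chunk with 2 CPUs and 4 GB of memory for up to 24 hours. Nextflow itself needs few resources when the pipeline's processes are dispatched as separate jobs.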

Submit the job to the HPC scheduler:

Check progress of the job:
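Submission and monitoring rely on the standard PBS Pro commands `qsub` and `qstat`. The sketch below wraps both in a hypothetical helper, `submit_and_check`, and assumes the submission script was saved under a name of your choosing (for example `launch_qc.pbs`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: submit a PBS script, then show the job's queue status.
submit_and_check() {
    local script="$1" job_id
    job_id=$(qsub "$script") || return 1   # qsub prints the new job ID
    echo "Submitted job: $job_id"
    qstat "$job_id"                        # one-off status check
}
```

Typed directly at the prompt, the same two steps are `qsub launch_qc.pbs` followed by `qstat -u $USER`, which lists all of your queued and running jobs.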

2) Diagnosis of plant viruses and viroids

Installing VirReport

The open-source VirReport code is available at https://github.com/eresearchqut/VirReport

On the HPC, run the following command to get a copy of the source code:
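A copy of the repository listed above can be obtained with `git clone`. The sketch below assumes `git` is available on the HPC login node. `fetch_virreport` is a hypothetical helper that clones the repository on first use and updates it afterwards.

```shell
#!/usr/bin/env bash
VIRREPORT_REPO="https://github.com/eresearchqut/VirReport.git"

# Hypothetical helper: clone VirReport, or update an existing clone.
fetch_virreport() {
    local dest="${1:-VirReport}"
    if [ -d "$dest/.git" ]; then
        git -C "$dest" pull --ff-only      # existing clone: fast-forward only
    else
        git clone "$VIRREPORT_REPO" "$dest"
    fi
}
```

Calling `fetch_virreport` with no argument places the code in a `VirReport` directory under the current working directory.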

Alternatively, run the following command to fetch and also test VirReport:
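Nextflow can download a pipeline straight from GitHub when `nextflow run` is given an `owner/repository` name. The sketch below assumes that VirReport defines `test` and `singularity` profiles. Both profile names are assumptions to be checked against the VirReport README, and `test_virreport` is a hypothetical wrapper.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: fetch VirReport from GitHub and run its test data.
# Assumption: the pipeline defines `test` and `singularity` profiles.
test_virreport() {
    nextflow run eresearchqut/VirReport -profile test,singularity "$@"
}
```

Any extra arguments, such as `-resume`, are passed through to Nextflow unchanged.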

Note: the above command will store a cached copy of VirReport at '$HOME/.nextflow/NXF_SINGULARITY_CACHEDIR'

Running VirReport

1. Sample index file

To run VirReport, create an 'index_samples.csv' file that specifies, for each sample, the sample ID, the path to the raw data, and the minimum and maximum length of the reads to be used for diagnosis. For example:
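As a sketch of what such a file could look like, the snippet below writes a two-sample index. The column names (`sampleid`, `samplepath`, `minlen`, `maxlen`) are assumed and should be checked against the VirReport README. The file paths are hypothetical, and the 21 to 22 nt size range is only an illustration.

```shell
# Write a hypothetical two-sample index for VirReport.
# Assumed columns: sample ID, path to reads, minimum and maximum read length.
cat > index_samples.csv <<'EOF'
sampleid,samplepath,minlen,maxlen
CB,/path/to/trimmed/CB_trimmed.fastq.gz,21,22
CM,/path/to/trimmed/CM_trimmed.fastq.gz,21,22
EOF
```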

You can adapt the above template to your own samples. Note that the input files can be the trimgalore-processed files.

2. Run VirReport using a PBS Pro script

Define your own Nextflow configuration if it differs from the provided template:
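A site-specific configuration might look like the sketch below, which writes a small override file. The file name `custom.config` and the chosen values are assumptions. The `process.executor` setting and the `singularity` scope are standard Nextflow configuration options, but any VirReport-specific parameters should be taken from the template shipped with the pipeline.

```shell
# Write a hypothetical Nextflow configuration override.
cat > custom.config <<'EOF'
// Assumption: pipeline processes are submitted as PBS Pro jobs
process {
    executor = 'pbspro'
}

// Assumption: containers are run with Singularity and cached under $HOME
singularity {
    enabled    = true
    autoMounts = true
    cacheDir   = "$HOME/.nextflow/NXF_SINGULARITY_CACHEDIR"
}
EOF
```

Such a file is passed to Nextflow with the `-c` option, for example `nextflow -c custom.config run ...`, and its settings take precedence over the pipeline's defaults.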

Prepare a PBS Pro submission script:
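The snippet below is a minimal sketch of such a script, written to a file named `launch_virreport.pbs`. The file name, the resource requests, the `module load` line, the `singularity` profile and the `--indexfile` parameter are assumptions to be verified against the VirReport README and your cluster's documentation.

```shell
# Write a hypothetical PBS Pro submission script for VirReport.
cat > launch_virreport.pbs <<'EOF'
#!/bin/bash -l
#PBS -N virreport
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=24:00:00

# Run from the directory the job was submitted from
cd "$PBS_O_WORKDIR"

# Assumption: Java is provided as an environment module on your cluster
module load java

# Assumption: --indexfile is the parameter that takes the sample index
nextflow run eresearchqut/VirReport -profile singularity \
    --indexfile index_samples.csv -resume
EOF
```

The `-resume` flag lets Nextflow reuse cached results from an earlier run, which is useful if the job hits the walltime limit and has to be resubmitted.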