Introduction
eresearchqut/VirReport is a bioinformatics pipeline built on the scientific workflow manager Nextflow. It was designed to support phytosanitary diagnostics of virus and viroid pathogens in quarantine facilities. It takes small RNA-Seq fastq files as input. These can be either raw (currently only samples prepared with the QIAGEN QIAseq miRNA library kit can be processed this way) or quality-filtered.
The pipeline can perform BLAST homology searches against a virus database, a local copy of the NCBI nr and nt databases, or both.
Nextflow is a workflow management software which enables the writing of scalable and reproducible scientific workflows. It can integrate various software package and environment management systems such as Docker, Singularity, and Conda. It allows for existing pipelines written in common scripting languages, such as Python and R, to be seamlessly coupled together. It implements a Domain Specific Language (DSL) that simplifies the implementation and running of workflows on cloud or high-performance computing (HPC) infrastructures. For a good introduction to Nextflow please refer to the following training materials:
https://www.nextflow.io/docs/latest/getstarted.html
https://carpentries-incubator.github.io/workflows-nextflow/
Pipeline prerequisites
Basic Unix command-line knowledge (https://researchcomputing.princeton.edu/education/external-online-resources/linux; https://swcarpentry.github.io/shell-novice/)
Familiarity with one of the Unix text editors, e.g. Vim (https://bioinformatics.uconn.edu/vim-guide/; https://missing.csail.mit.edu/2020/editors/) or Nano (https://engineering.purdue.edu/ECN/Support/KB/Docs/BasictutorialforNanou; https://www.howtogeek.com/howto/42980/the-beginners-guide-to-nano-the-linux-command-line-text-editor/)
Java 11 or later, Nextflow, and Docker/Singularity/Conda to suit your environment.
Installing Java and Nextflow
You can follow the steps outlined in the Nextflow documentation to install Java and Nextflow on your local machine or server: https://www.nextflow.io/docs/latest/getstarted.html
This link specifically describes the steps to take to load Java and install Nextflow on our local HPC at QUT (Lyra): Nextflow
Installing a suitable environment management system
To run the VirReport pipeline, you will need to install a suitable environment management system such as Docker, Singularity or Conda to suit your environment.
We use conda or miniconda on our HPC.
Installing VirReport
The open-source VirReport code is available at https://github.com/eresearchqut/VirReport
Run the following command to get a copy of the source code:
git clone https://github.com/eresearchqut/VirReport.git
Running the pipeline
Testing the pipeline on minimal test datasets:
Running these test datasets requires 2 CPUs and 8 GB of memory and should take less than 5 minutes to complete.
Make sure your Nextflow config file is set to “local” mode to run these tests:
process {
  executor = 'local'
  beforeScript = {
    """
    source $HOME/.bashrc
    source $HOME/.profile
    """
  }
}
This first command will test your installation using a single quality-filtered fastq file (called test.fastq.gz) derived from a sample infected with citrus exocortis viroid and will run VirReport using a mock NCBI database:
nextflow -c conf/test.config run eresearchqut/VirReport -profile test,{docker, singularity or conda}
This second command will test your installation using a pair of raw fastq files (called test_pair_1.fastq.gz and test_pair_2.fastq.gz) derived from a sample infected with citrus tristeza virus and will run VirReport using a mock viral database:
nextflow -c conf/test2.config run eresearchqut/VirReport -profile test2,{docker, singularity or conda}
If both of these tests finish successfully, this means that the pipeline was set up properly. You are now all set to analyse your own samples.
Running the pipeline with your own data
Provide an index.csv file
Create a comma-separated text file that will be the input for the workflow. By default the pipeline will look for a file called “index.csv” in the base directory, but you can specify any file name using the --indexfile [filename] option in the nextflow run command. This text file requires the following columns (which need to be included as a header): sampleid,samplepath
sampleid will be the sample name that will be given to the files created by the pipeline
samplepath is the full path to the fastq files that the pipeline requires as starting input
An index_example.csv is included in the base directory:
sampleid,samplepath
MT212,/work/diagnostics/2021/MT212_21-22bp.fastq
MT213,/work/diagnostics/2021/MT213_21-22bp.fastq
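Before launching a run, it can help to check the index file for the most common mistakes. The following is a minimal sketch, not part of VirReport: the function name check_index is hypothetical, and it assumes the comma-separated layout shown in index_example.csv, with a header line containing sampleid and samplepath.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = ["sampleid", "samplepath"]


def check_index(index_path, delimiter=","):
    """Return a list of human-readable problems found in the index file."""
    problems = []
    with open(index_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        header = reader.fieldnames or []
        missing = [col for col in REQUIRED_COLUMNS if col not in header]
        if missing:
            return ["missing header column(s): " + ", ".join(missing)]
        seen = set()
        for row in reader:
            line_no = reader.line_num  # line of the file just read
            sample = (row["sampleid"] or "").strip()
            path = (row["samplepath"] or "").strip()
            if not sample:
                problems.append(f"line {line_no}: empty sampleid")
            elif sample in seen:
                problems.append(f"line {line_no}: duplicate sampleid {sample}")
            seen.add(sample)
            # samplepath is expected to be the full path to an existing fastq file
            if not path or not Path(path).is_file():
                problems.append(f"line {line_no}: fastq file not found: {path}")
    return problems
```

Calling check_index("index.csv") returns an empty list when the header is present, every sampleid is unique, and every samplepath points to an existing file.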
Provide a database
By default, the pipeline is set to run homology blast searches against a local plant virus/viroid database (this is set in the nextflow.config file with the parameter --virreport_viral_db = true). You will need to provide this database to run the pipeline. You can either provide your own or use a curated database provided at https://github.com/maelyg/PVirDB.git . Ensure you use NCBI BLAST+ makeblastdb to create the database. For instance, to set up this database, you would take the following steps:
git clone https://github.com/maelyg/PVirDB.git
cd PVirDB
gunzip PVirDB_v1.fasta.gz
makeblastdb -in PVirDB_v1.fasta -parse_seqids -dbtype nucl
Then specify the full path to the database files including the prefix in the nextflow.config file. For example:
params { blast_local_db_path = '/path_to_viral_DB/viral_DB_name' }
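A wrong prefix in blast_local_db_path is easy to miss until the BLAST step fails. The sketch below is not part of VirReport and the function name missing_blast_nucl_files is hypothetical. It assumes a single-volume nucleotide database, for which makeblastdb writes at least the .nhr, .nin and .nsq files next to the prefix. Note that when makeblastdb is run without -out, the database is named after the input file, so the prefix ends in PVirDB_v1.fasta in the example above.

```python
from pathlib import Path

# Core files written by makeblastdb for a single-volume nucleotide database
CORE_NUCL_EXTENSIONS = (".nhr", ".nin", ".nsq")


def missing_blast_nucl_files(db_prefix):
    """Return the core extensions that are absent for the given database prefix."""
    prefix = str(db_prefix)
    return [ext for ext in CORE_NUCL_EXTENSIONS
            if not Path(prefix + ext).is_file()]
```

An empty list from missing_blast_nucl_files('/path_to_viral_DB/viral_DB_name') suggests the prefix points at a usable database; any listed extension indicates that the path or prefix needs correcting.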
If you also want to run homology searches against public NCBI databases, you need to set the parameter virreport_ncbi in the nextflow.config file to true:
params { virreport_ncbi = true }
or add it in your nextflow command:
nextflow run eresearchqut/VirReport -profile {docker, singularity or conda} --virreport_ncbi
Download the NCBI nt and nr databases locally, following the detailed steps available at https://www.ncbi.nlm.nih.gov/books/NBK569850/ . Create a folder where you will store your NCBI databases. It is good practice to include the date of download in its name. For instance:
mkdir -p blastDB/30112021
You will need to use the update_blastdb.pl script from the BLAST+ version used with the pipeline.
For example:
perl update_blastdb.pl --decompress nt [*]
perl update_blastdb.pl --decompress nr [*]
perl update_blastdb.pl taxdb
tar -xzf taxdb.tar.gz
Make sure the taxdb.btd and the taxdb.bti files are present in the same directory as your blast databases.
Specify the path of your local NCBI blast nt and nr directories in the nextflow.config file.
For instance:
params { blast_db_dir = '/work/hia_mt18005_db/blastDB/20220408' }
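The taxonomy files are easy to overlook after a download. As a minimal sketch (the function name missing_taxdb_files is hypothetical and not part of VirReport), the following checks that taxdb.btd and taxdb.bti sit in the directory given as blast_db_dir:

```python
from pathlib import Path

# Taxonomy files that must sit alongside the nt and nr databases
TAXDB_FILES = ("taxdb.btd", "taxdb.bti")


def missing_taxdb_files(blast_db_dir):
    """Return the taxdb file names that are not present in blast_db_dir."""
    db_dir = Path(blast_db_dir)
    return [name for name in TAXDB_FILES if not (db_dir / name).is_file()]
```

If missing_taxdb_files('/work/hia_mt18005_db/blastDB/20220408') returns a non-empty list, extract taxdb.tar.gz into that directory before running the pipeline.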
Run nextflow
nextflow run eresearchqut/VirReport -profile {docker, singularity or conda} --indexfile index.csv
On our HPC you can specify either singularity or conda as the profile.
A cached environment will be built in your home directory under either the cached singularity or conda directory. This step might take some time the first time you run the pipeline.