1. Introduction
eresearchqut/VirReport is a bioinformatics pipeline built on the scientific workflow manager Nextflow. It is designed to support phytosanitary diagnostics of virus and viroid pathogens in quarantine facilities. It takes small RNA-Seq fastq files as input. These can be either raw (currently only samples specifically prepared with the QIAGEN QIAseq miRNA library kit can be processed this way) or quality-filtered.
For target identification, VirReport uses a hybrid de novo assembly approach to build contigs, which are then annotated using blast homology searches against a virus database and/or a local copy of the NCBI nr and nt databases.
Nextflow is a workflow management software which enables the writing of scalable and reproducible scientific workflows. It can integrate various software package and environment management systems such as Docker, Singularity, and Conda. It allows for existing pipelines written in common scripting languages, such as Python and R, to be seamlessly coupled together. It implements a Domain Specific Language (DSL) that simplifies the implementation and running of workflows on cloud or high-performance computing (HPC) infrastructures. For a good introduction to Nextflow please refer to the following training materials:
https://www.nextflow.io/docs/latest/getstarted.html
https://carpentries-incubator.github.io/workflows-nextflow/
2. Pipeline summary
The VirReport workflow will perform the following steps by default:
Retain reads of a given length (21-22 nt long by default) from fastq file(s) provided in the index.csv file (READPROCESSING)
De novo assembly using both Velvet and SPAdes. The contigs obtained are collapsed into scaffolds using cap3. By default, only contigs > 40 bp will be retained (DENOVO_ASSEMBLY)
Run megablast homology search against either a local virus database or NCBI NT/NR databases:
Searches against a local virus database:
Run megablast homology searches on the de novo assembly against a local virus and viroid database. Homology searches using the blastn algorithm are also run in parallel for comparison with the megablast algorithm (BLASTN_VIRAL_DB_CAP3)
Retain top megablast hit and restrict results to virus and viroid matches. Summarise results by grouping all the de novo contigs matching to the same viral hit and deriving the cumulative blast coverage and percent identity for each viral hit (FILTER_BLASTN_VIRAL_DB_CAP3)
Align reads to top hit, derive coverage statistics and consensus sequence matching to top blast hit (FILTER_BLAST_NT_VIRAL_DB_CAP3, COVSTATS_VIRAL_DB)
Run tblastn homology search on predicted ORFs >= 90 bp derived using getORF (TBLASTN_VIRAL_DB)
Searches against local NCBI NT and NR databases:
Retain top 5 megablast hits and restrict results to virus and viroid matches. Summarise results by grouping all the de novo contigs matching to the same viral hit and deriving the cumulative blast coverage and percent ID for each viral hit (BLASTN_NT_CAP3)
Align reads to top hit, derive coverage statistics, consensus sequence and VCF matching to top blast hits (COVSTATS_NT)
Run blastx homology search on contigs >= 90 bp long for which no match was obtained in the megablast search. Summarise the blastx results and restrict to virus and viroid matches (BLASTX)
The pipeline can perform additional optional steps, which include:
A quality filtering step on raw fastq files (currently the workflow only processes samples prepared using the QIAGEN QIAseq miRNA library kit). After performing quality filtering (FASTQC_RAW, ADAPTER_TRIMMING, QUAL_TRIMING_AND_QC, DERIVE_USABLE_READS), the pipeline will also derive a QC report (QCREPORT). An RNA source profile can also be included as part of the quality filtering step (RNA_SOURCE_PROFILE, RNA_SOURCE_PROFILE_REPORT)
VirusDetect version 1.8 can also be run in parallel. A summary of the top virus/viroid blastn hits will be separately output (VIRUS_DETECT, VIRUS_IDENTIFY, VIRUS_DETECT_BLASTN_SUMMARY, VIRUS_DETECT_BLASTN_SUMMARY_FILTERED)
3. Pipeline prerequisites
Basic unix command line knowledge (https://researchcomputing.princeton.edu/education/external-online-resources/linux; https://swcarpentry.github.io/shell-novice/)
Familiarity with one of the unix text editors, e.g. VIM (https://bioinformatics.uconn.edu/vim-guide/; https://missing.csail.mit.edu/2020/editors/) or Nano (https://engineering.purdue.edu/ECN/Support/KB/Docs/BasictutorialforNano; https://www.howtogeek.com/howto/42980/the-beginners-guide-to-nano-the-linux-command-line-text-editor/)
Java 11 or later, Nextflow, and Docker/Singularity/Conda to suit your environment.
3A. Installing Java and Nextflow
You can follow the steps outlined in the Nextflow documentation to install Java and Nextflow on your local machine or server: https://www.nextflow.io/docs/latest/getstarted.html
This link specifically describes the steps to take to load Java and install Nextflow on our local HPC at QUT (Lyra): Nextflow
3B. Installing a suitable environment management system
To run the VirReport pipeline, you will need to install a suitable environment management system such as Docker, Singularity or Conda to suit your environment.
We recommend using Singularity.
3C. Installing VirReport
The open-source VirReport code is available at https://github.com/eresearchqut/VirReport
Run the following command to get a copy of the source code:
git clone https://github.com/eresearchqut/VirReport.git
The environment.yml file in the pipeline base directory lists all the required modules and tools used in the pipeline.
4. Running the pipeline
You can either invoke the pipeline by pointing to the location of main.nf in the version of VirReport you cloned, for example:
nextflow run ~/path_to_local_VirReport_copy/main.nf
or run eresearchqut/VirReport directly:
nextflow run eresearchqut/VirReport
You will have to specify a profile to run the pipeline:
nextflow run eresearchqut/VirReport -profile {docker, singularity or conda}
We recommend using Singularity as the profile.
A cached environment will be built in your home directory under the cached singularity directory. This step will take some time the first time you run the pipeline.
4A. Testing the pipeline on minimal test dataset:
Running these test datasets requires 2 CPUs and 8 GB of memory and should take less than 5 minutes to complete.
Make sure you have your nextflow config file set to “local” mode to run these tests:
process {
  executor = 'local'
  beforeScript = { """
    source $HOME/.bashrc
    source $HOME/.profile
  """ }
}
This first command will test your installation using a single quality-filtered fastq file (called test.fastq.gz) derived from a sample infected with citrus exocortis viroid and will run VirReport using a mock NCBI database (we recommend using Singularity on our local Lyra server at QUT):
nextflow -c conf/test.config run eresearchqut/VirReport -profile test,{docker, singularity or conda} -latest -r main
This second command will test your installation using a pair of raw fastq files (called test_pair_1.fastq.gz and test_pair_2.fastq.gz) derived from a sample infected with citrus tristeza virus and will run VirReport using a mock viral database (we recommend using Singularity on our local Lyra server at QUT):
nextflow -c conf/test2.config run eresearchqut/VirReport -profile test2,{docker, singularity or conda} -latest -r main
If both of these tests finish successfully, this means that the pipeline was set up properly.
You can have a look at the files that have been created under the results folder.
You are now all set to analyse your own samples.
4B. Running the pipeline with your own data
Provide an index.csv file
Create a comma-delimited text file that will be the input for the workflow to run. By default the pipeline will look for a file called “index.csv” in the base directory, but you can specify any file name using --indexfile [filename] in the nextflow run command. This text file requires the following columns (which need to be included as a header): sampleid,samplepath
sampleid will be the sample name that will be given to the files created by the pipeline
samplepath is the full path to the fastq files that the pipeline requires as starting input
An index_example.csv is included in the base directory:
sampleid,samplepath
MT212,/work/diagnostics/2021/MT212_21-22bp.fastq
MT213,/work/diagnostics/2021/MT213_21-22bp.fastq
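Before launching a run, it can help to check that the index file is well formed. The snippet below is a minimal sketch and not part of VirReport: the function name read_index is hypothetical, and it assumes the file is comma-separated with exactly the two columns shown in index_example.csv.

```python
import csv
import io

# Header expected by the pipeline, as described above.
EXPECTED_HEADER = ["sampleid", "samplepath"]


def read_index(handle):
    """Return a list of (sampleid, samplepath) tuples from an open index file.

    Hypothetical helper for illustration; assumes a comma-separated file.
    """
    reader = csv.reader(handle)
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        raise ValueError(f"expected header {EXPECTED_HEADER}, found {header}")
    rows = []
    for line_number, fields in enumerate(reader, start=2):
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 2 or not all(field.strip() for field in fields):
            raise ValueError(f"line {line_number}: expected 'sampleid,samplepath'")
        rows.append((fields[0].strip(), fields[1].strip()))
    return rows


# Same content as index_example.csv, read from memory for the demonstration.
example = io.StringIO(
    "sampleid,samplepath\n"
    "MT212,/work/diagnostics/2021/MT212_21-22bp.fastq\n"
    "MT213,/work/diagnostics/2021/MT213_21-22bp.fastq\n"
)
for sampleid, samplepath in read_index(example):
    print(sampleid, samplepath)
```

To check a real file, pass an open file handle instead, for example read_index(open("index.csv", newline="")).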
If you need to set additional parameters, you can either include these in your nextflow run command:
nextflow run eresearchqut/VirReport -profile {singularity, docker or conda} --indexfile index_example.csv --contamination_detection
or set them to true in the nextflow.config file.
params { contamination_detection = true }
Provide a database
By default, the pipeline is set to run homology blast searches against a local plant virus/viroid database (this is set in the nextflow.config file with the parameter virreport_viral_db = true). You will need to provide this database to run the pipeline. A curated database is provided at https://github.com/maelyg/PVirDB.git. Ensure you use NCBI BLAST+ makeblastdb to create the database. For instance, to set up this database, you would take the following steps:
git clone https://github.com/maelyg/PVirDB.git
cd PVirDB
gunzip PVirDB_v1.fasta.gz
makeblastdb -in PVirDB_v1.fasta -parse_seqids -dbtype nucl
Then specify the full path to the database files including the prefix in the nextflow.config file. For example:
params { blast_viral_db_path = '/path_to_viral_DB/viral_DB_name' }
If you also want to run homology searches against public NCBI databases, you need to set the parameter virreport_ncbi to true in the nextflow.config file:
params { virreport_ncbi = true }
or add it in your nextflow command:
nextflow run eresearchqut/VirReport -profile {docker, singularity or conda} --virreport_ncbi
Download these locally, following the detailed steps available at https://www.ncbi.nlm.nih.gov/books/NBK569850/ . Create a folder where you will store your NCBI databases. It is good practice to include the date of download. For instance:
mkdir blastDB/30112021
You will need to use the update_blastdb.pl script from the blast+ version used with the pipeline.
For example:
perl update_blastdb.pl --decompress nt [*]
perl update_blastdb.pl --decompress nr [*]
perl update_blastdb.pl taxdb
tar -xzf taxdb.tar.gz
Make sure the taxdb.btd and the taxdb.bti files are present in the same directory as your blast databases.
Specify the path of your local NCBI blast nt and nr directories in the nextflow.config file.
For instance:
params { blast_db_dir = '/work/hia_mt18005_db/blastDB/20220408' }
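A quick way to confirm that the taxdb files sit next to your blast databases is sketched below. This is an illustrative helper only (missing_taxdb_files is a hypothetical name, not a VirReport or BLAST+ function); it simply looks for taxdb.btd and taxdb.bti in the directory you intend to use as blast_db_dir.

```python
import tempfile
from pathlib import Path

# The two taxonomy files that must accompany the blast databases.
TAXDB_FILES = ("taxdb.btd", "taxdb.bti")


def missing_taxdb_files(blast_db_dir):
    """Return the taxdb file names that are absent from blast_db_dir."""
    directory = Path(blast_db_dir)
    return [name for name in TAXDB_FILES if not (directory / name).is_file()]


# Demonstration in a throwaway directory that holds only one of the two files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "taxdb.btd").touch()
    print("missing:", missing_taxdb_files(tmp))
```

An empty list means both files were found; in practice you would call the function with your own database directory.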
Fastq files
You can either provide raw or pre-filtered fastq files to the pipeline.
By default the pipeline expects a single quality-filtered fastq file per sample.
If you want to provide raw fastq files, samples have to be specifically prepared with the QIAGEN QIAseq miRNA library kit. If you want to run the initial quality filtering step on your raw fastq files, you will need to set the --qualityfilter parameter to true in the nextflow.config file and specify the path to the directory which holds the required bowtie indices (using the --bowtie_db_dir parameter) to: 1) filter non-informative reads (using the blacklist bowtie indices for the DERIVE_USABLE_READS process) and 2) optionally derive the origin of the filtered reads obtained (RNA_SOURCE_PROFILE process). The required fasta files are available at https://github.com/maelyg/bowtie_indices.git and bowtie indices can be built from these using the commands:
git clone https://github.com/maelyg/bowtie_indices.git
cd bowtie_indices
gunzip blacklist_v2.fasta.gz
# you might need to activate your environment cached in either your conda or singularity environment in order to run bowtie
# for example:
conda activate /path_to_cached_environment/virreport-77d02f3abe1d8ba5f8dfdff194142de9
# then run the bowtie command
bowtie-build -f blacklist_v2.fasta blacklist
The directory in which the bowtie indices are located will need to be specified in the nextflow.config file:
params { bowtie_db_dir = '/path_to_bowtie_idx_directory' }
If you want to derive an RNA source profile of your fastq files, you will need to specify:
params { rna_source_profile = true }
And build the other indices from the fasta files included in https://github.com/maelyg/bowtie_indices.git (i.e. rRNA, plant_tRNA, plant_noncoding, plant_pt_mt_other_genes, artefacts, miRNA, virus).
The quality filtering step will create the 00_quality_filtering folder under the results folder:
results/
└── 00_quality_filtering
    ├── sample_name
    │   ├── sample_name_18-25nt_cutadapt.log
    │   ├── sample_name_fastqc.html
    │   ├── sample_name_fastqc.zip
    │   ├── sample_name_21-22nt_cutadapt.log
    │   ├── sample_name_21-22nt.fastq.gz
    │   ├── sample_name_24nt_cutadapt.log
    │   ├── sample_name_blacklist_filter.log
    │   ├── sample_name_fastp.html
    │   ├── sample_name_fastp.json
    │   ├── sample_name_qual_filtering_cutadapt.log
    │   ├── sample_name_quality_trimmed_fastqc.html
    │   ├── sample_name_quality_trimmed_fastqc.zip
    │   ├── sample_name_quality_trimmed.fastq.gz
    │   ├── sample_name_read_length_dist.pdf
    │   ├── sample_name_read_length_dist.txt
    │   ├── sample_name_truseq_adapter_cutadapt.log
    │   └── sample_name_umi_tools.log
    └── qc_report
        ├── read_origin_counts.txt
        ├── read_origin_detailed_pc.txt
        ├── read_origin_pc_summary.txt
        ├── run_qc_report.txt
        └── run_read_size_distribution.pdf
If your sequencing run was split across multiple lanes, you might have several raw fastq files per sample. You can feed these directly to the pipeline and specify the --merge_lane parameter. The fastq files will be collapsed to one fastq file before performing downstream analysis. The sample name used will be the sampleid provided in the index.csv file. In the example below, 2 fastq files were generated for 1 sample named CT103:
sampleid,samplepath
CT103,/path_to_fastq_files_directory/CT_103_S10_L001_R1_001.fastq.gz
CT103,/path_to_fastq_files_directory/CT_103_S10_L002_R1_001.fastq.gz
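To see which fastq files will end up under the same sample name, the index rows can be grouped by sampleid. The sketch below only illustrates that grouping; it is not the pipeline's own merging code, and group_lanes is a hypothetical helper name.

```python
from collections import defaultdict


def group_lanes(rows):
    """Group (sampleid, samplepath) pairs by sampleid, keeping input order."""
    grouped = defaultdict(list)
    for sampleid, samplepath in rows:
        grouped[sampleid].append(samplepath)
    return dict(grouped)


# The two index rows from the CT103 example above.
rows = [
    ("CT103", "/path_to_fastq_files_directory/CT_103_S10_L001_R1_001.fastq.gz"),
    ("CT103", "/path_to_fastq_files_directory/CT_103_S10_L002_R1_001.fastq.gz"),
]
for sampleid, paths in group_lanes(rows).items():
    print(sampleid, len(paths), "fastq file(s)")
```

A sample listed with more than one path is one that relies on the lane merging step.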
Running homology searches against viral database PVirDB
nextflow run eresearchqut/VirReport -profile singularity -resume --indexfile index.csv \
  --merge_lane --qualityfilter --rna_source_profile \
  --bowtie_db_dir /path_to_bowtie_indices \
  --virreport_viral_db --blast_viral_db_path /path_to_local_viral_database
Running homology searches against NCBI NT/NR database
nextflow run eresearchqut/VirReport -profile singularity -resume --indexfile index.csv \
  --merge_lane --qualityfilter --rna_source_profile \
  --bowtie_db_dir /path_to_bowtie_indices \
  --virreport_ncbi --blast_db_dir /path_to_ncbi_databases
Deduplicate reads using unique molecular identifiers and mapping coordinates
If you used the QIAGEN QIAseq miRNA library kit for cDNA library preparation, your reads will include unique molecular identifiers (UMIs) at their 5' end. These are automatically extracted from reads and incorporated at the end of the read name when the pipeline is run using the --qualityfilter parameter.
If you want to deduplicate reads from the BAM files that are derived at the alignment step, you will need to specify the --dedup parameter. UMI extraction and deduplication are performed with UMI-tools.
Summarising detections and flagging potential contamination events
If you want to derive a summary of detections for all the samples included in the index file, specify the --detection_reporting_viral_db or the --detection_reporting_nt option. This will create a summary text file under the Summary folder, with a column called contamination_flag.
The contamination flag rests on the assumption that if a pest is present at high titer in a given sample and reads matching this pathogen are detected in other samples at a significantly lower abundance, there is a risk that the lower signal is due to contamination (e.g. index hopping from the high-titer sample). We first calculate the maximum FPKM value recorded for each virus and viroid identified on a run. If, for a given virus, the FPKM value reported for a sample is less than a set percentage of this maximum FPKM value, the detection is flagged as a contamination event. A threshold of 1% is applied by default. This flag is only indicative: the method cannot discriminate between false positives and viruses present at very low titer in a plant. It is therefore recommended to compare the sequences obtained, check the SNPs and validate the result through an independent method.
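The flagging rule described above can be sketched in a few lines. This is an illustration only, not the pipeline's own implementation: the function name flag_contamination, the dictionary keys and the FPKM values are all invented for the example, and the real summary file may encode the flag differently.

```python
def flag_contamination(detections, threshold=0.01):
    """Flag detections whose FPKM is below threshold x the run-wide maximum.

    detections: list of dicts with keys 'sample', 'virus' and 'fpkm'
    (hypothetical structure chosen for this sketch).
    """
    # Step 1: maximum FPKM recorded for each virus/viroid across the run.
    max_fpkm = {}
    for detection in detections:
        virus = detection["virus"]
        max_fpkm[virus] = max(max_fpkm.get(virus, 0.0), detection["fpkm"])

    # Step 2: flag any detection below the chosen fraction of that maximum.
    flagged = []
    for detection in detections:
        cutoff = threshold * max_fpkm[detection["virus"]]
        flagged.append({**detection, "contamination_flag": detection["fpkm"] < cutoff})
    return flagged


# Invented values: virus_A is abundant in S1 and barely present in S2.
run = [
    {"sample": "S1", "virus": "virus_A", "fpkm": 10000.0},
    {"sample": "S2", "virus": "virus_A", "fpkm": 50.0},
    {"sample": "S3", "virus": "virus_A", "fpkm": 500.0},
]
for row in flag_contamination(run):
    print(row["sample"], row["virus"], row["fpkm"], row["contamination_flag"])
```

With the default 1% threshold, only the S2 detection (0.5% of the maximum) is flagged; S3 sits at 5% of the maximum and is left unflagged.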
Running in diagnostic mode (for internal use by the SSG team only)
If you want to run VirReport in diagnostics mode (--diagno), the pipeline will also add an evidence category (i.e. KNOWN, KNOWN_FRAGMENT and CANDIDATE_NOVEL) to each detection based on av-pident and % bases 10X.
If you are running homology searches against the NCBI NT database, you will also need to provide a list of pests of interest in the Targetted_Virus_Viroid.txt file located in the bin folder. If some of the detections match to this pest list, they will be categorised as Quarantinable versus Higher_plant_viruses in the final summary.
Finally, if sample information is provided (--sampleinfo --sampleinfo_path /path/to/sampleinfo.txt and --samplesheet_path /path/to/Sample_Sheet.csv), this will be added to the final summary.
Sampleinfo.txt file example:
Sample  PEQ_index_number  LIMS_ID_RAMACIOTTI  Host_species    Host_common_name  Plant_tissue_collected
MT498   P30               ELL110002A1         Allium sativum  Garlic            50 mg leaf
Running VirusDetect
VirusDetect version 1.8 can also be run in parallel.
See http://virusdetect.feilab.net/cgi-bin/virusdetect/index.cgi for details about this separate pipeline.
Example of PBS script to run on an HPC with torque batch system
Make sure to either specify the full path to your index.csv file in the PBS script or place a copy of the index.csv file in the folder you will run the PBS script in.
Example 1:
The PBS script example below (VirReport_nextflow.sh) will run on raw fastq files that will need to be merged and then quality filtered.
We are also asking the pipeline to run a process that will derive an RNA source profile for each sample during the quality filtering step.
Blastn (using the megablast algorithm) and tblastn homology searches will be run against the PVirDB.
Finally we will want the reads to be de-duplicated after mapping.
#!/bin/bash -l
#PBS -N VirReport
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=05:00:00

cd $PBS_O_WORKDIR
module load java
export NXF_OPTS='-Xms1g -Xmx4g'

nextflow run eresearchqut/VirReport -profile singularity -resume --indexfile index.csv \
  --merge_lane --qualityfilter --rna_source_profile \
  --bowtie_db_dir /path_to_bowtie_indices \
  --dedup \
  --virreport_viral_db --blast_viral_db_path /path_to_local_viral_database \
  --detection_reporting_viral_db
Example 2:
In the PBS job below, blastn (using the megablast algorithm) and blastx homology searches will be run against NCBI NT and NR respectively.
#!/bin/bash -l
#PBS -N VirReport
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=05:00:00

cd $PBS_O_WORKDIR
module load java
export NXF_OPTS='-Xms1g -Xmx4g'

nextflow run eresearchqut/VirReport -profile singularity -resume --indexfile index.csv \
  --merge_lane --qualityfilter --rna_source_profile \
  --bowtie_db_dir /path_to_bowtie_indices \
  --dedup \
  --virreport_ncbi --blast_db_dir /path_to_ncbi_databases \
  --detection_reporting_nt
Example 3:
In the PBS job below, homology searches will be run against NCBI and the PVirDB. The pipeline will also run VirusDetect in parallel.
#!/bin/bash -l
#PBS -N VirReport
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=05:00:00

cd $PBS_O_WORKDIR
module load java
export NXF_OPTS='-Xms1g -Xmx4g'

nextflow run eresearchqut/VirReport -profile singularity -resume --indexfile index.csv \
  --merge_lane --qualityfilter --rna_source_profile \
  --bowtie_db_dir /path_to_bowtie_indices \
  --dedup \
  --virreport_ncbi --blast_db_dir /path_to_ncbi_databases --detection_reporting_nt \
  --virreport_viral_db --blast_viral_db_path /path_to_local_viral_database --detection_reporting_viral_db \
  --virusdetect --virusdetect_db_path

sampleid,samplepath
MT500,/work/hia_mt18005/raw_data/20220915_RAMACIOTTI_ELL11002_LEL11109/ELL11002/ELL11002A3/MT500_S3_L001_R1_001.fastq.gz
MT500,/work/hia_mt18005/raw_data/20220915_RAMACIOTTI_ELL11002_LEL11109/ELL11002/ELL11002A3/MT500_S3_L002_R1_001.fastq.gz
MT502,/work/hia_mt18005/raw_data/20220915_RAMACIOTTI_ELL11002_LEL11109/ELL11002/ELL11002A5/MT502_S5_L001_R1_001.fastq.gz
MT502,/work/hia_mt18005/raw_data/20220915_RAMACIOTTI_ELL11002_LEL11109/ELL11002/ELL11002A5/MT502_S5_L002_R1_001.fastq.gz
MT512,/work/hia_mt18005/raw_data/20220915_RAMACIOTTI_ELL11002_LEL11109/LEL11109/LEL11109A1/Fn1_S25_L001_R1_001.fastq.gz
MT512,/work/hia_mt18005/raw_data/20220915_RAMACIOTTI_ELL11002_LEL11109/LEL11109/LEL11109A1/Fn1_S25_L002_R1_001.fastq.gz
MT524,/work/hia_mt18005/raw_data/20220915_RAMACIOTTI_ELL11002_LEL11109/LEL11109/LEL11109A13/FraD3_S28_L001_R1_001.fastq.gz
MT524,/work/hia_mt18005/raw_data/20220915_RAMACIOTTI_ELL11002_LEL11109/LEL11109/LEL11109A13/FraD3_S28_L002_R1_001.fastq.gz
CT113,/work/hia_mt18005/raw_data/20220629_RAMACIOTTI_DES10730/DES10730A20/CT_113_S20_L001_R1_001.fastq.gz
CT113,/work/hia_mt18005/raw_data/20220629_RAMACIOTTI_DES10730/DES10730A20/CT_113_S20_L002_R1_001.fastq.gz
CT140,/work/hia_mt18005/raw_data/20220629_RAMACIOTTI_DES10730/DES10730A47/CT_140_S47_L001_R1_001.fastq.gz
CT140,/work/hia_mt18005/raw_data/20220629_RAMACIOTTI_DES10730/DES10730A47/CT_140_S47_L002_R1_001.fastq.gz
MT515,/work/hia_mt18005/raw_data/20220915_RAMACIOTTI_ELL11002_LEL11109/LEL11109/LEL11109A4/Cn1_S19_L001_R1_001.fastq.gz
MT515,/work/hia_mt18005/raw_data/20220915_RAMACIOTTI_ELL11002_LEL11109/LEL11109/LEL11109A4/Cn1_S19_L002_R1_001.fastq.gz
MT005,/work/hia_mt18005/raw_data/20210618_RAMACIOTTI_ELL9278/ELL9278/ELL9278A04/MT005_S4_L001_R1_001.fastq.gz
MT005,/work/hia_mt18005/raw_data/20210618_RAMACIOTTI_ELL9278/ELL9278/ELL9278A04/MT005_S4_L002_R1_001.fastq.gz
2223PEQ041,/work/hia_mt18005/raw_data/20221018_RAMACIOTTI_LEL11294/LEL11294/LEL11294A15/2223PEQ041_S15_L001_R1_001.fastq.gz
2223PEQ041,/work/hia_mt18005/raw_data/20221018_RAMACIOTTI_LEL11294/LEL11294/LEL11294A15/2223PEQ041_S15_L002_R1_001.fastq.gz
MT447,/work/hia_mt18005/raw_data/20220218_RAMACIOTTI_LEL10024/MT447_S40_L001_R1_001.fastq.gz
MT447,/work/hia_mt18005/raw_data/20220218_RAMACIOTTI_LEL10024/MT447_S40_L002_R1_001.fastq.gz
MT449,/work/hia_mt18005/raw_data/20220218_RAMACIOTTI_LEL10024/MT449_S33_L001_R1_001.fastq.gz
MT449,/work/hia_mt18005/raw_data/20220218_RAMACIOTTI_LEL10024/MT449_S33_L002_R1_001.fastq.gz
2223PEQ012,/work/hia_mt18005/raw_data/20221018_RAMACIOTTI_LEL11291/LEL11291/LEL11291A12/2223PEQ012_S12_L001_R1_001.fastq.gz
2223PEQ012,/work/hia_mt18005/raw_data/20221018_RAMACIOTTI_LEL11291/LEL11291/LEL11291A12/2223PEQ012_S12_L002_R1_001.fastq.gz
Submit your job using the qsub command:
qsub VirReport_nextflow.sh
You can monitor your jobs using the command:
qstat -u $USER
Alternatively use the following command to check on the jobs you are running:
qjobs
You can also check the .nextflow.log file for details on progress.
Finally, if you have configured the connection to NFTower, you can log on and check your run.
5. Outputs
5A. Nextflow folder structure:
-Work folder where the intermediate results are kept so you can easily resume execution from the last successfully executed step. This can be deleted once analysis is finalised
-Results folder where all the generated data files to be kept are saved. The pipeline will populate outputs under separate folders for each step. These will be stored in subfolders for each sample.
5B. VirReport results folder structure
Under the Results folder, the folders are structured as follows:
results/
├── 00_quality_filtering
│   ├── sample_name
│   │   ├── sample_name_18-25nt_cutadapt.log
│   │   ├── sample_name_fastqc.html
│   │   ├── sample_name_fastqc.zip
│   │   ├── sample_name_21-22nt_cutadapt.log
│   │   ├── sample_name_21-22nt.fastq.gz
│   │   ├── sample_name_24nt_cutadapt.log
│   │   ├── sample_name_blacklist_filter.log
│   │   ├── sample_name_fastp.html
│   │   ├── sample_name_fastp.json
│   │   ├── sample_name_qual_filtering_cutadapt.log
│   │   ├── sample_name_quality_trimmed_fastqc.html
│   │   ├── sample_name_quality_trimmed_fastqc.zip
│   │   ├── sample_name_quality_trimmed.fastq.gz
│   │   ├── sample_name_read_length_dist.pdf
│   │   ├── sample_name_read_length_dist.txt
│   │   ├── sample_name_truseq_adapter_cutadapt.log
│   │   └── sample_name_umi_tools.log
│   └── qc_report
│       ├── read_origin_counts.txt
│       ├── read_origin_detailed_pc.txt
│       ├── read_origin_pc_summary.txt
│       ├── run_qc_report.txt
│       └── run_read_size_distribution.pdf
├── 01_VirReport
│   ├── sample_name
│   │   ├── alignments
│   │   │   ├── NT
│   │   │   │   ├── sample_name_21-22nt_all_targets_with_scores.txt
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_bowtie_log.txt
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.consensus.fasta
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.dedup.bam
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.dedup.bam.bai
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.fa
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name.fa.fai
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm.bcf
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm.bcf.csi
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm_flt_indels.bcf
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_norm_flt_indels.bcf.csi
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_picard_metrics.txt
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_sequence_variants.vcf.gz
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_sequence_variants.vcf.gz.csi
│   │   │   │   ├── sample_name_21-22nt_GenBankID_virus_name_umi_tools.log
│   │   │   │   ├── sample_name_21-22nt_top_scoring_targets.txt
│   │   │   │   └── sample_name_21-22nt_top_scoring_targets_with_cov_stats.txt
│   │   │   └── viral_db
│   │   │       ├── sample_name_21-22nt_all_targets_with_scores.txt
│   │   │       ├── sample_name_21-22nt_GenBankID_virus_name_bowtie_log.txt
│   │   │       ├── sample_name_21-22nt_GenBankID_virus_name.consensus.fasta
│   │   │       ├── sample_name_21-22nt_GenBankID_virus_name.dedup.bam
│   │   │       ├── sample_name_21-22nt_GenBankID_virus_name.dedup.bam.bai
│   │   │       ├── sample_name_21-22nt_GenBankID_virus_name.fa
│   │   │       ├── sample_name_21-22nt_GenBankID_virus_name.fa.fai
│   │   │       ├── sample_name_21-22nt_GenBankID_virus_name_norm.bcf
│   │   │       ├── sample_name_21-22nt_GenBankID_virus_name_norm.bcf.csi
│   │   │       ├── sample_name_21-22nt_GenBankID_virus_name_norm_flt_indels.bcf
│   │   │       ├── sample_name_21-22nt_GenBankID_virus_name_norm_flt_indels.bcf.csi
│   │   │       ├── sample_name_21-22nt_GenBankID_virus_name_picard_metrics.txt
│   │   │       ├── sample_name_21-22nt_GenBankID_virus_name_sequence_variants.vcf.gz
│   │   │       ├── sample_name_21-22nt_GenBankID_virus_name_sequence_variants.vcf.gz.csi
│   │   │       ├── sample_name_21-22nt_GenBankID_virus_name_umi_tools.log
│   │   │       └── sample_name_21-22nt_top_scoring_targets_with_cov_stats_viraldb.txt
│   │   ├── assembly
│   │   │   ├── sample_name_cap3_21-22nt.fasta
│   │   │   ├── sample_name_spades_assembly_21-22nt.fasta
│   │   │   ├── sample_name_spades_log
│   │   │   ├── sample_name_velvet_assembly_21-22nt.fasta
│   │   │   └── sample_name_velvet_log
│   │   ├── blastn
│   │   │   ├── NT
│   │   │   │   ├── sample_name_cap3_21-22nt_blastn_vs_NT.bls
│   │   │   │   ├── sample_name_cap3_21-22nt_blastn_vs_NT_top5Hits.txt
│   │   │   │   ├── sample_name_cap3_21-22nt_blastn_vs_NT_top5Hits_virus_viroids_final.txt
│   │   │   │   ├── sample_name_cap3_21-22nt_blastn_vs_NT_top5Hits_virus_viroids_seq_ids_taxonomy.txt
│   │   │   │   └── summary_sample_name_cap3_21-22nt_blastn_vs_NT_top5Hits_virus_viroids_final.txt
│   │   │   └── viral_db
│   │   │       ├── sample_name_cap3_21-22nt_blastn_vs_viral_db.bls
│   │   │       ├── sample_name_cap3_21-22nt_megablast_vs_viral_db.bls
│   │   │       ├── summary_sample_name_cap3_21-22nt_blastn_vs_viral_db.bls_filtered.txt
│   │   │       ├── summary_sample_name_cap3_21-22nt_blastn_vs_viral_db.bls_viruses_viroids_ICTV.txt
│   │   │       ├── summary_sample_name_cap3_21-22nt_megablast_vs_viral_db.bls_filtered.txt
│   │   │       └── summary_sample_name_cap3_21-22nt_megablast_vs_viral_db.bls_viruses_viroids_ICTV.txt
│   │   ├── blastx
│   │   │   └── NT
│   │   │       ├── sample_name_cap3_21-22nt_blastx_vs_NT.bls
│   │   │       ├── sample_name_cap3_21-22nt_blastx_vs_NT_top5Hits.txt
│   │   │       ├── sample_name_cap3_21-22nt_blastx_vs_NT_top5Hits_virus_viroids_final.txt
│   │   │       └── summary_sample_name_cap3_21-22nt_blastx_vs_NT_top5Hits_virus_viroids_final.txt
│   │   └── tblastn
│   │       └── viral_db
│   │           ├── sample_name_cap3_21-22nt_getorf.all.fasta
│   │           ├── sample_name_cap3_21-22nt_getorf.all_tblastn_vs_viral_db_out.bls
│   │           └── sample_name_cap3_21-22nt_getorf.all_tblastn_vs_viral_db_top5Hits_virus_viroids_final.txt
│   └── Summary
│       ├── run_top_scoring_targets_with_cov_stats_with_cont_flag_FPKM_0.01_21-22nt.txt
│       └── run_top_scoring_targets_with_cov_stats_with_cont_flag_FPKM_0.01_21-22nt_viral_db.txt
└── 02_VirusDetect
    ├── sample_name
    │   ├── blastn.reference.fa
    │   ├── blastn_references
    │   ├── blastx.reference.fa
    │   ├── blastx_references
    │   ├── contig_sequences.blastn.fa
    │   ├── contig_sequences.blastx.fa
    │   ├── contig_sequences.fa
    │   ├── contig_sequences.undetermined.fa
    │   ├── sample_name_21-22nt.blastn.html
    │   ├── sample_name_21-22nt.blastn.sam
    │   ├── sample_name_21-22nt.blastn_spp.txt
    │   ├── sample_name_21-22nt.blastn.summary.filtered.txt
    │   ├── sample_name_21-22nt.blastn.summary.txt
    │   ├── sample_name_21-22nt.blastn.txt
    │   ├── sample_name_21-22nt.blastx.html
    │   ├── sample_name_21-22nt.blastx.sam
    │   ├── sample_name_21-22nt.blastx.summary.txt
    │   └── sample_name_21-22nt.blastx.txt
    └── Summary
        ├── run_summary_top_scoring_targets_virusdetect_21-22nt_filtered.txt
        └── run_summary_virusdetect_21-22nt.txt
Under the 00_quality_filtering folder:
◦ a folder is created for each sample which contains zipped quality filtered fastq files, associated QC files and logs
◦ under the QC_report folder, read size distribution pdf file and read RNA source pdf file are created. The folder also includes a run_qc_report text file
01_VirReport folder content:
For each sample:
assembly: results associated with de novo assembly
blastn: megablast results (NCBI NT or viral database PVirDB)
blastx: blastx results against NR
tblastn: tblastn results against viral database PVirDB
alignments: alignment against top reference hit and associated statistic derivation
Summary
Definitions of terms used in summary report:
sacc: Accession number of best homology match recovered
av-pident: Average per cent identity of all de novo assembled contigs to the same top reference hit
Mean read depth: The mean coverage in bases to the genome/sequence of the best homology match
Dedup read count: Read counts after PCR duplicates sharing UMIs are collapsed
Dup %: Duplication rate detected using UMIs
FPKM: Fragments Per Kilobase of transcript, per Million mapped reads is a normalised unit of transcript expression. It scales by transcript length to compensate for the fact that most RNA-seq protocols will generate more sequencing reads from longer RNA molecules. The formula is: [deduplicated read count x 10^3 x 10^6]/[total quality filtered reads x genome length]
% bases 5X: The fraction of bases that attained at least 5X sequence coverage
% bases 10X: The fraction of bases that attained at least 10X sequence coverage
Contamination flag.
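As a worked illustration of the FPKM formula given in the definitions above, the short sketch below applies it to made-up numbers. The function name fpkm and the input values are assumptions for the example, and the genome length is taken to be in bases; this is not the code the pipeline itself runs.

```python
def fpkm(dedup_read_count, total_filtered_reads, genome_length):
    """FPKM = (deduplicated reads x 10^3 x 10^6) / (total filtered reads x genome length).

    genome_length is assumed to be expressed in bases.
    """
    if total_filtered_reads <= 0 or genome_length <= 0:
        raise ValueError("total_filtered_reads and genome_length must be positive")
    return (dedup_read_count * 1e3 * 1e6) / (total_filtered_reads * genome_length)


# Invented example: 1,000 deduplicated reads mapped to a 10,000 base genome
# in a library of 1,000,000 quality-filtered reads.
print(fpkm(1000, 1_000_000, 10_000))
```

In this example the numerator is 1,000 x 10^9 = 10^12 and the denominator is 10^6 x 10^4 = 10^10, which gives an FPKM of 100.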