...
If you want to derive a summary of detections for all the samples included in the index file, specify the --contamination_detection_viral_db or the --contamination_detection_ncbi option. This will create a summary text file under the Summary tab with a column called contamination_flag.
...
The contamination flag rests on the assumption that if a pest is present at high titer in one sample, and reads matching this pathogen are detected in other samples at a significantly lower abundance, the lower signal may be due to contamination (e.g. index hopping from the high-titer sample). We first calculate the maximum FPKM value recorded for each virus and viroid identified on a run. If, for a given virus, the FPKM value reported for a sample is less than a set percentage of this maximum FPKM value, the detection is flagged as a contamination event. A threshold of 0.1% is applied by default. The flag is only indicative: the method cannot discriminate between false positives and viruses present at very low titer in a plant. It is therefore recommended to compare the sequences obtained, check the SNPs and validate the result with an independent method.
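The flagging rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline's own code: the function name `flag_contamination`, the (sample, virus, FPKM) record layout, the boolean flag and the sample values are all assumptions made for the example. Only the 0.1% default threshold and the "percentage of the run-wide maximum FPKM" rule come from the text.

```python
from collections import defaultdict


def flag_contamination(detections, threshold_pct=0.1):
    """Flag detections whose FPKM is below threshold_pct % of the
    run-wide maximum FPKM recorded for the same virus.

    detections: list of (sample, virus, fpkm) tuples (hypothetical layout).
    Returns a list of (sample, virus, fpkm, flagged) tuples in input order.
    """
    # Step 1: maximum FPKM recorded for each virus or viroid on the run.
    max_fpkm = defaultdict(float)
    for _sample, virus, fpkm in detections:
        max_fpkm[virus] = max(max_fpkm[virus], fpkm)

    # Step 2: flag a detection when its FPKM is less than the chosen
    # percentage of that maximum.
    flagged = []
    for sample, virus, fpkm in detections:
        cutoff = max_fpkm[virus] * threshold_pct / 100.0
        flagged.append((sample, virus, fpkm, fpkm < cutoff))
    return flagged


# Made-up values for illustration only.
run = [
    ("S1", "VirusA", 50000.0),
    ("S2", "VirusA", 20.0),
    ("S3", "VirusA", 900.0),
    ("S2", "VirusB", 5.0),
]
results = flag_contamination(run)
for row in results:
    print(row)
```

In this made-up run, only the S2 detection of VirusA falls below 0.1% of the maximum FPKM for that virus, so it is the only one flagged. A virus seen in a single sample is never flagged, because its own FPKM is the maximum.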
Running in diagnostic mode (SSG team internal use only)
If you want to run VirReport in diagnostic mode (--diagno), the pipeline will also add an evidence category (i.e. KNOWN, KNOWN_FRAGMENT or PUTATIVE_NOVEL) to each detection based on av-pident and % bases 10X.
If you are running homology searches against the NCBI NT database, you will also need to provide a list of pests of interest in the Targetted_Virus_Viroid.txt file located in the bin folder. Detections that match this pest list will be categorised as Quarantinable, as opposed to Higher_plant_viruses, in the final summary.
Finally, if sample information is provided (--sampleinfo --sampleinfo_path /path/to/sampleinfo.txt), this will be added to the final summary.
Sampleinfo.txt file example:
Code Block |
---|
Sample PEQ_index_number LIMS_ID_RAMACIOTTI Host_species Host_common_name Plant_tissue_collected
MT498 P30 ELL110002A1 Allium sativum Garlic 50 mg leaf |
Running VirusDetect
VirusDetect version 1.8 can also be run in parallel.
See http://virusdetect.feilab.net/cgi-bin/virusdetect/index.cgi for details about this separate pipeline.
Example of PBS script to run on an HPC with torque batch system
Make sure to either specify the full path to your index.csv file in the PBS script or place a copy of the index.csv file in the folder you will run the PBS script in.
Example 1:
The PBS script example below (VirReport_nextflow.sh) will run on raw fastq files that will need to be merged and then quality filtered.
We are also asking to run a process that will derive an RNA source profile for each sample during the quality filtering step.
Blastn (using the megablast algorithm) and tblastn homology searches will be run against the PVirDB.
Finally we will want the reads to be de-duplicated after mapping.
Code Block |
---|
#!/bin/bash -l
#PBS -N VirReport
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=05:00:00
cd $PBS_O_WORKDIR
module load java
export NXF_OPTS='-Xms1g -Xmx4g'
nextflow run eresearchqut/VirReport -profile singularity -resume --indexfile index.csv \
--merge_lane --qualityfilter --rna_source_profile \
--bowtie_db_dir /path_to_bowtie_indices \
--dedup \
--virreport_viral_db --blast_viral_db_path /path_to_local_viral_database --contamination_detection_viral_db |
Example 2:
In the PBS job below, blastn (using the megablast algorithm) and blastx homology searches will be run against NCBI NT and NR respectively.
Code Block |
---|
#!/bin/bash -l
#PBS -N VirReport
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=05:00:00
cd $PBS_O_WORKDIR
module load java
export NXF_OPTS='-Xms1g -Xmx4g'
nextflow run eresearchqut/VirReport -profile singularity -resume --indexfile index.csv \
--merge_lane --qualityfilter --rna_source_profile \
--bowtie_db_dir /path_to_bowtie_indices \
--dedup \
--virreport_ncbi --blast_viral_db_path /path_to_ncbi_databases --contamination_detection |
Example 3:
In the PBS job below, homology searches will be run against NCBI and the PVirDB. The pipeline will also run VirusDetect in parallel.
Code Block |
---|
#!/bin/bash -l
#PBS -N VirReport
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=05:00:00
cd $PBS_O_WORKDIR
module load java
export NXF_OPTS='-Xms1g -Xmx4g'
nextflow run eresearchqut/VirReport -profile singularity -resume --indexfile index.csv \
--merge_lane --qualityfilter --rna_source_profile \
--bowtie_db_dir /path_to_bowtie_indices \
--dedup \
--virreport_ncbi --blast_viral_db_path /path_to_ncbi_databases --contamination_detection \
--virreport_viral_db --blast_viral_db_path /path_to_local_viral_database --contamination_detection_viral_db \
--virusdetect --virusdetect_db_path /path_to_virusdetect_database |
Submit your job using the qsub command:
...
Alternatively, use the following command to check on the jobs you are running:
qjobs
You can also check the .nextflow.log file for details on progress.
...
◦ under the QC_report folder, a read size distribution PDF file and a read RNA source PDF file are created. The folder also includes a run_qc_report text file.
...
01_VirReport folder content:
For each sample:
assembly: results associated with de novo assembly
blastn: megablast results (NCBI NT or viral database PVirDB)
blastx: blastx results against NR
tblastn: tblastn results against viral database PVirDB
alignments: alignment against top reference hit and associated statistic derivation
Summary
...
Definitions of terms used in summary report:
sacc Accession number of best homology match recovered
av-pident Average per cent identity of all de novo assembled contigs to the same top reference hit
Mean read depth The mean coverage in bases to the genome/sequence of the best homology match
Dedup read count Read counts after PCR duplicates sharing UMIs are collapsed
Dup % Duplication rate detected using UMIs
FPKM Fragments Per Kilobase of transcript per Million mapped reads, a normalised unit of transcript expression. It scales by transcript length to compensate for the fact that most RNA-seq protocols generate more sequencing reads from longer RNA molecules. The formula is: [deduplicated read count x 10^3 x 10^6] / [total quality filtered reads x genome length]
% bases 5X The fraction of bases that attained at least 5X sequence coverage
% bases 10X The fraction of bases that attained at least 10X sequence coverage
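As a worked example of the FPKM formula given in the definitions above, the short Python sketch below evaluates it for one detection. The function name `fpkm`, its parameter names and the input numbers are assumptions chosen for illustration. The genome length is assumed to be expressed in bases, which is what the 10^3 (per kilobase) factor implies.

```python
def fpkm(dedup_read_count, total_filtered_reads, genome_length):
    """FPKM = (dedup reads x 10^3 x 10^6) /
              (total quality filtered reads x genome length in bases)."""
    if total_filtered_reads <= 0 or genome_length <= 0:
        raise ValueError("total_filtered_reads and genome_length must be positive")
    return (dedup_read_count * 1e3 * 1e6) / (total_filtered_reads * genome_length)


# Made-up numbers: 1,200 deduplicated reads mapped to a 6,000-base
# reference, in a library of 2,000,000 quality filtered reads.
example = fpkm(1200, 2_000_000, 6000)
print(example)  # 100.0
```

Doubling the reference length while keeping the read counts fixed halves the value, which is the length normalisation the definition refers to.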
Contamination flag.