...
Searches against local NCBI NT and NR databases:
Retain top 5 megablast hits and restrict results to virus and viroid matches. Summarise results by grouping all the de novo contigs matching to the same viral hit and deriving the cumulative blast coverage and percent ID for each viral hit (BLATNBLASTN_NT_CAP3)
Align reads to top hit, derive coverage statistics, consensus sequence and VCF matching to top blast hits (COVSTATS_NT)
Run a blastx homology search on contigs >= 90 bp long for which no match was obtained in the megablast search. Summarise the blastx results and restrict to virus and viroid matches (BLASTX)
A quality filtering step on raw fastq files (currently the workflow only processes samples prepared using the QIAGEN QIAseq miRNA library kit). After performing quality filtering (FASTQC_RAW, ADAPTER_AND_QUAL_TRIMMING, QC_POST_QUAL_TRIMMING, DERIVE_USABLE_READS), the pipeline will also derive a QC report (QCREPORT). An RNA source profile can be included as part of this step (RNA_SOURCE_PROFILE, RNA_SOURCE_PROFILE_REPORT)
VirusDetect version 1.8 can also be run in parallel. A summary of the top virus/viroid blastn hits will be separately output (VIRUS_DETECT, VIRUS_IDENTIFY, VIRUS_DETECT_BLASTN_SUMMARY, VIRUS_DETECT_BLASTN_SUMMARY_FILTERED)
...
If you also want to run homology searches against public NCBI databases, you need to set the parameter virreport_ncbi to true in the nextflow.config file:
Code Block
params { virreport_ncbi = true }
or add it in your nextflow command:
Code Block nextflow run eresearchqut/VirReport -profile {docker, singularity or conda} --virreport_ncbi
Download these locally, following the detailed steps available at https://www.ncbi.nlm.nih.gov/books/NBK569850/ . Create a folder where you will store your NCBI databases. It is good practice to include the date of download. For instance:
Code Block mkdir -p blastDB/30112021
You will need to use the update_blastdb.pl script from the blast+ version used with the pipeline.
For example:
Code Block
perl update_blastdb.pl --decompress nt [*]
perl update_blastdb.pl --decompress nr [*]
perl update_blastdb.pl taxdb
tar -xzf taxdb.tar.gz
Make sure the taxdb.btd and the taxdb.bti files are present in the same directory as your blast databases.
Specify the path of your local NCBI blast nt and nr directories in the nextflow.config file.
For instance:
Code Block
params { blast_db_dir = '/work/hia_mt18005_db/blastDB/20220408' }
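Before launching the pipeline, it can help to confirm that the taxonomy files sit next to the BLAST databases. The sketch below is one possible check, assuming a POSIX shell; `check_taxdb` is a hypothetical helper name and is not part of VirReport.

```shell
# Hypothetical helper (not part of VirReport): verify that taxdb.btd and
# taxdb.bti are present in the directory given as blast_db_dir.
check_taxdb() {
  db_dir="$1"
  missing=0
  for f in taxdb.btd taxdb.bti; do
    if [ ! -f "$db_dir/$f" ]; then
      echo "missing: $db_dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call (path taken from the config example above):
# check_taxdb /work/hia_mt18005_db/blastDB/20220408 && echo "taxdb files found"
```

The function returns a non-zero status when either file is absent, so it can be used as a guard in a wrapper script.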
...
Fastq files
You can provide either raw or pre-filtered fastq files to the pipeline.
By default the pipeline expects a single quality-filtered fastq file per sample.
If you want to provide raw fastq files, samples have to be specifically prepared with the QIAGEN QIAseq miRNA library kit.
...
If you want to run the initial quality filtering step on your raw fastq files, you will need to set the --qualityfilter parameter to true in the nextflow.config file and specify the path to the directory that holds the required bowtie indices (using the --bowtie_db_dir parameter). These indices are used to: 1) filter non-informative reads (using the blacklist bowtie indices for the DERIVE_USABLE_READS process) and 2) optionally derive the origin of the filtered reads obtained (RNA_SOURCE_PROFILE process). The required fasta files are available at https://github.com/maelyg/bowtie_indices.git and bowtie indices can be built from these using the commands:
Code Block
git clone https://github.com/maelyg/bowtie_indices.git
cd bowtie_indices
gunzip blacklist_v2.fasta.gz
# You might need to activate the environment cached by conda or singularity in order to run bowtie,
# for example: conda activate /path_to_cached_environment/virreport-77d02f3abe1d8ba5f8dfdff194142de9
# Then run the bowtie command
bowtie-build -f blacklist_v2.fasta blacklist
The location of the directory in which the bowtie indices are located will need to be specified in the nextflow.config file:
Code Block params { bowtie_db_dir = '/path_to_bowtie_idx_directory' }
If you want to derive an RNA source profile of your fastq files, you will need to specify:
Code Block params { rna_source_profile = true }
You will also need to build the other indices from the fasta files included in https://github.com/maelyg/bowtie_indices.git (i.e. rRNA, plant_tRNA, plant_noncoding, plant_pt_mt_other_genes, artefacts, plant_miRNA, virus).
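Once the indices are built, a quick check of the index directory can catch a missing or misnamed index before the pipeline starts. The sketch below assumes that each index basename matches the name used in the text (blacklist, rRNA, and so on), which may not hold for your setup; both function names are hypothetical and not part of VirReport.

```shell
# Hypothetical helper (not part of VirReport): report whether a bowtie index
# with the given basename exists in a directory. bowtie-build writes files
# such as <name>.1.ebwt (or <name>.1.ebwtl for large indices).
has_bowtie_index() {
  idx_dir="$1"
  idx_name="$2"
  [ -f "$idx_dir/$idx_name.1.ebwt" ] || [ -f "$idx_dir/$idx_name.1.ebwtl" ]
}

# Print the name of every expected index that is missing from a directory.
# ASSUMPTION: index basenames match the names listed in the text.
missing_bowtie_indices() {
  search_dir="$1"
  for name in blacklist rRNA plant_tRNA plant_noncoding \
              plant_pt_mt_other_genes artefacts plant_miRNA virus; do
    has_bowtie_index "$search_dir" "$name" || echo "$name"
  done
}

# Example call:
# missing_bowtie_indices /path_to_bowtie_idx_directory
```

An empty output means every expected index was found. If you only run the quality filtering without the RNA source profile, only the blacklist index is relevant.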
The quality filtering step will create the 00_quality_filtering folder under the results folder:
Code Block
results/
└── 00_quality_filtering
    ├── sample_name
    │   ├── sample_name_18-25nt_cutadapt.log
    │   ├── sample_name_fastqc.html
    │   ├── sample_name_fastqc.zip
    │   ├── sample_name_21-22nt_cutadapt.log
    │   ├── sample_name_21-22nt.fastq.gz
    │   ├── sample_name_24nt_cutadapt.log
    │   ├── sample_name_blacklist_filter.log
    │   ├── sample_name_fastp.html
    │   ├── sample_name_fastp.json
    │   ├── sample_name_qual_filtering_cutadapt.log
    │   ├── sample_name_quality_trimmed_fastqc.html
    │   ├── sample_name_quality_trimmed_fastqc.zip
    │   ├── sample_name_quality_trimmed.fastq.gz
    │   ├── sample_name_read_length_dist.pdf
    │   ├── sample_name_read_length_dist.txt
    │   ├── sample_name_truseq_adapter_cutadapt.log
    │   └── sample_name_umi_tools.log
    └── qc_report
        ├── read_origin_counts.txt
        ├── read_origin_detailed_pc.txt
        ├── read_origin_pc_summary.txt
        ├── run_qc_report.txt
        └── run_read_size_distribution.pdf
If your sequencing run was split across multiple lanes, you might have several raw fastq files per sample. You can feed these directly to the pipeline and specify the --merge-lane parameter. The fastq files will be collapsed into one fastq file before downstream analysis is performed. The sample name used will be the sampleid provided in the index.csv file. In the example below, 2 fastq files were generated for 1 sample named CT103:
Code Block
sampleid,samplepath
CT103,/path_to_fastq_files_directory/CT_103_S10_L001_R1_001.fastq.gz
CT103,/path_to_fastq_files_directory/CT_103_S10_L002_R1_001.fastq.gz
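To picture what collapsing the lanes means, the sketch below groups the index.csv rows by sampleid and concatenates the listed files. It illustrates the idea only; the pipeline's own merging step may be implemented differently, and both function names are hypothetical.

```shell
# Hypothetical helpers (not part of VirReport) illustrating lane merging.

# Print the samplepath entries listed for one sampleid in index.csv
# (the first row is treated as the header).
lane_files_for_sample() {
  awk -F',' -v s="$2" 'NR > 1 && $1 == s { print $2 }' "$1"
}

# Concatenate all lane files of a sample into one output file.
# Gzip-compressed files can be joined with cat and remain valid gzip.
merge_lanes() {
  index_csv="$1"
  sample="$2"
  out="$3"
  : > "$out"
  lane_files_for_sample "$index_csv" "$sample" | while IFS= read -r f; do
    cat "$f" >> "$out"
  done
}

# Example call (output file name is an assumption):
# merge_lanes index.csv CT103 CT103_merged.fastq.gz
```

Files are appended in the order in which they appear in index.csv, so the L001 reads precede the L002 reads in the example above.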
Deduplicate reads using unique molecular identifiers and mapping coordinates
If you used the QIAGEN QIAseq miRNA library kit for cDNA library preparation, your reads will include unique molecular identifiers (UMIs) at their 5' end. These are automatically extracted from reads and incorporated at the end of the read name when the pipeline is run using the --qualityfilter parameter.
If you want to deduplicate reads from the BAM files that are derived at the alignment step, you will need to specify the --dedup parameter. UMI extraction and deduplication are performed with the tool UMI-tools.
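To see how much UMI diversity a sample contains after extraction, the read names can be inspected directly. The sketch below assumes that the UMI is the text after the last underscore of the read identifier, which is how UMI-tools writes it by default; the function name and the input file name in the example call are hypothetical.

```shell
# Hypothetical helper (not part of VirReport): count distinct UMIs in an
# uncompressed fastq stream read from standard input.
# ASSUMPTION: the UMI is the text after the last underscore of the read
# identifier (e.g. @read1_ACGTACGTACGT).
count_unique_umis() {
  awk 'NR % 4 == 1 {
         id = $1                      # read identifier without description
         n = split(id, parts, "_")    # UMI is the last underscore field
         if (n > 1 && !(parts[n] in seen)) { seen[parts[n]] = 1; total++ }
       }
       END { print total + 0 }'
}

# Example call (file name is an assumption):
# gzip -dc reads_with_umis.fastq.gz | count_unique_umis
```

Read identifiers without an underscore are ignored, so the count stays at 0 for files in which no UMI was appended.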
Summarising detections and flagging potential contamination events
If you want to derive a summary of detections for all the samples included in the index file, specify the --contamination_detection_viral_db or the --contamination_detection_ncbi option. This will create a summary text file under the Summary tab with a column called contamination_flag.
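For a quick overview of the flags, the contamination_flag column can be tallied from the summary file. The sketch below assumes that the file is tab-delimited with a header row; the values the column can take are not specified here, so the function simply counts whatever it finds. The function name and the file name in the example call are hypothetical.

```shell
# Hypothetical helper (not part of VirReport): tally the values found in the
# contamination_flag column of a summary file.
# ASSUMPTION: the file is tab-delimited and its first row is a header.
tally_contamination_flags() {
  awk -F'\t' '
    NR == 1 {
      # Locate the column by its header name
      for (i = 1; i <= NF; i++) if ($i == "contamination_flag") col = i
      next
    }
    col { count[$col]++ }
    END { for (v in count) print v "\t" count[v] }
  ' "$1" | sort
}

# Example call (file name is an assumption):
# tally_contamination_flags summary.txt
```

If no column named contamination_flag is present, the function prints nothing.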