Aims:
Implement an end-to-end bioinformatics workflow that is reproducible, robust, scalable and compute infrastructure agnostic
Leverage the host plant antiviral response pathway to increase the sensitivity and specificity of pathogen detection
Prevent or minimise the reporting of cross-sample contamination caused by index hopping events (false positive detections)
Pre-requisites
Installed conda3 or miniconda3 ( https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html )
Basic unix command line knowledge (example: https://researchcomputing.princeton.edu/education/external-online-resources/linux ; https://swcarpentry.github.io/shell-novice/ )
Familiarity with a Unix text editor (for example, Vi/Vim or Nano)
Have an HPC account on QUT’s lyra. Apply for a new HPC account here.
Install nextflow: Nextflow
Database
A custom virus database is provided; please do not distribute it to third parties. Location:
/work/img/databases/
Creating a local blast database
makeblastdb -in test.fasta -parse_seqids -dbtype nucl
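Because the command above uses -parse_seqids, sequence identifiers in the FASTA file are expected to be unique, and to our knowledge makeblastdb stops with an error when it finds duplicates. The following is a minimal sketch of a pre-check, not part of the pipelines themselves; the function name fasta_duplicate_ids is hypothetical, and the identifier is assumed to be the text between '>' and the first whitespace.

```shell
#!/bin/bash
# Hypothetical helper: list sequence identifiers that occur more than once
# in a FASTA file. Assumption: the identifier is the first whitespace-delimited
# token of each header line, without the leading '>'.
fasta_duplicate_ids() {
    local fasta="$1"
    # Print each duplicated identifier once, when its second occurrence is seen.
    awk '/^>/ { id = substr($1, 2); if (++seen[id] == 2) print id }' "$fasta"
}
```

Running fasta_duplicate_ids test.fasta prints nothing when all identifiers are unique; any identifier it prints should be renamed or removed before building the database.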
Method
We will use two Nextflow pipelines to process the virome data. First, we run trimgalore to filter out poor-quality reads and bases and to remove adapter sequences. Then we run VirReport to assess the presence of viruses and viroids.
1) Quality Control of Raw Files
First generate an ‘index.csv’ file that contains the Sample ID and path to the raw data file:
sampleId,read1
CB,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/CB_H52LJDRX2_TCATGCGT_L001_R1.fastq.gz
CM,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/CM_H52LJDRX2_CTGCATCA_L001_R1.fastq.gz
CP,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/CP_H52LJDRX2_TCAGACTT_L001_R1.fastq.gz
TB1,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/TB1_H52LJDRX2_TCACTACG_L001_R1.fastq.gz
TBG,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/TBG_H52LJDRX2_CTTCACGA_L001_R1.fastq.gz
TM,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/TM_H52LJDRX2_CGTTCTGC_L001_R1.fastq.gz
TP,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/TP_H52LJDRX2_AAGTTATC_L001_R1.fastq.gz
TPS,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/TPS_H52LJDRX2_CTTCTTAA_L001_R1.fastq.gz
TR1,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/TR1_H52LJDRX2_TCAGTGAG_L001_R1.fastq.gz
TR2,/work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2/TR2_H52LJDRX2_TGACCGCG_L001_R1.fastq.gz
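Writing the index by hand is error-prone when a run contains many samples. As a rough sketch, the index could be generated from the raw data directory. The function name make_trim_index is hypothetical, and the sketch assumes that read files end in _R1.fastq.gz and that the sample ID is the part of the file name before the first underscore, as in the example above.

```shell
#!/bin/bash
# Hypothetical helper: print a trimgalore index (sampleId,read1) for every
# *_R1.fastq.gz file in a directory.
# Assumption: the sample ID is the part of the file name before the first '_'.
make_trim_index() {
    local raw_dir="$1" f name
    echo "sampleId,read1"
    for f in "$raw_dir"/*_R1.fastq.gz; do
        [ -e "$f" ] || continue      # skip the literal pattern when nothing matches
        name=$(basename "$f")
        echo "${name%%_*},$f"
    done
}
```

For example, make_trim_index /work/img/raw_data/AGRF_CAGRF22029755_H52LJDRX2 > index.csv would write one row per R1 file. Check the result before use, since file names that do not follow the assumed pattern will produce wrong sample IDs.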
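Create a PBS Pro submission script (saved here as launch.pbs):
#!/bin/bash -l
#PBS -N nftrimgalore
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=1:mem=5gb
# start in the directory the job was submitted from
cd $PBS_O_WORKDIR
# limit the memory used by the Nextflow Java process; exported so that nextflow can read it
export NXF_OPTS='-Xms1g -Xmx4g'
module load java
# run the Nextflow pipeline
nextflow run trimgalore --indexfile index.csv --singleEnd --trim_qual 30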
Submit the job to the HPC scheduler:
qsub launch.pbs
Check progress of the job:
qjobs
qstat -u USERNAME
2) Diagnosis of plant viruses and viroids
Installing VirReport
The open-source VirReport code is available at https://github.com/eresearchqut/VirReport
On the HPC, run the following command to get a copy of the source code:
git clone https://github.com/eresearchqut/VirReport.git
Alternatively, run the following command to fetch and also test VirReport:
nextflow run eresearchqut/VirReport -profile singularity --indexfile index_example.csv
Note: the above command will store a cached copy of VirReport at '$HOME/.nextflow/NXF_SINGULARITY_CACHEDIR'
Running VirReport
1. Sample index file
To run VirReport, create an 'index_samples.csv' file that specifies the sample ID, the path to the raw data, and the minimum and maximum lengths of the reads to be used for diagnosis. For example:
sampleid,samplepath,minlen,maxlen
MT212,/work/hia_mt18005/diagnostics/2021/14_RAMACIOTTI_LEL9742-LEL9751/results/06_usable_reads/MT212_21-22bp.fastq,21,22
MT213,/work/hia_mt18005/diagnostics/2021/14_RAMACIOTTI_LEL9742-LEL9751/results/06_usable_reads/MT213_21-22bp.fastq,21,22
You can modify the above template with your own samples. The sample paths can point to the files processed by trimgalore, for example:
sampleid,samplepath,minlen,maxlen
CB,/work/img/test/trimgalore/results/Trim_Galore/CB_trimmed.fq.gz,21,22
CM,/work/img/test/trimgalore/results/Trim_Galore/CM_trimmed.fq.gz,21,22
CP,/work/img/test/trimgalore/results/Trim_Galore/CP_trimmed.fq.gz,21,22
TB1,/work/img/test/trimgalore/results/Trim_Galore/TB1_trimmed.fq.gz,21,22
TBG,/work/img/test/trimgalore/results/Trim_Galore/TBG_trimmed.fq.gz,21,22
TM,/work/img/test/trimgalore/results/Trim_Galore/TM_trimmed.fq.gz,21,22
TPS,/work/img/test/trimgalore/results/Trim_Galore/TPS_trimmed.fq.gz,21,22
TP,/work/img/test/trimgalore/results/Trim_Galore/TP_trimmed.fq.gz,21,22
TR1,/work/img/test/trimgalore/results/Trim_Galore/TR1_trimmed.fq.gz,21,22
TR2,/work/img/test/trimgalore/results/Trim_Galore/TR2_trimmed.fq.gz,21,22
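Since the VirReport index lists the same samples as the trimgalore index, one possible shortcut is to derive it from that file. The sketch below is illustrative only: make_virreport_index is a hypothetical helper, and it assumes the trimmed reads are named <sampleId>_trimmed.fq.gz inside a single results directory, as in the example above. The default lengths of 21 and 22 are taken from that example.

```shell
#!/bin/bash
# Hypothetical helper: convert a trimgalore index (sampleId,read1) into a
# VirReport index (sampleid,samplepath,minlen,maxlen).
# Assumption: trimmed reads are named <sampleId>_trimmed.fq.gz inside trim_dir.
make_virreport_index() {
    local trim_index="$1" trim_dir="$2" minlen="${3:-21}" maxlen="${4:-22}"
    echo "sampleid,samplepath,minlen,maxlen"
    # Skip the header line, then emit one row per sample; the extra test
    # keeps a final line that lacks a trailing newline.
    tail -n +2 "$trim_index" | while IFS=, read -r sample _ || [ -n "$sample" ]; do
        [ -n "$sample" ] || continue
        echo "$sample,$trim_dir/${sample}_trimmed.fq.gz,$minlen,$maxlen"
    done
}
```

For example, make_virreport_index index.csv /work/img/test/trimgalore/results/Trim_Galore > index_samples.csv would reproduce rows in the format shown above.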
2. Run VirReport using a PBS Pro script
Define the Nextflow configuration if it differs from the provided template:
includeConfig 'conf/base.config'

params {
  outdir = 'results'
  indexfile = 'index.csv'
  blast_db_dir = '/lustre/work-lustre/hia_mt18005/blastDB/30112021'
  blast_local_db_path = '/work/img/databases/PVirDB/PVirDB_ver20211109.fasta'
  targets = false
  targets_file = 'Targetted_Viruses_Viroids.txt'
  help = false
  cap3_len = '20'
  orf_minsize = '150'
  orf_circ_minsize = '150'
  blastn_evalue = '0.0001'
  blastp_evalue = '0.0001'
  blastn_method = 'megablast'
  blastp = false
  spades = false
  spadeskmer = '9 11 13 15 17 19 21'
  blastlocaldb = false
  ictvinfo = 'ICTV_taxonomy_MinIdentity_Species.tsv'
  contamination_detection = false
  contamination_flag = '0.01'
  contamination_detection_method = 'FPKM'
}

process.container = "ghcr.io/eresearchqut/virreport:v1.0.0"

manifest {
  name = "eresearchqut/VirReport"
  author = "Roberto Barrero, Maely Gauthier, Desmond Schmidt, Craig Windell"
  defaultBranch = "main"
  description = "VirReport is designed to help phytosanitary diagnostics of viruses and viroid pathogens in quarantine facilities. It takes small RNA-Seq samples as input."
  version = "v1.0.0"
}
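Rather than editing the template itself, Nextflow can read additional settings from a separate file passed with its -c option. The following sketch writes such a file from the shell. The function name write_override_config and the file name custom.config are hypothetical, and the assumption that setting blastlocaldb to true together with blast_local_db_path makes VirReport search the local database is inferred from the parameter names in the template, so check it against the VirReport documentation.

```shell
#!/bin/bash
# Hypothetical helper: write a small Nextflow config that overrides two
# parameters from the template.
# Assumption: blastlocaldb = true enables searches against blast_local_db_path.
write_override_config() {
    local out="$1" db_path="$2"
    cat > "$out" <<EOF
params {
  blastlocaldb = true
  blast_local_db_path = '$db_path'
}
EOF
}
```

For example, after write_override_config custom.config /work/img/databases/PVirDB/PVirDB_ver20211109.fasta, the file can be supplied as: nextflow run eresearchqut/VirReport -profile singularity -c custom.config --indexfile index.csv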
Prepare a PBS Pro submission script:
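#!/bin/bash -l
#PBS -N nftrimgalore
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=1:mem=5gb
# start in the directory the job was submitted from
cd $PBS_O_WORKDIR
# limit the memory used by the Nextflow Java process; exported so that nextflow can read it
export NXF_OPTS='-Xms1g -Xmx4g'
module load java
# run the Nextflow pipeline
nextflow run eresearchqut/VirReport -profile singularity --indexfile index.csv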
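A job that waits in the queue only to fail on a mistyped path wastes time, so it may be worth checking the index file before submitting. The sketch below is one possible check and not part of VirReport; check_virreport_index is a hypothetical helper that assumes the four-column layout (sampleid,samplepath,minlen,maxlen) described earlier.

```shell
#!/bin/bash
# Hypothetical helper: report index rows whose data file is missing or whose
# length range is invalid. Returns non-zero if any problem is found.
check_virreport_index() {
    local index="$1" status=0 sample path minlen maxlen
    while IFS=, read -r sample path minlen maxlen || [ -n "$sample" ]; do
        [ "$sample" = "sampleid" ] && continue      # header row
        [ -n "$sample" ] || continue                # blank line
        if [ ! -f "$path" ]; then
            echo "missing file for $sample: $path" >&2
            status=1
        fi
        # Non-numeric values make the comparison fail and are flagged as well.
        if ! [ "$minlen" -le "$maxlen" ] 2>/dev/null; then
            echo "invalid length range for $sample: $minlen-$maxlen" >&2
            status=1
        fi
    done < "$index"
    return "$status"
}
```

A typical use would be check_virreport_index index.csv && qsub launch.pbs, so that the job is submitted only when every listed file exists.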