Installation
Install Nextflow
The VSD workflow requires Nextflow to be installed in your account on the HPC. Details on how to install and test Nextflow, prepare a nextflow.config file and run a PBS Pro submission script for Nextflow pipelines can be found here.
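If Nextflow is not yet available in your account, a typical user-level installation (following the standard Nextflow instructions; the ~/bin location is an assumption, adjust to your setup) looks like:

# download the nextflow launcher into the current directory
curl -s https://get.nextflow.io | bash
# move it somewhere on your PATH (assumed location)
mkdir -p ~/bin && mv nextflow ~/bin/
# confirm the installation works
nextflow -version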
Pull the git repo using the nextflow pull command, adding -r to select the release:

nextflow pull <repository> -r vsd-1.0
Install Anaconda3 or Miniconda3
https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
All the required tools are listed in the environment.yml file in the pipeline base directory:
name: vsd-1.0
channels:
  - bioconda
  - conda-forge
  - defaults
  - r
  - anaconda
dependencies:
  - blast=2.11.0
  - cap3=10.2011
  - spades=3.14.0
  - emboss=6.6.0
  - openjdk=8.0.152
  - fastp=0.20.1
  - biopython=1.76
  - numpy=1.16.5
  - matplotlib=2.2.3
  - velvet=1.2.10
  - bowtie=1.3.0
  - samtools=1.12
  - picard=2.25.6
  - bedtools
  - bcftools
  - pandas
  - fastqc=0.11.9
  - cutadapt=3.5
  - umi_tools=1.1.2
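Once conda is installed, the environment can be created from this file and activated before running the pipeline, for example:

# create the vsd-1.0 environment from the file above
conda env create -f environment.yml
# activate it
conda activate vsd-1.0

Depending on how your nextflow.config is set up, Nextflow may instead manage this environment for you.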
Install a local copy of the NCBI BLAST databases (NT and NR)
Find detailed information on how to download these databases at https://www.ncbi.nlm.nih.gov/books/NBK569850/
Make sure the taxdb.btd and the taxdb.bti files are also present in the directory.
Create a folder where you will store your NCBI database including the date of download. For instance:
mkdir blastDB/30112021
Run the following PBS script in the newly created folder. Use the update_blastdb.pl script from the BLAST+ version you will use with your pipeline.
#!/bin/bash -l
#PBS -N blastdb_download
#PBS -l walltime=24:00:00
#PBS -l mem=60gb
#PBS -l ncpus=2

cd $PBS_O_WORKDIR
perl update_blastdb.pl --decompress nt [*]
perl update_blastdb.pl --decompress nr [*]
perl update_blastdb.pl taxdb
tar -xzf taxdb.tar.gz
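Assuming the script above is saved as download_blastdb.pbs (the filename is hypothetical), submit it with qsub; once it finishes, a quick sanity check that the databases and taxonomy files are in place could be:

qsub download_blastdb.pbs
# after the job completes, from inside the database folder:
ls nt.* nr.* taxdb.btd taxdb.bti
# print the database title, size and build date
blastdbcmd -db nt -info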
The VSD workflow
The VSD workflow will perform the following steps by default:
Retain reads of a given length (e.g. 21-22 or 24 nt long) from the fastq file(s) provided in the index.csv file (readprocessing)
De novo assembly using kmer 15 and coverage 3 (velvet)
Collapse contigs into scaffolds (min length 20) (cap3)
Run megablast homology search against the NCBI NT database (megablast_nt_velvet)
Summarise megablast results and restrict to virus and viroid matches (BlastTools_megablast_velvet)
Derive coverage statistics, consensus sequence and VCF matching the top blast hits (filter_n_cov)
A number of additional optional steps can be run:
--blastp: Predict ORFs from the de novo assembly (derived with Velvet) and run blastp against NCBI NR (getorf, blastp, blastpdbcmd, BlastToolsp)
--contamination_detection: Run cross-sample contamination prediction (contamination_detection)
--blastlocaldb: Run blastn and megablast homology searches on the de novo assembly (derived with Velvet) against a local virus and viroid database (blast_nt_localdb_velvet, filter_blast_nt_localdb_velvet)
--blastn: Run a blastn homology search on the de novo assembly (derived with Velvet) against the NCBI NT database (blastn_nt_velvet)
--spades: Run the SPAdes 3.14 de novo assembler and perform blastn homology analysis on the derived de novo contigs (spades, cap3_spades, megablast_nt_spades, BlastToolsn_megablast_spades)
A number of additional options are included:
--targets: A text file with the taxonomy of the viruses/viroids of interest can be provided; only these will be retained in the megablast summary results derived at the filter_n_cov step.
--spadeskmer: specifies the range of kmers to use when running SPAdes
--cap3_len: specifies the minimal length of contigs to retain after CAP3 scaffolding
--blastn_evalue and --blastp_evalue: specify the evalue parameter to use during blast analyses
--orf_minsize: corresponds to the minimal open reading frame length that getorf retains
To enable these options, either include them in the nextflow run command provided in the PBS script:
nextflow run $PBS_O_WORKDIR/main.nf -resume --indexfile $PBS_O_WORKDIR/index_example.csv --blastlocaldb --spades --contamination_detection
or set the corresponding parameters to true in the nextflow.config file. For instance:
params {
  blastlocaldb = true
  spades = true
  contamination_detection = true
}
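The tuning options listed above can be passed on the command line in the same way; in this sketch the values are purely illustrative, so check the pipeline defaults before changing them:

nextflow run $PBS_O_WORKDIR/main.nf -resume --indexfile $PBS_O_WORKDIR/index_example.csv \
  --cap3_len 30 --blastn_evalue 0.0001 --orf_minsize 100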
Preparing the data
Preparing an index.csv file
You need to create a comma-separated text file that will be the input for the workflow. By default the pipeline will look for a file called “index.csv” in the base directory, but you can specify any file name using --indexfile [filename]
in the nextflow run command. This text file requires the following columns (which need to be included as a header): sampleid,samplepath,minlen,maxlen:
sampleid is the sample name that will be given to the files created by the pipeline
samplepath is the full path to the quality filtered fastq files that the pipeline requires as starting input
minlen and maxlen correspond to the read size that will be retained for downstream analyses.
An index_example.csv is included in the base directory:
sampleid,samplepath,minlen,maxlen
MT212,/work/hia_mt18005/diagnostics/2021/14_RAMACIOTTI_LEL9742-LEL9751/results/06_usable_reads/MT212_21-22bp.fastq,21,22
MT213,/work/hia_mt18005/diagnostics/2021/14_RAMACIOTTI_LEL9742-LEL9751/results/06_usable_reads/MT213_21-22bp.fastq,21,22
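Because a wrong samplepath is an easy way to break a run, a small shell check like this (assuming the file is named index.csv) can confirm every fastq exists before you submit:

# skip the header, then test the samplepath column of each row
tail -n +2 index.csv | while IFS=, read -r sampleid samplepath minlen maxlen; do
  [ -f "$samplepath" ] || echo "Missing fastq for $sampleid: $samplepath"
done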
You also need to provide the path of your NCBI blast directory in the nextflow.config file. For instance:
params {
  blast_db = '/lustre/work-lustre/hia_mt18005/blastDB/30112021'
}
If you want to run a blast analysis against a local database, you also need to specify its path in the nextflow.config file. For example:
blastn_local_db = '/work/hia_mt18005/databases/PVirDB/PVriDB_ver2021_11_09/PVirDB_ver20211109.fasta'
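If the local database fasta has not yet been formatted as a BLAST database, it can be prepared manually with makeblastdb from BLAST+ (whether the pipeline formats it for you is not documented here, so treat this as a fallback sketch):

# build a nucleotide BLAST database from the local fasta
makeblastdb -in PVirDB_ver20211109.fasta -dbtype nucl -parse_seqids -title "PVirDB_ver20211109"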
Running the pipeline
Finally, you need to create a PBS script which includes your nextflow run command. An example PBS script is included in the base directory and will run the pipeline with the default steps:
#!/bin/bash -l
#PBS -N NextflowVSD
#PBS -l select=1:ncpus=2:mem=6gb
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR
module load java
export NXF_OPTS='-Xms1g -Xmx4g'
nextflow run $PBS_O_WORKDIR/main.nf -resume --indexfile $PBS_O_WORKDIR/index_example.csv
The following PBS script will additionally run SPAdes, the cross-sample contamination predictor and blast searches against a local database:
#!/bin/bash -l
#PBS -N NextflowVSD
#PBS -l select=1:ncpus=2:mem=6gb
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR
module load java
export NXF_OPTS='-Xms1g -Xmx4g'
# run spades
# run blast homology against user local database
# run false positive predictor
nextflow run $PBS_O_WORKDIR/main.nf -resume --indexfile $PBS_O_WORKDIR/index_example.csv --blastlocaldb --spades --contamination_detection
These jobs can be submitted using:
qsub nextflow_example.pbs
Monitoring the run
To check on the jobs you are running, use the command:

qstat -u $USER

Alternatively, use:

qjobs

Note that Nextflow will launch additional jobs during the run.
You can also check the .nextflow.log file for details on what is going on.
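For a live view while the run is in progress, the log can be followed as it is written, and nextflow log lists the runs recorded so far:

# follow the log as Nextflow writes it
tail -f .nextflow.log
# list previous runs recorded by Nextflow
nextflow log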
Finally, if you have configured a connection to NF Tower, you can log on and check your run there.
Outputs
Under the results folder, the pipeline will populate outputs under separate folders for each step. These will be stored in subfolders for each sample:
results
  → 01_read_size_selection
      → sample1
      → sample2
      → sample3
  → 02_velvet
      → sample1
      → sample2
      → sample3
etc…
The folders are:
01_read_size_selection (cutadapt log file and fastq file including only the reads matching the size specified in the index.csv file): MT020_21-22nt_cutadapt.log & MT020_21-22nt.fastq
02_velvet (velvet results and the fasta file which includes the velvet-assembled contigs): MT020_velvet_assembly_21-22nt.fasta
02a_spades (if spades is additionally run)
03_cap3 (fasta file of the scaffolds produced by CAP3 as well as the singletons): MT020_velvet_cap3_21-22nt_rename.fasta
04_blastn (all blastn results, filtered results limited to only virus and viroid top 5 hit matches and their taxonomy): MT020_velvet_21-22nt_megablast_vs_NT.bls, MT020_velvet_21-22nt_megablast_vs_NT_top5Hits.txt, MT020_velvet_21-22nt_megablast_vs_NT_top5Hits_virus_viroids_final.txt, MT020_velvet_21-22nt_megablast_vs_NT_top5Hits_virus_viroids_seq_ids_taxonomy.txt
05_blastoutputs (BlastTools.jar summary output which clusters all the contigs matching a specific hit): summary_MT029_velvet_21-22nt_megablast_vs_NT_top5Hits_virus_viroids_final.txt
06_blastp (blastp outputs)
07_filternstats (filtered blast summary with various coverage statistics for each virus and viroid hit, and the associated consensus fasta file and vcf file): MT020_21-22nt_top_scoring_targets_with_cov_stats.txt, MT020_21-22nt_MK929590_Peach_latent_mosaic_viroid.consensus.fasta, MT020_21-22nt_MK929590_Peach_latent_mosaic_viroid_sequence_variants.vcf.gz
08_report (summary of results for all samples included in the index.csv file, including a cross-contamination prediction): run_top_scoring_targets_with_cov_stats_with_cont_flag_21-22nt_0.01.txt
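Summary tables such as the coverage-statistics files are plain text; assuming they are tab-separated, one way to skim them at the terminal (the per-sample subfolder in the path is an assumption based on the layout above) is:

# render the tab-separated summary as aligned columns
column -t -s$'\t' results/07_filternstats/MT020/MT020_21-22nt_top_scoring_targets_with_cov_stats.txt | less -S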
To do:
Include a deduplication step for fastq files that have UMIs incorporated
Make QC filtering optional
Work on final report
Add coverage statistics to local db blast results
Incorporate VirusDetect in the pipeline