...
Retain reads of a given length (e.g. 21-22 or 24 nt long) from fastq file(s) provided in index.csv file (readprocessing)
De novo assembly using kmer 15 and coverage 3 (velvet)
Collapse contigs into scaffolds (min length 20) (cap3)
Run megablast homology search against NCBI NT database (megablast_nt_velvet)
Summarise megablast results and restrict to virus and viroid matches (BlastTools_megablast_velvet)
Derive coverage statistics, consensus sequence and VCF matching to top blast hits (filter_n_cov)
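To make the first step concrete, here is a minimal pure-Python sketch of what read size selection amounts to. It is not the pipeline's implementation (the output listing suggests cutadapt performs this step); the function names `parse_fastq`, `select_by_length` and `parse_size_range` are hypothetical, and the size label format ("21-22" or "24") is assumed from the examples in this document.

```python
from typing import Iterable, Iterator, Tuple

FastqRecord = Tuple[str, str, str, str]


def parse_size_range(label: str) -> Tuple[int, int]:
    """Turn a size label such as '21-22' or '24' into (min_len, max_len)."""
    low, _, high = label.partition("-")
    return int(low), int(high or low)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, separator, quality) for each 4-line record."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("truncated FASTQ: expected 4 lines per record")
    for i in range(0, len(stripped), 4):
        header, seq, sep, qual = stripped[i:i + 4]
        yield header, seq, sep, qual


def select_by_length(
    records: Iterable[FastqRecord], min_len: int, max_len: int
) -> Iterator[FastqRecord]:
    """Keep only reads whose sequence length lies within [min_len, max_len]."""
    for record in records:
        if min_len <= len(record[1]) <= max_len:
            yield record


if __name__ == "__main__":
    # Tiny in-memory example: reads of 20, 21 and 24 nt.
    demo = [
        "@r1", "A" * 20, "+", "I" * 20,
        "@r2", "C" * 21, "+", "I" * 21,
        "@r3", "G" * 24, "+", "I" * 24,
    ]
    low, high = parse_size_range("21-22")
    for header, seq, _, _ in select_by_length(parse_fastq(demo), low, high):
        print(header, len(seq))
```

For real data, the same functions can be fed an open file handle instead of the in-memory list.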
...
→ sample 3
etc…
The folders are structured as follows (example output files are listed after each description):
01_read_size_selection (cutadapt log file and fastq file containing only the reads that match the size specified in the index.csv file) MT020_21-22nt_cutadapt.log & MT020_21-22nt.fastq
02_velvet (velvet results and the fasta file which includes the velvet assembled contigs) MT020_velvet_assembly_21-22nt.fasta
02a_spades (if spades is additionally run)
03_cap3 (fasta file of the scaffolds produced by CAP3 as well as the singletons) MT020_velvet_cap3_21-22nt_rename.fasta
04_blastn (all blastn results, filtered results limited to only viruses and viroid top 5 hit matches and their taxonomy) MT020_velvet_21-22nt_megablast_vs_NT.bls, MT020_velvet_21-22nt_megablast_vs_NT_top5Hits.txt, MT020_velvet_21-22nt_megablast_vs_NT_top5Hits_virus_viroids_final.txt, MT020_velvet_21-22nt_megablast_vs_NT_top5Hits_virus_viroids_seq_ids_taxonomy.txt
05_blastoutputs (BlastTools.jar summary output, which clusters all the contigs matching a specific hit) summary_MT029_velvet_21-22nt_megablast_vs_NT_top5Hits_virus_viroids_final.txt
06_blastp (blastp outputs) MT020_velvet_21-22nt_getorf.min50aa.fasta, MT020_velvet_21-22nt_getorf.min50aa_blastp_vs_NR_out_virus_viroid.txt
07_filternstats (filtered blast summary with various coverage statistics for each virus and viroid hit, and associated consensus fasta file and vcf file) MT020_21-22nt_top_scoring_targets_with_cov_stats.txt, MT020_21-22nt_MK929590_Peach_latent_mosaic_viroid.consensus.fasta, MT020_21-22nt_MK929590_Peach_latent_mosaic_viroid_sequence_variants.vcf.gz
08_report summary (summary of results for all samples included in the index.csv file. This includes a cross-contamination prediction) run_top_scoring_targets_with_cov_stats_with_cont_flag_21-22nt_0.01.txt
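The naming scheme above lends itself to a quick completeness check after a run. The sketch below is only an illustration: the file name patterns are inferred from the MT020 examples in this listing rather than taken from the pipeline code, only a subset of folders is covered, and the helpers `expected_outputs` and `missing_outputs` are hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def expected_outputs(sample: str, size: str) -> Dict[str, List[str]]:
    """Map output folders to the file names expected for one sample.

    Patterns are assumptions inferred from the example names in this README
    (e.g. sample 'MT020', size '21-22').
    """
    tag = f"{size}nt"
    return {
        "01_read_size_selection": [
            f"{sample}_{tag}_cutadapt.log",
            f"{sample}_{tag}.fastq",
        ],
        "02_velvet": [f"{sample}_velvet_assembly_{tag}.fasta"],
        "03_cap3": [f"{sample}_velvet_cap3_{tag}_rename.fasta"],
        "04_blastn": [f"{sample}_velvet_{tag}_megablast_vs_NT.bls"],
        "07_filternstats": [
            f"{sample}_{tag}_top_scoring_targets_with_cov_stats.txt"
        ],
    }


def missing_outputs(sample_dir: Path, sample: str, size: str) -> List[Path]:
    """Return the expected files that are absent under sample_dir."""
    missing = []
    for folder, names in expected_outputs(sample, size).items():
        for name in names:
            path = sample_dir / folder / name
            if not path.is_file():
                missing.append(path)
    return missing


if __name__ == "__main__":
    # 'results/MT020' is a placeholder path, not a location the pipeline defines.
    for path in missing_outputs(Path("results/MT020"), "MT020", "21-22"):
        print("missing:", path)
```

An empty result means every file in the checked subset is present; it says nothing about whether the contents are valid.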
...
Future potential additional features:
Include a deduplication step for fastq files that have UMIs incorporated
Make QC filtering optional
Incorporate the initial fastq file filtering steps from sRNAqc as an option
Work on final summary report
Add coverage statistics and cross contamination flag logic to local db blast results
Incorporate VirusDetect in the pipeline and derive a summary of results from both pipelines
Run both the 21-22nt and 24nt analyses automatically by default