...

  • Retain reads of a given length (e.g. 21-22 or 24 nt long) from fastq file(s) provided in index.csv file (readprocessing)

  • De novo assembly using kmer 15 and coverage 3 (velvet)

  • Collapse contigs into scaffolds (min length 20) (cap3)

  • Run megablast homology search against NCBI NT database (megablast_nt_velvet)

  • Summarise megablast results and restrict to virus and viroid matches (BlastTools_megablast_velvet)

  • Derive coverage statistics, consensus sequence and VCF matching to top blast hits (filter_n_cov)
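The first step above, read-size selection, can be illustrated with a minimal plain-Python sketch. It is not the pipeline's implementation: the pipeline appears to rely on cutadapt for this step (a cutadapt log is among its outputs). The function names `parse_fastq` and `filter_reads_by_length` and the three example reads are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: header, sequence, separator, quality string.
FastqRecord = Tuple[str, str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield four-line FASTQ records, ignoring blank lines."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(cleaned), 4):
        header, seq, sep, qual = cleaned[i:i + 4]
        yield header, seq, sep, qual


def filter_reads_by_length(
    lines: Iterable[str], min_len: int, max_len: int
) -> List[FastqRecord]:
    """Keep only reads whose sequence length is within [min_len, max_len]."""
    return [
        rec for rec in parse_fastq(lines)
        if min_len <= len(rec[1]) <= max_len
    ]


# Hypothetical reads of 21, 23 and 24 nt (made-up data for illustration).
EXAMPLE_FASTQ = [
    "@read1", ("ACGT" * 6)[:21], "+", "I" * 21,
    "@read2", ("ACGT" * 6)[:23], "+", "I" * 23,
    "@read3", ("ACGT" * 6)[:24], "+", "I" * 24,
]

if __name__ == "__main__":
    for header, seq, _, _ in filter_reads_by_length(EXAMPLE_FASTQ, 21, 22):
        print(header, len(seq))
```

Calling `filter_reads_by_length(EXAMPLE_FASTQ, 24, 24)` instead selects the 24 nt size class. The sketch loads all lines into memory, which is acceptable for an illustration but not for real sequencing files.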

...

→ sample 3

etc…

The folders are structured as follows (each description is followed by example output file names):

  • 01_read_size_selection (cutadapt log file and fastq file containing only the reads that match the size specified in the index.csv file) MT020_21-22nt_cutadapt.log & MT020_21-22nt.fastq

  • 02_velvet (velvet results and the fasta file containing the velvet-assembled contigs) MT020_velvet_assembly_21-22nt.fasta

  • 02a_spades (if spades is additionally run)

  • 03_cap3 (fasta file of the scaffolds produced by CAP3 as well as the singletons) MT020_velvet_cap3_21-22nt_rename.fasta

  • 04_blastn (all blastn results, results filtered to the top 5 virus and viroid hits, and their taxonomy) MT020_velvet_21-22nt_megablast_vs_NT.bls, MT020_velvet_21-22nt_megablast_vs_NT_top5Hits.txt, MT020_velvet_21-22nt_megablast_vs_NT_top5Hits_virus_viroids_final.txt, MT020_velvet_21-22nt_megablast_vs_NT_top5Hits_virus_viroids_seq_ids_taxonomy.txt

  • 05_blastoutputs (BlastTools.jar summary output, which clusters all the contigs matching a specific hit) summary_MT029_velvet_21-22nt_megablast_vs_NT_top5Hits_virus_viroids_final.txt

  • 06_blastp (blastp outputs) MT020_velvet_21-22nt_getorf.min50aa.fasta, MT020_velvet_21-22nt_getorf.min50aa_blastp_vs_NR_out_virus_viroid.txt

  • 07_filternstats (filtered blast summary with various coverage statistics for each virus and viroid hit, and associated consensus fasta file and vcf file) MT020_21-22nt_top_scoring_targets_with_cov_stats.txt, MT020_21-22nt_MK929590_Peach_latent_mosaic_viroid.consensus.fasta, MT020_21-22nt_MK929590_Peach_latent_mosaic_viroid_sequence_variants.vcf.gz

  • 08_report summary (summary of results for all samples included in the index.csv file. This includes a cross-contamination prediction) run_top_scoring_targets_with_cov_stats_with_cont_flag_21-22nt_0.01.txt
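The example file names above share a pattern: a sample ID, optional tool tags, and a read-size label such as `21-22nt`. The sketch below recovers the sample ID and size range from such names. It is inferred only from the examples listed here, not from the pipeline's code. It assumes sample IDs are purely alphanumeric (for example `MT020`), and the names `OutputName` and `parse_output_name` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class OutputName(NamedTuple):
    sample: str
    min_len: int
    max_len: int


# Assumed layout: [summary_]<sample>_[<tags>_]<min>[-<max>]nt...
# This is a guess based on the example names, not a documented convention.
_PATTERN = re.compile(
    r"^(?:summary_)?(?P<sample>[A-Za-z0-9]+)_(?:.*?_)?"
    r"(?P<min>\d+)(?:-(?P<max>\d+))?nt"
)


def parse_output_name(filename: str) -> Optional[OutputName]:
    """Return sample ID and read-size range, or None if the name does not fit."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    min_len = int(match.group("min"))
    # A single size such as "24nt" is treated as the range 24-24.
    max_len = int(match.group("max") or min_len)
    return OutputName(match.group("sample"), min_len, max_len)


if __name__ == "__main__":
    for name in (
        "MT020_21-22nt_cutadapt.log",
        "MT020_velvet_cap3_21-22nt_rename.fasta",
        "summary_MT029_velvet_21-22nt_megablast_vs_NT_top5Hits_virus_viroids_final.txt",
    ):
        print(parse_output_name(name))
```

The run-level summary file has no sample ID, so this pattern would report `run` in the sample field. A real implementation would need to handle that file separately.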

...

Future potential additional features:

  • Include a deduplication step for fastq files that have UMIs incorporated

  • Make QC filtering optional

  • Incorporate the initial fastq file filtering steps from sRNAqc as an option

  • Work on final summary report

  • Add coverage statistics and cross-contamination flag logic to local db blast results

  • Incorporate VirusDetect in the pipeline and derive a summary of results from both pipelines

  • Run both the 21-22 nt and 24 nt analyses by default