...

Tiffanie M Nelson and Jeffrey H Christiansen

Contents

1. Executive Summary

...

Community input is welcomed at all times, as is the nomination of additional members of the SIG, either by adding comments directly to this Google document or by emailing communities@biocommons.org.au.

Feedback on the proposed components outlined in this initial draft plan is now sought from the SIG and any other Australian researchers or their collaborators undertaking metagenomics and microbiome analyses.

...

Figure 2: Estimates of the increasing number of microbiome analysis studies conducted in Australia

‘amplicon’ or ‘microbiome’ or ‘microbiota’ or ‘microbial community’ or ‘virome’ in the title, abstract or keyword and genome or sequencing or sequence or genomic or next-generation in the title, abstract or keyword and ‘Australia’ in the affiliation. Articles retrieved from the search were manually reviewed to include only those whose focus included the production of data using a marker gene or metagenomic sequencing method and excluded others whose focus was on developing or evaluating analysis methods or tools. Articles that were retrieved during multiple searches were limited to include only one representative article categorised to either marker gene or shotgun. The complete list of citations including abstracts can be found here.
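
The inclusion criteria listed above can be sketched as a simple record filter. This is a minimal illustration only: the record layout (`title`, `abstract`, `keywords`, `affiliations`), the function names, the use of plain substring matching, and deduplication by title are all assumptions. The bibliographic database and exact query syntax used for the original search are not reproduced here, and neither is the manual review step.

```python
# Hypothetical record filter mirroring the search criteria listed above.
# Field names and substring matching are illustrative assumptions.

TOPIC_TERMS = ["amplicon", "microbiome", "microbiota", "microbial community", "virome"]
METHOD_TERMS = ["genome", "sequencing", "sequence", "genomic", "next-generation"]


def searchable_text(record):
    """Join title, abstract and keywords into one lower-case string."""
    parts = [record.get("title", ""), record.get("abstract", "")]
    parts.extend(record.get("keywords", []))
    return " ".join(parts).lower()


def matches_search(record):
    """True if a topic term AND a method term appear, with an Australian affiliation."""
    text = searchable_text(record)
    has_topic = any(term in text for term in TOPIC_TERMS)
    has_method = any(term in text for term in METHOD_TERMS)
    in_australia = any(
        "australia" in affiliation.lower()
        for affiliation in record.get("affiliations", [])
    )
    return has_topic and has_method and in_australia


def deduplicate(records):
    """Keep one representative per article retrieved by several searches.

    Articles are keyed on their normalised title, which is an assumption;
    a DOI would be a more robust key where available.
    """
    seen = {}
    for record in records:
        seen.setdefault(record["title"].strip().lower(), record)
    return list(seen.values())
```

Because substring matching is used, a term such as "genome" also matches "metagenome"; a real search would rely on the field and wildcard rules of the database being queried.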

In late July 2020, the Australian BioCommons invited over 100 researchers across Australia to participate in a Microbiome Analysis Special Interest Group (SIG). These researchers were identified as having experience in, or interest in, microbiome analysis. The Australian BioCommons sought information from the SIG about each member’s level of expertise, current (and desired) practices and infrastructure used via an online survey (number of respondents = 33), and also held an open video conference follow-up to gain further information (minutes and a recording of the meeting are available).

Respondents to the survey and attendees at the meeting collectively indicated they are performing microbiome analyses on samples from both environmental (e.g. marine, freshwater, soil, and air) and host-associated (e.g. animals, plants, corals and humans) habitats. The collective responses also indicated that all of the following approaches are being undertaken by Australian researchers: targeted amplicon sequencing, random shotgun sequencing, taxonomic profiling, functional profiling, generating metagenome-assembled genomes (MAGs), phylogenetic analysis, statistical analyses, and novel gene discovery.

...

Based on information received from the SIG members through the survey (n=33), most researchers use a combination of sequencing platforms to generate their data with the most popular being Illumina, Nanopore, and PacBio.

...

For functional classification, the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (which provides information relating to the functional classification of cells and organisms) is accessed by 72% of survey respondents.

3.3.2 Tools

Based on the survey, approximately 100 software tools, pipelines, or packages were identified as being used by respondents for various stages of the microbiome analysis process. These are listed in Appendix 1 of this document.

The data generated in either amplicon marker gene or shotgun metagenomic surveys present a wide variety of possible analysis pathways/workflows to pursue and there are many options for tools/pipelines or processes at each step of a chosen bioinformatic pathway.

...

Several researchers (36%, n=12) reported that they were not using their preferred tools/pipelines (primarily due to not having access to sufficient computational memory to run these tools - see Section 3.4.1) and instead had resorted to a workaround solution with other tools.

...

Figure 4. Schematic diagram showing the proposed infrastructure to support microbiome analyses, and data flow

(D1) Sequence reads or other relevant data are inputs into the Platform for Taxonomic and Functional Microbiome Analyses, which provides command-line interface (CLI)- or graphical user interface (GUI)-based access to tools and workflows for performing amplicon marker gene clustering or metagenome assembly and classification (blue shapes). It is underpinned by sufficient and appropriate computational infrastructure. Closely associated is a data management platform (denoted by the darker green shape) that caters to data management, version control, and association of appropriate (e.g. sample, experimental) metadata with the data files. Outputs of D1 are accessible to both (D2) hosted frameworks to enable researchers to utilise common packages for statistical analysis, visualisation, and exploration of microbiome datasets, and (D3) systems to enable submission/publishing of metagenome-assembled genome files (and sequence read data) to international repositories. Arrows indicate the general flow of data. Thicker arrows indicate increasing data transfer capabilities. See Appendix 1 for a list of tools/pipelines that may be included in D1. Higher resolution image.

D1 - A platform for performing taxonomic and functional analyses of microbiomes:

To address objective 1 (i.e. providing Australian researchers with access to a selection of tools and workflows underpinned by computational resources that allow taxonomic and functional analyses of microbiomes (whether they be derived from amplicon/targeted or shotgun/metagenomics based sequencing approaches) to be performed), it is proposed to implement a platform in Australia, that:

...

D2 - Systems to enable statistical analyses and visualisations of microbial community data:

To address objective 2 (i.e. to make it easier for Australian researchers to perform statistical and visualisation analyses of microbiome data), it is proposed to implement:

...

D3 - Systems to enable submission of raw sequencing reads and metagenome-assembled genome files from Australia to appropriate global repositories:

To address objective 3 (i.e. to make it easier to publish and share high-quality metagenome-assembled genomes (and relevant input data, including raw sequencing reads) in accordance with best-practice open science guidelines), it is proposed to implement:

A temporary ‘staging post’ in Australia for metagenome and microbiome (and sequence read) files ready for public international release. The system should include data/metadata formatting checks (which would be enabled by the use of the data management platforms described in D1-E), and support as detailed in D1-F;
Rapid data transfer from the data management platform or the sharing platform to NCBI and/or ENA; and,
Documentation on how to use the system (including a knowledgebase with community-contributed content).
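
As an illustration of the data/metadata formatting checks mentioned above, the sketch below validates a sample record before it is staged for release. The required field names, the date format, and the function name are hypothetical and are not taken from NCBI or ENA submission specifications; a real check would follow the checklist of the target repository.

```python
from datetime import datetime

# Hypothetical required fields; a real staging post would use the
# checklist of the target repository (e.g. NCBI or ENA).
REQUIRED_FIELDS = ("sample_id", "collection_date", "habitat", "sequencing_platform")


def check_metadata(record):
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = []

    # Every required field must be present and non-blank.
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing field: {field}")

    # The collection date, when given, is assumed to be an ISO 8601 calendar date.
    date = record.get("collection_date")
    if date:
        try:
            datetime.strptime(str(date), "%Y-%m-%d")
        except ValueError:
            problems.append(f"collection_date not in YYYY-MM-DD format: {date}")

    return problems
```

A record would only be transferred onwards once `check_metadata` returns an empty list; otherwise the listed problems could be reported back to the submitter.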

...

Component	Planned dates for delivery	Notes
D1-Aa. Key tools/workflows installed as modules and optimised for CLI access across a variety of Tier 1 and Tier 2 HPC infrastructures.	Ongoing	As of November 2020, 6 of the tools listed in Appendix 1 (graftm, groopm, metacv, QIIME, QIIME2.0, SortMeRNA) are installed as modules on QRIScloud/UQ-RCC HPC machines (Tinaroo, Awoonga, FlashLite). Installation of further tools as modules across NCI, Pawsey, and QRIScloud/UQ-RCC infrastructures to support microbiome analysis is being undertaken in the BioCommons. Preliminary discussions have been held with the MGnify group at EBI about installing and hosting MGnify (which offers specialised workflows for three different data types: amplicon, raw metagenomic/metatranscriptomic reads, and assembly) on Australian BioCommons associated infrastructure, and with the Marine Metagenomics group from ELIXIR-Norway about the local installation of the Meta-Pipe workflow (for pre-processing, assembly, taxonomic classification and functional analysis of marine metagenomics data).
D1-Aa. CLI platform appropriately resourced for performing microbiome analyses	Ongoing	BioCommons partner infrastructures at NCI, Pawsey, and QCIF include machines that are capable of performing any part of microbiome analysis. This includes FlashLite at QCIF/UQ, which can be structured to allow ‘supernodes’ of up to 8TB. Mechanisms for increasing access to partner HPC systems other than through the National Computational Merit Allocation Scheme (NCMAS) or partner shares are under active exploration by the BioCommons.
D1-Ab. Key tools/workflows installed as modules and optimised on Galaxy Australia.	Ongoing	As of November 2020, 4 of the tools listed in Appendix 1 (maxbin2, metaSPAdes, mothur, SortMeRNA) are installed on Galaxy Australia. Installation of further tools on Galaxy Australia can be requested by any member of the community at any time.
D1-Ab. Galaxy Australia appropriately resourced for performing microbiome analyses	Q1 2021	In addition to the 465 cores at QCIF, UMelb, and Pawsey that currently underpin Galaxy Australia, the Australian BioCommons has secured ARDC funding to purchase an additional minimum of 1x 4TB and 3x 2TB high memory nodes to contribute computational resources to Galaxy Australia. These nodes will be reserved for specific tools requiring high memory, such as those required for MAG assembly.
D1-Ac. Key tools available as high quality trusted software containers for self-deployment on institutional or independent computational infrastructures	Ongoing	Development of containerised tools to support various life science researcher communities in Australia (including microbiome analysis) is being undertaken in the BioCommons.
D1-B. Connectable to Nationally available storage (e.g. Cloudstor)	Ongoing	In late 2020, a direct connection between . Streamlined connectivity of Cloudstor storage to Pawsey, QCIF, NCI, and other computational resources will continue in the BioCommons.
D1-C/D2-B. Appropriate user authorisation and sharing mechanisms	Ongoing	AAF is currently engaged by the BioCommons to explore Access and Authentication Frameworks that will be fit for purpose across all envisaged BioCommons-related platforms and services.
D1-G. Tool and software workflow documentation with community contributed content.	Ongoing	Tool and workflow documentation for other researcher communities (e.g. de novo genome assembly, and genome annotation) is being organised via an Australian BioCommons GitHub: https://github.com/australianbiocommons. This avenue is available for the microbiome analysis community.
D1-H. Training re. containerisation of software tools.	Ongoing	Introductory level training around software containerisation (co-organised by BioCommons and Pawsey) occurred in June/July 2020 and will be repeated throughout 2021, 2022, and 2023. See https://www.biocommons.org.au/events/containers-intro and the Australian for recordings of these events.

...

Component	Notes
D1-D. A data management system that is tightly linked to the Microbiome Platforms	Considerations for what may be the best technical solution are ongoing. See Requirements of a Data Management Component of the Australian
D1-H. Training re. taxonomic and functional bioinformatics of shotgun and targeted sequencing projects	Discussions with EBI to potentially deliver microbiome analysis related bioinformatics training events to an Australian audience during 2021 or 2022 have begun.
D2-A. Hosted frameworks to enable researchers to utilise common packages for statistical analysis, visualisation, and exploration of microbiome datasets	‘Interactive environments’ offered through the Galaxy platform include RStudio, JupyterLab, CloudStor SWAN, and Phinch. These are currently available publicly through the European public Galaxy instance (see https://live.usegalaxy.eu/), and are planned for release via Galaxy Australia in Q1 2021. Galaxy interactive environments may represent an option for this feature.
D3-A and D3-B. A temporary ‘staging post’ in Australia for metagenome and microbiome (and sequence read) files ready for public international release, with a rapid data transfer from the data management platform or the sharing platform to NCBI and/or ENA	COPO is a GUI-based metadata platform for brokering life science data submissions to various repositories including the ENA (see https://f1000research.com/articles/9-495). It is being adopted by the Darwin Tree of Life project in the UK as the tool to enable the data and metadata submission to ENA to be completed for genome assemblies of over 60,000 species native to the British Isles. The Australian BioCommons is currently exploring whether a locally supported COPO instance can fulfil the requirements of D3-A/D3-B.

...

Workflow Step	High-level component	Tool	Brief description	Link to data/software or article
1	Quality Control	FastQC	Provides a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines.	http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
2	Preprocessing	BLAST+	A suite of command-line tools for running BLAST searches for sequence similarity.	https://blast.ncbi.nlm.nih.gov/Blast.cgi
2	Preprocessing	ChimeraSlayer	A chimeric sequence detection utility, compatible with near-full length Sanger sequences and shorter 454-FLX sequences (~500 bp).	http://microbiomeutil.sourceforge.net/
2	Preprocessing	fastp	Tool designed to provide fast all-in-one preprocessing for FastQ files.	https://github.com/OpenGene/fastp
2	Preprocessing	FASTX-Toolkit	A collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.	http://hannonlab.cshl.edu/fastx_toolkit/
2	Preprocessing	FLASH - Fast Length Adjustment of SHort reads	A very fast and accurate software tool to merge paired-end reads from next-generation sequencing experiments.	https://ccb.jhu.edu/software/FLASH/
2	Preprocessing	MultiQC	A reporting tool that parses summary statistics from results and log files generated by other bioinformatics tools.	https://multiqc.info/docs/
2	Preprocessing	PANDAseq	A program to align Illumina reads, optionally with PCR primers embedded in the sequence, and reconstruct an overlapping sequence.	https://github.com/neufeld/pandaseq
2	Preprocessing	PEAR - Paired-End reAd mergeR	A fast and accurate Illumina Paired-End reAd mergeR.	https://cme.h-its.org/exelixis/web/software/pear/doc.html https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3933873/
2	Preprocessing	Prinseq	Easy and rapid quality control and data preprocessing of genomic and metagenomic datasets.	http://prinseq.sourceforge.net/ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3051327/
2	Preprocessing	Prinseq++	A program to filter, reformat or trim genomic and metagenomic sequence data.	https://github.com/Adrian-Cantu/PRINSEQ-plus-plus
2	Preprocessing	SortMeRNA	A tool for filtering, mapping, and OTU-picking NGS reads in metatranscriptomic and metagenomic data.	https://github.com/biocore/sortmerna
2	Preprocessing	Tagcleaner	A tool to automatically detect and efficiently remove tag sequences.	http://tagcleaner.sourceforge.net/
2	Preprocessing	Trimmomatic	A flexible read trimming tool for Illumina NGS data.	http://www.usadellab.org/cms/?page=trimmomatic
2	Preprocessing	UCHIME/ UCHIME2	Chimera detection tool.	https://www.drive5.com/usearch/manual/uchime2_algo.html https://www.biorxiv.org/content/10.1101/074252v1.full
2	Preprocessing	VSEARCH	Processes and prepares metagenomics, genomics, and population genomics nucleotide sequence data.	https://github.com/torognes/vsearch
3	OTU/ASV picking clustering	UPARSE	A method for generating clusters (OTUs) from next-generation sequencing reads	http://drive5.com/uparse/
3	OTU/ASV picking clustering	USEARCH	A unique sequence analysis tool with thousands of users worldwide.	https://www.drive5.com/usearch/
4	Taxonomic classification	Centrifuge	A very rapid and memory-efficient system for the classification of DNA sequences from microbial samples.	https://ccb.jhu.edu/software/centrifuge/
4	Taxonomic classification	Focus	An agile composition based approach using non-negative least squares (NNLS) to report the organisms present in metagenomic samples and profile their abundances.	https://peerj.com/articles/425/
4	Taxonomic classification	Gist	A statistical classifier for taxonomic inference for mRNA reads	https://github.com/rhetorica/gist
4	Taxonomic classification	graftm	A tool to identify and classify marker genes in short read datasets.	https://geronimp.github.io/graftM/
4	Taxonomic classification	GTDB-Tk	A computationally efficient toolkit able to classify thousands of draft genomes in parallel.	https://github.com/Ecogenomics/GTDBTk
4	Taxonomic classification	Kraken/ KRAKEN2	A taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds.	https://ccb.jhu.edu/software/kraken2/
4	Taxonomic classification	MetaCV	A composition and phylogeny-based algorithm to classify very short metagenomic reads (75-100 bp) into specific taxonomic and functional groups.	https://sourceforge.net/projects/metacv/
4	Taxonomic classification	MetaPhyler	A novel taxonomic classifier for metagenomic shotgun reads, which uses phylogenetic marker genes as a taxonomic reference.	http://metaphyler.cbcb.umd.edu/
4	Taxonomic classification	PhymmBL	a new classification approach for metagenomics data which uses interpolated Markov models (IMMs) to taxonomically classify DNA sequences, c	https://www.cbcb.umd.ed
5	Sequence assembly	AMOS/ MetAMOS	An open-source, modular assembly pipeline built upon AMOS and tailored specifically for metagenomic next-generation sequencing data	https://genomebiology.biomedcentral.com/articles/10.1186/gb-2011-12-s1-p25
5	Sequence assembly	BinSanity	A suite of scripts designed to cluster contigs generated from metagenomic assembly into putative genomes.	https://github.com/edgraham/BinSanity
5	Sequence assembly	Flye	A de novo assembler for single molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies.	https://github.com/fenderglass/Flye
5	Sequence assembly	GATB-minia-pipeline	A de novo assembly pipeline for Illumina data.	https://github.com/GATB/gatb-minia-pipeline
5	Sequence assembly	groopm	A metagenomics binning suite.	http://ecogenomics.github.io/GroopM/
5	Sequence assembly	IDBA-UD	Designed to utilize paired-end reads to assemble low-depth regions and use progressive depth on contigs to reduce errors in high-depth regions.	https://github.com/loneknightpy/idba https://pubmed.ncbi.nlm.nih.gov/22495754/
5	Sequence assembly	MaxBin/ MaxBin2	Software for binning assembled metagenomic sequences.	https://toolshed.g2.bx.psu.edu/view/mbernt/maxbin2/cfd50144a871
5	Sequence assembly	MEGAHIT	An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph.	https://github.com/voutcn/megahit
5	Sequence assembly	Meta-IDBA	An algorithm for assembling reads in metagenomic data, which contain multiple genomes from different species.	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3117360/
5	Sequence assembly	MetaBAT2	Clusters metagenomic contigs into different "bins", each of which should correspond to a putative genome.	https://kbase.us/applist/apps/metabat/run_metabat/release
5	Sequence assembly	MetaCluster	Unsupervised binning method for metagenomic sequences.	https://github.com/mbanf/METACLUSTER
5	Sequence assembly	metaSPAdes	A versatile metagenomic assembler	http://spades.bioinf.spbau.ru/release3.11.1/manual.html https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411777/
5	Sequence assembly	MetaVelvet	An extension of Velvet assembler to de novo metagenome assembly from short sequence reads	http://metavelvet.dna.bio.keio.ac.jp/ https://pubmed.ncbi.nlm.nih.gov/22821567/
5	Sequence assembly	MIRA	DNA sequence data assembler/mapper for whole genome and EST/RNASeq projects.	http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_intro_whatismira
5	Sequence assembly	S-GSOM	Binning sequences using very sparse labels within a metagenome.	https://bmcbioinformatics.biomedcentral
5	Sequence assembly	SOAPdenovo2	A novel short-read assembly method that can build a de novo draft assembly for human-sized genomes.	https://github.com/aquaskyline/SOAPdenovo2
5	Sequence assembly	SPADES - St. Petersburg genome assembler	An assembly toolkit containing various assembly pipelines.	https://cab.spbu.ru/software/spades/
5	Sequence assembly	Unicycler	An assembly pipeline for bacterial genomes.	https://github.com/rrwick/Unicycler
5	Sequence assembly	Velvet	A de novo genome assembler specially designed for short read sequencing technologies, such as Solexa or 454.	https://www.ebi.ac.uk/~zerbino/velvet/
6	Gene prediction and alignment	AMR++	A bioinformatics pipeline that interfaces with MEGARes to identify and quantify AMR gene accessions contained within a metagenomic sequence dataset.	https://academic.oup.com/nar/art
6	Gene prediction and alignment	BBMap	Splice-aware global aligner for DNA and RNA sequencing reads. It can align reads from all major platforms.	https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/
6	Gene prediction and alignment	BLAT	Accurate and 500 times faster than popular existing tools for mRNA/DNA alignments.	https://genome.cshlp.org/content/12/4/656
6	Gene prediction and alignment	BMGE - Block Mapping and Gathering with Entropy	Designed to select regions in a multiple sequence alignment that are suited for phylogenetic inference.	https://bmcevolbiol.biomedcentral.com/articles/10.1186/1471-2
6	Gene prediction and alignment	Bowtie/ Bowtie2	An ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences.	http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#getting-started-with-bowtie-2-lambda-phage-example
6	Gene prediction and alignment	BWA	A software package for mapping low-divergent sequences against a large reference genome, such as the human genome.	http://bio-bwa.sourceforge.net/
6	Gene prediction and alignment	CD-HIT	A very widely used program for clustering and comparing protein or nucleotide sequences.	http://weizhongli-lab.org/cd-hit/
6	Gene prediction and alignment	DIAMOND	A sequence aligner for protein and translated DNA searches, designed for high performance analysis of big sequence data.	http://www.diamondsearch.org/index.php
6	Gene prediction and alignment	GlimmerMG	A system for finding genes in environmental shotgun DNA sequences.	http://www.cbcb.umd.edu/software/glimmer-mg/
6	Gene prediction and alignment	HMMER	Biosequence analysis using profile hidden Markov models.	http://hmmer.org/
6	Gene prediction and alignment	Infernal - INFERence of RNA ALignment	A useful tool for identifying RNAs in metagenomics data sets.	http://eddylab.org/infernal/
6	Gene prediction and alignment	IQ-TREE	Phylogenetic tree inference by maximum likelihood.	http://www.iqtree.org/
6	Gene prediction and alignment	MAFFT - Multiple Alignment with Fast Fourier Transform	A multiple sequence alignment program.	http://evomics.org/resources/software/bioinformatics-software/mafft/
6	Gene prediction and alignment	mauve	A system for constructing multiple genome alignments in the presence of large-scale evolutionary events such as rearrangement and inversion.	http://darlinglab.org/mauve/mauve.html
6	Gene prediction and alignment	MetaGene Annotator	A gene-finding program for prokaryote and phage.	http://metagene.nig.ac.jp/
6	Gene prediction and alignment	MetaGeneMark	Novel genomic sequences can be analyzed either by the self-training program GeneMarkS (sequences longer than 50 kb) or by GeneMark.hmm.	http://exon.gatech.edu/meta_gmhmmp.cgi
6	Gene prediction and alignment	Minimap2	A general-purpose alignment program to map DNA or long mRNA sequences against a large reference database.	https://github.com/lh3/minimap2 https://academic.oup.com/bioinformatics/article/34/18/3094/4994778
6	Gene prediction and alignment	MinPath/ MinPath2	Minimal set of Pathways: a tool for biological pathway reconstruction using protein family predictions.	https://omics.informatics.indiana.edu/MinPath/ http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000465
6	Gene prediction and alignment	NAST-iEr	Aligns a single raw nucleotide sequence against one or more NAST formatted sequences.	http://microbiomeutil.sourceforge.net/#A_NASTiEr
6	Gene prediction and alignment	PhyloSift	A suite of software tools to conduct phylogenetic analysis of genomes and metagenomes.	https://github.com/gjospin/PhyloSift
6	Gene prediction and alignment	PSORTm / PSORTb	For protein subcellular localization prediction (SCL).	https://www.psort.org/psortm/
6	Gene prediction and alignment	pyani	A Python package and standalone program for calculation of whole-genome similarity measures.	https://pyani.readthedocs.io/_/downloads/en/latest/pdf/
6	Gene prediction and alignment	TETRA	A web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences.	https://bmcbioinformatics.bio
6	Gene prediction and alignment	tRNAscan-SE	The de facto tool for predicting tRNA genes in whole genomes.	http://trna.ucsc.edu/tRNAscan-SE/ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6768409/
7	Annotation prediction	BlastKOALA/ GhostKOALA	Automatic annotation servers for genome and metagenome sequences, which perform KO (KEGG Orthology) assignments to characterize individual gene functions and reconstruct KEGG pathways.	https://www.sciencedirect.com/science/article/pii/S002228361500649X
7	Annotation prediction	dbCAN	A web server for automated Carbohydrate-active enzyme ANnotation.	http://bcb.unl.edu/dbCAN2/
7	Annotation prediction	eggNOG-mapper	A tool for fast functional annotation of novel sequences.	https://github.com/eggnogdb/eggnog-mapper
7	Annotation prediction	KAAS - KEGG Automatic Annotation Server	Provides functional annotation of genes by BLAST or GHOST comparisons against the manually curated KEGG GENES database.	https://www.genome.jp/kegg/kaas/
7	Annotation prediction	KofamKOALA	A web server to assign KEGG Orthologs (KOs) to protein sequences by homology search.	https://www.genome.jp/tools/kofamkoala/ https://academic.oup.com/bioinformatics/article/36/7/2251/5631907
7	Annotation prediction	PICRUSt/ PICRUSt2	A method to predict approximate functional potential of a community based on marker gene sequencing profiles.	https://github.com/picrust/picrust2 https://www.biorxiv.org/content/10.1101/672295v1.full
7	Annotation prediction	PROKKA	Annotation tool for bacterial, archaeal, and viral genomes.	http://www.metagenomics.wiki/tools/annotation/prokka
7	Annotation prediction	SUPER-FOCUS	A tool for functional analysis of metagenomes that uses the SEED database.	https://github.com/metageni/SUPER-FOCUS
7	Annotation prediction	Tax4Fun2	An R-based tool for the rapid prediction of habitat-specific functional profiles and functional redundancy based on 16S rRNA marker gene sequences.	https://sourceforge.net/projects/tax4fun2/ https://www.biorxiv.org/content/10.1101/490037v1.full.pdf
8	Assembly Validation	CheckM	A set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes.	https://ecogenomics.github.io/CheckM/
8	Assembly Validation	CheckV	For assessing the quality of metagenome-assembled viral genomes.	https://www.biorxiv.org/content/10.1101/2020.05.06.081778v1
8	Assembly Validation	CompareM	A software toolkit which supports performing large-scale comparative genomic analyses. It provides statistics across sets of genomes (e.g., amino acid identity) and for individual genomes.	https://github.com/dparks1134
8	Assembly Validation	Valet	Evaluating metagenomic assemblies.	https://github.com/marbl/VALET
9	Statistical analysis and visualisation	DADA2	Fast and accurate sample inference from amplicon data with single-nucleotide resolution.	https://benjjneb.github.io/dada2/index.html
9	Statistical analysis and visualisation	Krona	Allows hierarchical data to be explored with zooming, multi-layered pie charts.	https://github.com/marbl/Krona/wiki
9	Statistical analysis and visualisation	metagenomeSeq	Designed to determine features (be it Operational Taxonomic Unit (OTU), species, etc.) that are differentially abundant between two or more groups.	https://www.bi
9	Statistical analysis and visualisation	MetaPath	Identify differentially abundant pathways in metagenomic data-sets.	https://www.cbcb.umd.edu/software/metapath
9	Statistical analysis and visualisation	Phyloseq	A set of classes and tools to facilitate the import, storage, analysis, and graphical display of microbiome census data.	https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html
10	Databases	CAZy - Carbohydrate-Active enZYmes Database	Describes the families of structurally-related catalytic and carbohydrate-binding modules (or functional domains) of enzymes that degrade, modify, or create glycosidic bonds.	http://www.cazy.org/
10	Databases	COG - Clusters of Orthologous Groups of proteins	A system for delineating Clusters of Orthologous Groups of proteins (COGs) from the sequenced genomes of prokaryotes and unicellular eukaryotes, and for constructing clusters of predicted orthologs.	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC222959/
10	Databases	Cyanorak	Cyanorak Information system is a bioinformatics tool dedicated to the curation, comparison and visualization of genomes of strains belonging to the subsection I, cluster 5, a deeply branching group within the Cyanobacteria phylum.	http://applic
10	Databases	EBI	European Bioinformatics Institute.	https://www.ebi.ac.uk/
10	Databases	eggNOG	A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses.	http://eggnog5.embl.de/#/app/home
10	Databases	FunGuild	A python-based tool that can be used to taxonomically parse fungal OTUs by ecological guilds independent of sequencing platforms or analysis pipelines.	http://www.funguild.org/
10	Databases	Greengenes	16S rRNA gene database or experimental datasets.	https://greengenes.secondgenome.com/
10	Databases	GTDB	Genome taxonomy database.	https://gtdb.ecogenomic.org/
10	Databases	InterPro	Functional analysis of proteins by classifying them into families and predicting domains and important sites.	https://www.ebi.ac.uk/interpro/
10	Databases	KEGG - Kyoto Encyclopedia of Genes and Genomes	A database resource for understanding high-level functions and utilities of the biological system.	https://www.genome.jp/kegg/
10	Databases	KOG - Eukaryotic Orthologous Groups (KOGs)	A eukaryote-specific version of the Clusters of Orthologous Groups (COG) tool for identifying ortholog and paralog proteins.	https://mycocosm. https://www.hsls.pitt.edu/obrc/index.php?page=URL1144075392
10	Databases	MAR	Marine databases; MarRef, MarDB and MarCat, which are publicly available resources that promote marine research and innovation.	https://mmp.sfb.uit.no/databases/ https://academic.oup.com/nar/article/46/D1/D692/4584637
10	Databases	MEROPS	An information resource for peptidases (also termed proteases, proteinases and proteolytic enzymes) and the proteins that inhibit them.	https://www.ebi.ac.uk/merops/ https://academic.oup.com/nar/article/46/D1/D624/4626772
10	Databases	MetaCyc	A curated database of experimentally elucidated metabolic pathways from all domains of life.	https://metacyc.org/
10	Databases	NCBI	National Center for Biotechnology Information.	https://www.ncbi.nlm.nih.gov/
10	Databases	PANTHER - Protein ANalysis THrough Evolutionary Relationships	Designed to classify proteins (and their genes) in order to facilitate high-throughput analysis.	http://www.pantherdb.org/data/
10	Databases	Pfam	A large collection of protein families.	https://pfam.xfam.org/
10	Databases	PR2	A reference database of carefully annotated 18S rRNA sequences using eight unique taxonomic fields.	https://pr2-database.org/
10	Databases	RDP	Provides the research community with aligned and annotated rRNA gene sequence data.	http://rdp.cme.msu.edu/ https://www.ncbi.nlm.nih.gov/pm
10	Databases	Rfam	A collection of RNA families, each represented by multiple sequence alignments.	https://rfam.xfam.org/ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383904/
10	Databases	SEED	Provides consistent and accurate genome annotations across thousands of genomes and serves as a platform for discovering and developing de novo annotations.	https://pubseed.theseed.org/ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965101/
10	Databases	Silva	Comprehensive, quality-checked and regularly updated datasets of aligned small (16S/18S, SSU) and large subunit (23S/28S, LSU) ribosomal RNA (rRNA) sequences for all three domains of life (Bacteria, Archaea and Eukarya).	https://www.arb-silva.de/
10	Databases	TARA Oceans	Diversity, evolution and ecology of marine plankton.	https://www.ebi.ac.uk/services/tara-oceans-data http://www.taraoceans-dataportal.org/top/
10	Databases	TCDB	A comprehensive IUBMB approved classification system for membrane transport proteins known as the Transporter Classification (TC) system.	http://www.tcdb.org/ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1334385/
10	Databases	TIGRFAM	A resource consisting of curated multiple sequence alignments, Hidden Markov Models (HMMs) for protein sequence classification, and associated information designed to support automated annotation of (mostly prokaryotic) proteins.	http://tigrfams.jcvi.org/cgi-bin/index.cgi
11	Other	Anvi'o	An open-source, community-driven analysis and visualization platform for microbial ‘omics.	http://merenlab.org/software/anvio/
11	Other	Calypso	Easy-to-use online software that allows non-expert users to mine, interpret and compare taxonomic information from metagenomic or 16S rDNA datasets.	http://cgenome.net/wiki/index.php/Calypso
11	Other	CLC Genomics Workbench	A bioinformatics software solution for comprehensive analysis of NGS data, including de novo assembly of whole genomes and transcriptomes, and resequencing analysis.	https://digitalinsights.qiagen.com/products-overview/discovery-insights-portfolio/analysis-and-visualization/qiagen-clc-genomics-workbench/
11	Other	conda	An open source package management system and environment management system that runs on Windows, macOS and Linux.	https://docs.conda.io/en/latest/
11	Other	Galaxy Australia	Galaxy is a web-based analysis and workflow platform.	https://usegalaxy.org.au/
11	Other	gromacs	A versatile package to perform molecular dynamics.	http://www.gromacs.org/
11	Other	IMG/M	A platform to support the annotation, analysis and distribution of microbial genome and microbiome datasets.	https://img.jgi.doe.gov/
11	Other	Jupyter Notebook	An open-source web application that allows you to create and share documents that contain live code.	https://jupyter.org/
11	Other	MEGAN - MEtaGenome ANalyzer	A comprehensive toolbox for interactively analyzing microbiome data.	https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/megan6/
11	Other	MetaORFA - Metagenomic ORFome Assembly	Metagenomic assembly.	http://allie.dbcls.jp/pair/MetaORFA;Metagenomic+ORFome+Assembly.html
11	Other	MetaWRAP	An easy-to-use metagenomic wrapper suite that accomplishes the core tasks of metagenomic analysis from start to finish.	https://github.com/bxlab/metaWRAP https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-018-0541-1
11	Other	MG-RAST	A server for automatic phylogenetic and functional analysis of metagenomes.	https://www.mg-rast.org/
11	Other	MGnify	A resource for the analysis, archiving and browsing of metagenomic and metatranscriptomic data.	https://www.ebi.ac.uk/metagenomics/
11	Other	MOCAT/ MOCAT2	A package for analyzing metagenomics datasets.	https://mocat.embl.de/
11	Other	Mothur	Open-source, expandable software that aims to fill the bioinformatics needs of the microbial ecology community.	https://www.mothur.org/
11	Other	Nextflow	A framework for scalable and reproducible scientific workflows using software containers.	https://www.nextflow.io/
11	Other	OTUreporter	A modular automated pipeline for the analysis and report of amplicon data.	https://bitbucket.org/xvazquezc/otureporter/wiki/Home
11	Other	Perl	A general purpose language for getting things done.	https://www.perl.
11	Other	Python	A general-purpose programming language.	https://www.python.org/
11	Other	QIIME2.0	A platform for performing microbiome analysis from raw DNA sequencing data.	https://qiime2.org/
11	Other	R/RStudio	A development environment for R and Python, with a console and syntax-highlighting editor.	https://rstudio.com/
11	Other	RocksDB	A persistent key-value store for flash and RAM storage.	https://github.com/facebook/rocksdb
11	Other	singularity	Singularity containers can be used to package entire scientific workflows.	https://singularity.lbl.gov/
11	Other	SOAP - Short Oligonucleotide Analysis Package	A suite of bioinformatics software tools from the BGI Bioinformatics department enabling the assembly, alignment, and analysis of next generation DNA sequencing data.	http://manpages.ubuntu.com/manpages/cosmic/man1/soap.1.htm
11	Other	SqueezeMeta	A fully automatic pipeline for metagenomics/metatranscriptomics, covering all steps of the analysis.	https://github.com/jtamames/SqueezeMeta https://www.frontiersin.org/articles/10.3389/fmicb.2018.03349/full#h2
11	Other	VAMPS	A collection of tools for researchers to visualize and analyze data for microbial population structures and distributions.	https://vamps2.mbl.edu/

A complete list of tools with more details is available here.

Appendix 2

Survey questions posed to the Microbiome Research Community

...
