...

Tiffanie M Nelson and Jeffrey H Christiansen

Contents

1. Executive Summary

...

Community input is welcomed at all times, as is the nomination of additional members of the SIG, either by adding comments directly to this Google document or by emailing communities@biocommons.org.au.

Feedback on the proposed components outlined in this initial draft plan is now sought from the SIG and any other Australian researchers or their collaborators undertaking metagenomics and microbiome analyses.

...

Figure 2: Estimates of the increasing number of microbiome analysis studies conducted in Australia

To estimate the number of microbiome analysis studies that have been conducted historically in Australia, a search of Scopus was conducted for articles with either: A/ ‘shotgun’ or ‘metagenomic’ in the title, abstract, or keyword, and ‘Australia’ in the affiliation; or B/ ‘amplicon’ or ‘microbiome’ or ‘microbiota’ or ‘microbial community’ or ‘virome’ in the title, abstract, or keyword, and ‘genome’ or ‘sequencing’ or ‘sequence’ or ‘genomic’ or ‘next-generation’ in the title, abstract, or keyword, and ‘Australia’ in the affiliation. Articles retrieved from the search were manually reviewed to include only those whose focus included the production of data using a marker gene or metagenomic sequencing method, and to exclude those whose focus was on developing or evaluating analysis methods or tools. Articles retrieved by both searches were counted once, categorised as either marker gene or shotgun. The complete list of citations including abstracts can be found here.
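The two searches described above can be written down as boolean query strings. The following is a minimal sketch only: the field codes `TITLE-ABS-KEY` and `AFFIL` are assumed here to correspond to "title, abstract, or keyword" and "affiliation", and the function names are hypothetical. The exact query used for Figure 2 is not given in this document.

```python
# Search terms taken from the description of searches A and B above.
SEARCH_A_TERMS = ["shotgun", "metagenomic"]
SEARCH_B_TOPIC = ["amplicon", "microbiome", "microbiota", "microbial community", "virome"]
SEARCH_B_METHOD = ["genome", "sequencing", "sequence", "genomic", "next-generation"]


def any_of(terms):
    """Join terms with OR inside parentheses, quoting multi-word phrases."""
    quoted = ['"' + term + '"' if " " in term else term for term in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_queries(country="Australia"):
    """Return the query strings for search A and search B.

    The field codes below are an assumption about the search syntax,
    not something stated in the source text.
    """
    affiliation = "AFFIL(" + country + ")"
    query_a = "TITLE-ABS-KEY" + any_of(SEARCH_A_TERMS) + " AND " + affiliation
    query_b = (
        "TITLE-ABS-KEY" + any_of(SEARCH_B_TOPIC)
        + " AND TITLE-ABS-KEY" + any_of(SEARCH_B_METHOD)
        + " AND " + affiliation
    )
    return query_a, query_b


if __name__ == "__main__":
    for query in build_queries():
        print(query)
```

Keeping the term lists separate from the query construction makes it straightforward to rerun the same searches in later years to extend the estimate.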

In late July 2020, the Australian BioCommons invited over 100 researchers across Australia to participate in a Microbiome Analysis Special Interest Group (SIG). These researchers were identified as having experience or interest in microbiome analysis. Via an online survey (number of respondents = 33), the Australian BioCommons sought information from the SIG about each member’s level of expertise and their current (and desired) practices and infrastructure. An open follow-up video conference was also held to gain further information (minutes and a recording of the meeting are available).

Respondents to the survey and attendees at the meeting collectively indicated that they are performing microbiome analyses on samples from both environmental (i.e. marine, freshwater, soil, and air) and host-associated (e.g. animals, plants, corals, and humans) habitats. The collective responses also indicated that all of the following approaches are being undertaken by Australian researchers: targeted amplicon sequencing, random shotgun sequencing, taxonomic profiling, functional profiling, generating metagenome-assembled genomes (MAGs), phylogenetic analysis, statistical analyses, and novel gene discovery.

...

Based on information received from SIG members through the survey (n=33), most researchers use a combination of sequencing platforms to generate their data, the most popular being Illumina, Nanopore, and PacBio.

...

For functional classification, the Kyoto Encyclopedia of Genes and Genomes (KEGG) databases, which provide information relating to the functional classification of cells and organisms, are accessed by 72% of survey respondents.

3.3.2 Tools

Based on the survey, approximately 100 software tools, pipelines, or packages were identified as being used by respondents for various stages of the microbiome analysis process. These are listed in Appendix 1 of this document.

The data generated in either amplicon marker gene or shotgun metagenomic surveys present a wide variety of possible analysis pathways/workflows to pursue and there are many options for tools/pipelines or processes at each step of a chosen bioinformatic pathway.
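To illustrate how quickly the number of possible pathways grows, the sketch below enumerates every combination of one tool per step. The step names and tools are a small, arbitrary selection from Appendix 1, listed in the appendix's step order. The selection and the function name are illustrative assumptions, not a recommended workflow.

```python
from itertools import product

# A small, illustrative subset of Appendix 1 (not a recommendation).
CANDIDATE_TOOLS = {
    "Quality Control": ["FastQC"],
    "Preprocessing": ["fastp", "Trimmomatic"],
    "Taxonomic classification": ["Kraken2", "Centrifuge"],
    "Sequence assembly": ["MEGAHIT", "metaSPAdes"],
}


def enumerate_pathways(tools_by_step):
    """Return every pathway formed by choosing one tool for each step."""
    steps = list(tools_by_step)
    return [
        dict(zip(steps, combination))
        for combination in product(*tools_by_step.values())
    ]


if __name__ == "__main__":
    pathways = enumerate_pathways(CANDIDATE_TOOLS)
    print(f"{len(pathways)} possible pathways")
    for pathway in pathways:
        print(" -> ".join(pathway.values()))
```

Even with only one or two candidate tools at each of four steps there are already eight distinct pathways, and the count multiplies with every additional tool or step.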

...

Several researchers (36%, n=12) reported that they were not using their preferred tools/pipelines (primarily due to not having access to sufficient computational memory to run these tools - see Section 3.4.1) and instead had resorted to a workaround solution with other tools.

...

Figure 4. Schematic diagram showing the proposed infrastructure to support microbiome analyses, and data flow

(D1) Sequence reads or other relevant data are inputs into the Platform for Taxonomic and Functional Microbiome Analyses which provides a command-line interface (CLI)- or graphical user interface (GUI)-based access to tools and workflows for performing amplicon marker gene clustering or metagenome assembly and classification (blue shapes). It is underpinned by sufficient and appropriate computational infrastructure. Closely associated is a data management platform (denoted by the darker green shape) that caters to data management, version control, and association of appropriate (e.g. sample, experimental) metadata with the data files. Outputs of D1 are accessible to both (D2) hosted frameworks to enable researchers to utilise common packages for statistical analysis, visualisation, and exploration of microbiome datasets, and (D3) systems to enable submission/publishing of metagenome-assembled genome files (and sequence read data) to international repositories. Arrows indicate the general flow of data. Thicker arrows indicate increasing data transfer capabilities. See Appendix 1 for a list of tools/pipelines that may be included in D1. Higher resolution image.

D1 - A platform for performing taxonomic and functional analyses of microbiomes:

To address objective 1 (i.e. providing Australian researchers with access to a selection of tools and workflows, underpinned by computational resources, that allow taxonomic and functional analyses of microbiomes to be performed, whether the data are derived from amplicon/targeted or shotgun/metagenomics-based sequencing approaches), it is proposed to implement a platform in Australia that:

...

D2 - Systems to enable statistical analyses and visualisations of microbial community data:

To address objective 2 (i.e. to make it easier for Australian researchers to perform statistical and visualisation analyses of microbiome data), it is proposed to implement:

...

D3 - Systems to enable submission of raw sequencing reads and metagenome-assembled genome files from Australia to appropriate global repositories:

To address objective 3 (i.e. to make it easier to publish and share high-quality, final metagenome-assembled genomes (and relevant input data) in accordance with best-practice open science guidelines), it is proposed to implement:

  1. A temporary ‘staging post’ in Australia for metagenome and microbiome (and sequence read) files ready for public international release. The system should include data/metadata formatting checks (which would be enabled by the use of the data management platforms described in D1-E), and support as detailed in D1-F;

  2. A rapid data transfer mechanism from the data management platform or the sharing platform to NCBI and/or ENA; and,

  3. Documentation on how to use the system (including a knowledgebase with community-contributed content).
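The data/metadata formatting checks mentioned in item 1 could take many forms. As a rough sketch only, the example below checks that a metadata record has a set of required fields and an ISO 8601 collection date before it is staged. The field names are hypothetical placeholders and are not taken from any repository's actual checklist.

```python
from datetime import date

# Hypothetical required fields; a real staging post would follow the
# checklist of the target repository.
REQUIRED_FIELDS = (
    "sample_id",
    "collection_date",
    "geographic_location",
    "sequencing_platform",
)


def check_record(record):
    """Return a list of problems found in one metadata record (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing {field}")
    collection_date = record.get("collection_date")
    if collection_date:
        try:
            date.fromisoformat(str(collection_date))
        except ValueError:
            problems.append("collection_date is not a valid ISO 8601 date")
    return problems


if __name__ == "__main__":
    record = {
        "sample_id": "S001",
        "collection_date": "31/07/2020",
        "geographic_location": "Australia",
        "sequencing_platform": "Illumina",
    }
    print(check_record(record))
```

Returning a list of problems, rather than failing on the first one, lets a submitter correct a whole record in one pass before it is transferred.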

...

Component

Planned dates for delivery

Notes

D1-Aa. Key tools/workflows installed as modules and optimised for CLI access across a variety of Tier 1 and Tier 2 HPC infrastructures.

Ongoing

As of November 2020, six of the tools listed in Appendix 1 (graftm, groopm, metacv, QIIME, QIIME2.0, SortMeRNA) are installed as modules on QRIScloud/UQ-RCC HPC machines (Tinaroo, Awoonga, FlashLite).

Installation of further tools as modules across NCI, Pawsey, and QRIScloud/UQ-RCC infrastructures to support microbiome analysis is being undertaken in the BioCommons.

Preliminary discussions have been held with the MGnify group at EBI about installing and hosting an instance of MGnify (which offers specialised workflows for three different data types: amplicon, raw metagenomic/metatranscriptomic reads, and assembly) on Australian BioCommons associated infrastructure, and with the Marine Metagenomics group from ELIXIR-Norway about the local installation of the Meta-Pipe workflow (for pre-processing, assembly, taxonomic classification and functional analysis of marine metagenomics data).

D1-Aa. CLI platform appropriately resourced for performing microbiome analyses

Ongoing

BioCommons partner infrastructures at NCI, Pawsey, and QCIF include machines that are capable of performing any part of microbiome analysis. These include FlashLite at QCIF/UQ, which can be structured to allow ‘supernodes’ of up to 8TB.

Mechanisms for increasing access to partner HPC systems, other than the National Computational Merit Allocation Scheme (NCMAS) or partner shares, are under active exploration by the BioCommons.

D1-Ab. Key tools/workflows installed as modules and optimised on Galaxy Australia.

Ongoing

As of November 2020, four of the tools listed in Appendix 1 (maxbin2, metaSPAdes, mothur, SortMeRNA) are installed on Galaxy Australia.

Installation of further tools on Galaxy Australia can be requested by any member of the community at any time.

D1-Ab. Galaxy Australia appropriately resourced for performing microbiome analyses

Q1 2021

In addition to the 465 cores at QCIF, UMelb, and Pawsey that currently underpin Galaxy Australia, the Australian BioCommons has secured ARDC funding to purchase an additional minimum of 1x 4TB and 3x 2TB high memory nodes to contribute computational resources to Galaxy Australia. These nodes will be reserved for specific tools requiring high memory, such as those required for MAG assembly.

D1-Ac. Key tools available as high quality trusted software containers for self-deployment on institutional or independent computational infrastructures

Ongoing

Development of containerised tools to support various life science researcher communities in Australia (including microbiome analysis) is being undertaken in the BioCommons.

D1-B. Connectable to nationally available storage (e.g. Cloudstor)

Ongoing

In late 2020, a direct connection between .

Streamlined connectivity of Cloudstor storage to Pawsey, QCIF, NCI, and other computational resources will continue in the BioCommons.

D1-C/D2-B. Appropriate user authorisation and sharing mechanisms

Ongoing

AAF is currently engaged by the BioCommons to explore Access and Authentication Frameworks that will be fit for purpose across all envisaged BioCommons-related platforms and services.

D1-G. Tool and software workflow documentation with community contributed content.

Ongoing

Tool and workflow documentation for other researcher communities (e.g. de novo genome assembly, and genome annotation) is being organised via an Australian BioCommons GitHub: https://github.com/australianbiocommons. This avenue is available for the microbiome analysis community.

D1-H. Training re. containerisation of software tools.

Ongoing

Introductory level training around software containerisation (co-organised by BioCommons and Pawsey) occurred in June/July 2020 and will be repeated throughout 2021, 2022, and 2023. See https://www.biocommons.org.au/events/containers-intro and the Australian for recordings of these events.

...

Component

Notes

D1-D. A data management system that is tightly linked to the Microbiome Platforms

Considerations for what may be the best technical solution are ongoing. See Requirements of a Data Management Component of the Australian

D1-H. Training re. taxonomic and functional bioinformatics of shotgun and targeted sequencing projects

Discussions with EBI to potentially deliver microbiome analysis related bioinformatics training events to an Australian audience during 2021 or 2022 have begun.

D2-A. Hosted frameworks to enable researchers to utilise common packages for statistical analysis, visualisation, and exploration of microbiome datasets

‘Interactive environments’ offered through the Galaxy platform include RStudio, JupyterLab, CloudStor SWAN, and Phinch. These are currently available publicly through the European public Galaxy instance (see https://live.usegalaxy.eu/), and are planned for release via Galaxy Australia in Q1 2021. Galaxy Interactive environments may represent an option for this feature.

D3-A and D3-B. A temporary ‘staging post’ in Australia for metagenome and microbiome (and sequence read) files ready for public international release, with a rapid data transfer from the data management platform or the sharing platform to NCBI and/or ENA

COPO is a GUI-based metadata platform for brokering life science data submissions to various repositories including the ENA (see https://f1000research.com/articles/9-495).

It is being adopted by the Darwin Tree of Life project in the UK as the tool to enable the data and metadata submission to ENA to be completed for genome assemblies of over 60,000 species native to the British Isles.

The Australian BioCommons is currently exploring whether a locally supported COPO instance can fulfil the requirements of D3-A/D3-B.

...

Workflow Step

High-level component

Tool

Brief description

Link to data/software or article

1

Quality Control

FastQC

Provides a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines.

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

2

Preprocessing

BLAST+

A suite of command-line tools for running BLAST, which searches for regions of similarity between sequences.

https://blast.ncbi.nlm.nih.gov/Blast.cgi

2

Preprocessing

ChimeraSlayer

A chimeric sequence detection utility, compatible with near-full length Sanger sequences and shorter 454-FLX sequences (~500 bp).

http://microbiomeutil.sourceforge.net/

2

Preprocessing

fastp

Tool designed to provide fast all-in-one preprocessing for FastQ files.

https://github.com/OpenGene/fastp

2

Preprocessing

FASTX-Toolkit

A collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.

http://hannonlab.cshl.edu/fastx_toolkit/

2

Preprocessing

FLASH - Fast Length Adjustment of SHort reads

A very fast and accurate software tool to merge paired-end reads from next-generation sequencing experiments.

https://ccb.jhu.edu/software/FLASH/

2

Preprocessing

MultiQC

A reporting tool that parses summary statistics from results and log files generated by other bioinformatics tools.

https://multiqc.info/docs/

2

Preprocessing

PANDAseq

A program to align Illumina reads, optionally with PCR primers embedded in the sequence, and reconstruct an overlapping sequence.

https://github.com/neufeld/pandaseq

2

Preprocessing

PEAR - Paired-End reAd mergeR

A fast and accurate Illumina Paired-End reAd mergeR.

https://cme.h-its.org/exelixis/web/software/pear/doc.html

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3933873/

2

Preprocessing

Prinseq

Easy and rapid quality control and data preprocessing of genomic and metagenomic datasets.

http://prinseq.sourceforge.net/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3051327/

2

Preprocessing

Prinseq++

A program to filter, reformat or trim genomic and metagenomic sequence data.

https://github.com/Adrian-Cantu/PRINSEQ-plus-plus

2

Preprocessing

SortMeRNA

A tool for filtering, mapping, and OTU-picking NGS reads in metatranscriptomic and metagenomic data.

https://github.com/biocore/sortmerna

2

Preprocessing

Tagcleaner

A tool to automatically detect and efficiently remove tag sequences.

http://tagcleaner.sourceforge.net/

2

Preprocessing

Trimmomatic

A flexible read trimming tool for Illumina NGS data.

http://www.usadellab.org/cms/?page=trimmomatic

2

Preprocessing

UCHIME/ UCHIME2

Chimera detection tool.

https://www.drive5.com/usearch/manual/uchime2_algo.html

https://www.biorxiv.org/content/10.1101/074252v1.full

2

Preprocessing

VSEARCH

Processes and prepares metagenomics, genomics, and population genomics nucleotide sequence data.

https://github.com/torognes/vsearch

3

OTU/ASV picking and clustering

UPARSE

A method for generating clusters (OTUs) from next-generation sequencing reads

http://drive5.com/uparse/

3

OTU/ASV picking and clustering

USEARCH

A unique sequence analysis tool with thousands of users worldwide.

https://www.drive5.com/usearch/

4

Taxonomic classification

Centrifuge

A very rapid and memory-efficient system for the classification of DNA sequences from microbial samples.

https://ccb.jhu.edu/software/centrifuge/

4

Taxonomic classification

Focus

An agile composition based approach using non-negative least squares (NNLS) to report the organisms present in metagenomic samples and profile their abundances.

https://peerj.com/articles/425/

4

Taxonomic classification

Gist

A statistical classifier for taxonomic inference for mRNA reads

https://github.com/rhetorica/gist

4

Taxonomic classification

graftm

A tool to identify and classify marker genes in short read datasets.

https://geronimp.github.io/graftM/

4

Taxonomic classification

GTDB-Tk

A computationally efficient toolkit able to classify thousands of draft genomes in parallel.

https://github.com/Ecogenomics/GTDBTk

4

Taxonomic classification

Kraken/ Kraken2

A taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds.

https://ccb.jhu.edu/software/kraken2/

4

Taxonomic classification

MetaCV

A composition and phylogeny-based algorithm to classify very short metagenomic reads (75-100 bp) into specific taxonomic and functional groups.

https://sourceforge.net/projects/metacv/

4

Taxonomic classification

MetaPhyler

A novel taxonomic classifier for metagenomic shotgun reads, which uses phylogenetic marker genes as a taxonomic reference.

http://metaphyler.cbcb.umd.edu/

4

Taxonomic classification

PhymmBL

a new classification approach for metagenomics data which uses interpolated Markov models (IMMs) to taxonomically classify DNA sequences, c

https://www.cbcb.umd.ed

5

Sequence assembly

AMOS/ MetAMOS

An open-source, modular assembly pipeline built upon AMOS and tailored specifically for metagenomic next-generation sequencing data

https://genomebiology.biomedcentral.com/articles/10.1186/gb-2011-12-s1-p25

5

Sequence assembly

BinSanity

A suite of scripts designed to cluster contigs generated from metagenomic assembly into putative genomes.

https://github.com/edgraham/BinSanity

5

Sequence assembly

Flye

A de novo assembler for single molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies.

https://github.com/fenderglass/Flye

5

Sequence assembly

GATB-minia-pipeline

A de novo assembly pipeline for Illumina data.

https://github.com/GATB/gatb-minia-pipeline

5

Sequence assembly

groopm

A metagenomics binning suite.

http://ecogenomics.github.io/GroopM/

5

Sequence assembly

IDBA-UD

Designed to utilize paired-end reads to assemble low-depth regions and use progressive depth on contigs to reduce errors in high-depth regions.

https://github.com/loneknightpy/idba

https://pubmed.ncbi.nlm.nih.gov/22495754/

5

Sequence assembly

MaxBin/ MaxBin2

Software for binning assembled metagenomic sequences based on an expectation-maximization algorithm.

https://toolshed.g2.bx.psu.edu/view/mbernt/maxbin2/cfd50144a871

5

Sequence assembly

MEGAHIT

An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph.

https://github.com/voutcn/megahit

5

Sequence assembly

Meta-IDBA

An algorithm for assembling reads in metagenomic data, which contain multiple genomes from different species.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3117360/

5

Sequence assembly

MetaBAT2

Clusters metagenomic contigs into different "bins", each of which should correspond to a putative genome.

https://kbase.us/applist/apps/metabat/run_metabat/release

5

Sequence assembly

MetaCluster

Unsupervised binning method for metagenomic sequences.

https://github.com/mbanf/METACLUSTER

5

Sequence assembly

metaSPAdes

A versatile metagenomic assembler

http://spades.bioinf.spbau.ru/release3.11.1/manual.html

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411777/

5

Sequence assembly

MetaVelvet

An extension of Velvet assembler to de novo metagenome assembly from short sequence reads

http://metavelvet.dna.bio.keio.ac.jp/

https://pubmed.ncbi.nlm.nih.gov/22821567/

5

Sequence assembly

MIRA

DNA sequence data assembler/mapper for whole genome and EST/RNASeq projects.

http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_intro_whatismira

5

Sequence assembly

S-GSOM

Binning sequences using very sparse labels within a metagenome.

https://bmcbioinformatics.biomedcentral

5

Sequence assembly

SOAPdenovo2

A novel short-read assembly method that can build a de novo draft assembly for human-sized genomes.

https://github.com/aquaskyline/SOAPdenovo2

5

Sequence assembly

SPAdes - St. Petersburg genome assembler

An assembly toolkit containing various assembly pipelines.

https://cab.spbu.ru/software/spades/

5

Sequence assembly

Unicycler

An assembly pipeline for bacterial genomes.

https://github.com/rrwick/Unicycler

5

Sequence assembly

Velvet

A de novo genome assembler specially designed for short read sequencing technologies, such as Solexa or 454.

https://www.ebi.ac.uk/~zerbino/velvet/

6

Gene prediction and alignment

AMR++

A bioinformatics pipeline that interfaces with MEGARes to identify and quantify AMR gene accessions contained within a metagenomic sequence dataset.

https://academic.oup.com/nar/art

6

Gene prediction and alignment

BBMap

Splice-aware global aligner for DNA and RNA sequencing reads. It can align reads from all major platforms.

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/

6

Gene prediction and alignment

BLAT

Accurate and 500 times faster than popular existing tools for mRNA/DNA alignments.

https://genome.cshlp.org/content/12/4/656

6

Gene prediction and alignment

BMGE - Block Mapping and Gathering with Entropy

Designed to select regions in a multiple sequence alignment that are suited for phylogenetic inference.

https://bmcevolbiol.biomedcentral.com/articles/10.1186/1471-2

6

Gene prediction and alignment

Bowtie/ Bowtie2

An ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences.

http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#getting-started-with-bowtie-2-lambda-phage-example

6

Gene prediction and alignment

BWA

A software package for mapping low-divergent sequences against a large reference genome, such as the human genome.

http://bio-bwa.sourceforge.net/

6

Gene prediction and alignment

CD-HIT

A very widely used program for clustering and comparing protein or nucleotide sequences.

http://weizhongli-lab.org/cd-hit/

6

Gene prediction and alignment

DIAMOND

A sequence aligner for protein and translated DNA searches, designed for high performance analysis of big sequence data.

http://www.diamondsearch.org/index.php

6

Gene prediction and alignment

GlimmerMG

A system for finding genes in environmental shotgun DNA sequences.

http://www.cbcb.umd.edu/software/glimmer-mg/

6

Gene prediction and alignment

HMMER

Biosequence analysis using profile hidden Markov models.

http://hmmer.org/

6

Gene prediction and alignment

Infernal - INFERence of RNA ALignment

A useful tool for identifying RNAs in metagenomics data sets.

http://eddylab.org/infernal/

6

Gene prediction and alignment

IQ-TREE

Phylogenetic tree inference by maximum likelihood.

http://www.iqtree.org/

6

Gene prediction and alignment

MAFFT - Multiple Alignment with Fast Fourier Transform

A multiple sequence alignment program.

http://evomics.org/resources/software/bioinformatics-software/mafft/

6

Gene prediction and alignment

mauve

A system for constructing multiple genome alignments in the presence of large-scale evolutionary events such as rearrangement and inversion.

http://darlinglab.org/mauve/mauve.html

6

Gene prediction and alignment

MetaGene Annotator

A gene-finding program for prokaryote and phage.

http://metagene.nig.ac.jp/

6

Gene prediction and alignment

MetaGeneMark

Novel genomic sequences can be analyzed either by the self-training program GeneMarkS (sequences longer than 50 kb) or by GeneMark.hmm.

http://exon.gatech.edu/meta_gmhmmp.cgi

6

Gene prediction and alignment

Minimap2

A general-purpose alignment program to map DNA or long mRNA sequences against a large reference database.

https://github.com/lh3/minimap2

https://academic.oup.com/bioinformatics/article/34/18/3094/4994778

6

Gene prediction and alignment

MinPath/ MinPath2

A tool (Minimal set of Pathways) for biological pathway reconstruction using protein family predictions.

https://omics.informatics.indiana.edu/MinPath/

http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000465

6

Gene prediction and alignment

NAST-iEr

Aligns a single raw nucleotide sequence against one or more NAST formatted sequences.

http://microbiomeutil.sourceforge.net/#A_NASTiEr

6

Gene prediction and alignment

PhyloSift

A suite of software tools to conduct phylogenetic analysis of genomes and metagenomes.

https://github.com/gjospin/PhyloSift

6

Gene prediction and alignment

PSORTm / PSORTb

For protein subcellular localization prediction (SCL).

https://www.psort.org/psortm/

6

Gene prediction and alignment

pyani

A Python package and standalone program for calculation of whole-genome similarity measures.

https://pyani.readthedocs.io/_/downloads/en/latest/pdf/

6

Gene prediction and alignment

TETRA

A web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences.

https://bmcbioinformatics.bio

6

Gene prediction and alignment

tRNAscan-SE

The de facto tool for predicting tRNA genes in whole genomes.

http://trna.ucsc.edu/tRNAscan-SE/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6768409/

7

Annotation prediction

BlastKOALA/ GhostKOALA

Automatic annotation servers for genome and metagenome sequences, which perform KO (KEGG Orthology) assignments to characterize individual gene functions and reconstruct KEGG pathways.

https://www.sciencedirect.com/science/article/pii/S002228361500649X

7

Annotation prediction

dbCAN

A web server for automated Carbohydrate-active enzyme ANnotation.

http://bcb.unl.edu/dbCAN2/

7

Annotation prediction

eggNOG-mapper

A tool for fast functional annotation of novel sequences.

https://github.com/eggnogdb/eggnog-mapper

7

Annotation prediction

KAAS - KEGG Automatic Annotation Server

Provides functional annotation of genes by BLAST or GHOST comparisons against the manually curated KEGG GENES database.

https://www.genome.jp/kegg/kaas/

7

Annotation prediction

KofamKOALA

A web server to assign KEGG Orthologs (KOs) to protein sequences by homology search.

https://www.genome.jp/tools/kofamkoala/

https://academic.oup.com/bioinformatics/article/36/7/2251/5631907

7

Annotation prediction

PICRUSt/ PICRUSt2

A method to predict approximate functional potential of a community based on marker gene sequencing profiles.

https://github.com/picrust/picrust2

https://www.biorxiv.org/content/10.1101/672295v1.full

7

Annotation prediction

PROKKA

Annotation tool for bacterial, archaeal, and viral genomes.

http://www.metagenomics.wiki/tools/annotation/prokka

7

Annotation prediction

SUPER-FOCUS

A tool for functional analysis of metagenomes that uses the SEED database.

https://github.com/metageni/SUPER-FOCUS

7

Annotation prediction

Tax4Fun2

An R-based tool for the rapid prediction of habitat-specific functional profiles and functional redundancy based on 16S rRNA gene sequences.

https://sourceforge.net/projects/tax4fun2/

https://www.biorxiv.org/content/10.1101/490037v1.full.pdf

8

Assembly Validation

CheckM

A set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes.

https://ecogenomics.github.io/CheckM/

8

Assembly Validation

CheckV

For assessing the quality of metagenome-assembled viral genomes.

https://www.biorxiv.org/content/10.1101/2020.05.06.081778v1

8

Assembly Validation

CompareM

A software toolkit which supports performing large-scale comparative genomic analyses. It provides statistics across sets of genomes (e.g., amino acid identity) and for individual genomes.

https://github.com/dparks1134

8

Assembly Validation

Valet

Evaluating metagenomic assemblies.

https://github.com/marbl/VALET

9

Statistical analysis and visualisation

DADA2

Fast and accurate sample inference from amplicon data with single-nucleotide resolution.

https://benjjneb.github.io/dada2/index.html

9

Statistical analysis and visualisation

Krona

Allows hierarchical data to be explored with zooming, multi-layered pie charts.

https://github.com/marbl/Krona/wiki

9

Statistical analysis and visualisation

metagenomeSeq

Designed to determine features (be it Operational Taxonomic Unit (OTU), species, etc.) that are differentially abundant between two or more groups.

https://www.bi

9

Statistical analysis and visualisation

MetaPath

Identify differentially abundant pathways in metagenomic data-sets.

https://www.cbcb.umd.edu/software/metapath

9

Statistical analysis and visualisation

Phyloseq

A set of classes and tools to facilitate the import, storage, analysis, and graphical display of microbiome census data.

https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html

10

Databases

CAZy - Carbohydrate-Active enZYmes Database

Describes the families of structurally-related catalytic and carbohydrate-binding modules (or functional domains) of enzymes that degrade, modify, or create glycosidic bonds.

http://www.cazy.org/

10

Databases

COG - Clusters of Orthologous Groups of proteins

A system for delineating Clusters of Orthologous Groups of proteins (COGs), i.e. clusters of predicted orthologs, from the sequenced genomes of prokaryotes and unicellular eukaryotes.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC222959/

10

Databases

Cyanorak

Cyanorak Information system is a bioinformatics tool dedicated to the curation, comparison and visualization of genomes of strains belonging to the subsection I, cluster 5, a deeply branching group within the Cyanobacteria phylum.

http://applic

10

Databases

EBI

European Bioinformatics Institute.

https://www.ebi.ac.uk/

10

Databases

eggNOG

A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses.

http://eggnog5.embl.de/#/app/home

10

Databases

FunGuild

A Python-based tool that can be used to taxonomically parse fungal OTUs by ecological guilds independent of sequencing platforms or analysis pipelines.

http://www.funguild.org/

10

Databases

Greengenes

16S rRNA gene database or experimental datasets.

https://greengenes.secondgenome.com/

10

Databases

GTDB

Genome taxonomy database.

https://gtdb.ecogenomic.org/

10

Databases

InterPro

Functional analysis of proteins by classifying them into families and predicting domains and important sites.

https://www.ebi.ac.uk/interpro/

10

Databases

KEGG - Kyoto Encyclopedia of Genes and Genomes

A database resource for understanding high-level functions and utilities of the biological system.

https://www.genome.jp/kegg/

10

Databases

KOG - euKaryotic Orthologous Groups

A eukaryote-specific version of the Clusters of Orthologous Groups (COG) tool for identifying orthologous and paralogous proteins.

https://mycocosm.

https://www.hsls.pitt.edu/obrc/index.php?page=URL1144075392

10

Databases

MAR

Marine databases; MarRef, MarDB and MarCat, which are publicly available resources that promote marine research and innovation.

https://mmp.sfb.uit.no/databases/

https://academic.oup.com/nar/article/46/D1/D692/4584637

10

Databases

MEROPS

An information resource for peptidases (also termed proteases, proteinases and proteolytic enzymes) and the proteins that inhibit them.

https://www.ebi.ac.uk/merops/

https://academic.oup.com/nar/article/46/D1/D624/4626772

10

Databases

MetaCyc

A curated database of experimentally elucidated metabolic pathways from all domains of life.

https://metacyc.org/

10

Databases

NCBI

National Center for Biotechnology Information.

https://www.ncbi.nlm.nih.gov/

10

Databases

PANTHER - Protein ANalysis THrough Evolutionary Relationships

Designed to classify proteins (and their genes) in order to facilitate high-throughput analysis.

http://www.pantherdb.org/data/

10

Databases

Pfam

A large collection of protein families.

https://pfam.xfam.org/

10

Databases

PR2

A reference database of carefully annotated 18S rRNA sequences using eight unique taxonomic fields.

https://pr2-database.org/

10

Databases

RDP

Provides the research community with aligned and annotated rRNA gene sequence data.

http://rdp.cme.msu.edu/

https://www.ncbi.nlm.nih.gov/pm

10

Databases

Rfam

A collection of RNA families, each represented by multiple sequence alignments.

https://rfam.xfam.org/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383904/

10

Databases

SEED

A resource that provides consistent and accurate genome annotations across thousands of genomes and serves as a platform for discovering and developing de novo annotations.

https://pubseed.theseed.org/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965101/

10

Databases

SILVA

Comprehensive, quality-checked and regularly updated datasets of aligned small (16S/18S, SSU) and large subunit (23S/28S, LSU) ribosomal RNA (rRNA) sequences for all three domains of life (Bacteria, Archaea and Eukarya).

https://www.arb-silva.de/

10

Databases

TARA Oceans

Diversity, evolution and ecology of marine plankton.

https://www.ebi.ac.uk/services/tara-oceans-data

http://www.taraoceans-dataportal.org/top/;jsessionid=07217630362165E3CD27AA73D839945D?execution=e1s1

10

Databases

TCDB

A comprehensive, IUBMB-approved classification system for membrane transport proteins, known as the Transporter Classification (TC) system.

http://www.tcdb.org/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1334385/

10

Databases

TIGRFAM

A resource consisting of curated multiple sequence alignments, Hidden Markov Models (HMMs) for protein sequence classification, and associated information designed to support automated annotation of (mostly prokaryotic) proteins.

http://tigrfams.jcvi.org/cgi-bin/index.cgi

11

Other

Anvi'o

An open-source, community-driven analysis and visualization platform for microbial ‘omics.

http://merenlab.org/software/anvio/

11

Other

Calypso

Easy-to-use online software that allows non-expert users to mine, interpret and compare taxonomic information from metagenomic or 16S rDNA datasets.

http://cgenome.net/wiki/index.php/Calypso

11

Other

CLC Genomics Workbench

A bioinformatics software solution for comprehensive analysis of NGS data, including de novo assembly of whole genomes and transcriptomes and resequencing analysis.

https://digitalinsights.qiagen.com/products-overview/discovery-insights-portfolio/analysis-and-visualization/qiagen-clc-genomics-workbench/

11

Other

conda

An open source package management system and environment management system that runs on Windows, macOS and Linux.

https://docs.conda.io/en/latest/

11

Other

Galaxy Australia

Galaxy is a web-based analysis and workflow platform.

https://usegalaxy.org.au/

11

Other

GROMACS

A versatile package to perform molecular dynamics.

http://www.gromacs.org/

11

Other

IMG/M

A platform to support the annotation, analysis and distribution of microbial genome and microbiome datasets.

https://img.jgi.doe.gov/

11

Other

Jupyter Notebook

An open-source web application that allows you to create and share documents that contain live code.

https://jupyter.org/

11

Other

MEGAN - MEtaGenome ANalyzer

A comprehensive toolbox for interactively analyzing microbiome data.

https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/megan6/

11

Other

MetaORFA - Metagenomic ORFome Assembly

Metagenomic assembly.

http://allie.dbcls.jp/pair/MetaORFA;Metagenomic+ORFome+Assembly.html

11

Other

MetaWRAP

An easy-to-use metagenomic wrapper suite that accomplishes the core tasks of metagenomic analysis from start to finish.

https://github.com/bxlab/metaWRAP

https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-018-0541-1

11

Other

MG-RAST

A server for automatic phylogenetic and functional analysis of metagenomes.

https://www.mg-rast.org/

11

Other

MGnify

A resource for the analysis, archiving and browsing of metagenomic and metatranscriptomic data.

https://www.ebi.ac.uk/metagenomics/

11

Other

MOCAT/MOCAT2

A package for analyzing metagenomics datasets.

https://mocat.embl.de/

11

Other

Mothur

Open-source, expandable software to fill the bioinformatics needs of the microbial ecology community.

https://www.mothur.org/

11

Other

Nextflow

A tool that enables scalable and reproducible scientific workflows using software containers.

https://www.nextflow.io/

11

Other

OTUreporter

A modular, automated pipeline for the analysis and reporting of amplicon data.

https://bitbucket.org/xvazquezc/otureporter/wiki/Home

11

Other

Perl

A general purpose language for getting things done.

https://www.perl.

11

Other

Python

A general-purpose programming language.

https://www.python.org/

11

Other

QIIME 2

A platform for performing microbiome analysis from raw DNA sequencing data.

https://qiime2.org/

11

Other

R/RStudio

An integrated development environment for R and Python, with a console and a syntax-highlighting editor.

https://rstudio.com/

11

Other

RocksDB

A persistent key-value store for flash and RAM storage.

https://github.com/facebook/rocksdb

11

Other

Singularity

Singularity containers can be used to package entire scientific workflows.

https://singularity.lbl.gov/

11

Other

SOAP - Short Oligonucleotide Analysis Package

A suite of bioinformatics software tools from the BGI Bioinformatics department enabling the assembly, alignment and analysis of next-generation DNA sequencing data.

http://manpages.ubuntu.com/manpages/cosmic/man1/soap.1.htm

11

Other

SqueezeMeta

A fully automatic pipeline for metagenomics/metatranscriptomics, covering all steps of the analysis.

https://github.com/jtamames/SqueezeMeta

https://www.frontiersin.org/articles/10.3389/fmicb.2018.03349/full#h2

11

Other

VAMPS

A collection of tools for researchers to visualize and analyze data for microbial population structures and distributions.

https://vamps2.mbl.edu/

A complete list of tools with more details is available here.

Appendix 2

Survey questions posed to the Microbiome Research Community

...