What is metagenomics?

Metagenomics is the study of the structure and function of sequences isolated and analysed from all the organisms in a bulk sample. Metagenomics is often used to study a specific community of microorganisms, such as those residing on human skin, in the soil or in a water sample.

Metagenomics usually refers to microorganism samples, whereas environmental DNA (eDNA), while using overlapping tools and analysis, refers to other groups of organisms (such as metazoans).

For example, a gut content analysis that examines the community structure of bacteria (the microbiome) via 16S amplicon sequencing would typically be called a metagenomic study. If the gut content were instead assessed to explore the animal’s diet (which plants it has eaten, for example) using another amplicon marker (e.g. cytochrome b), it would be an eDNA study.

...

Amplicon vs shotgun (whole genome) sequencing

While whole genome sequencing provides a comprehensive view of all the genetic variation within a sample, amplicon sequencing targets specific genomic regions (such as the 16S rRNA gene). This targeted approach makes amplicon sequencing more cost-effective than whole genome sequencing.

Shotgun metagenomic sequencing, unlike 16S rRNA sequencing, can read all genomic DNA in a specimen rather than just one portion of a particular gene. Shotgun sequencing can simultaneously identify and profile bacteria, fungi, viruses, and a variety of other microorganisms, which is useful for microbiome research.

Pros and cons of amplicon vs whole genome sequencing:

| | Amplicon | Whole genome |
|---|---|---|
| Dataset size | Very small | Medium to very large |
| Computational resources | Small | Medium to very large |
| Price | Low | Medium to high |
| Taxonomic resolution | Mostly genus | Species or strain |
| Functional analysis | Limited | Greater detail |
| Database curation | Detailed | Minimal |
| Taxonomic coverage | Specific (e.g. 16S = bacteria) | All taxa |

image-20240527-224247.png

From “Analysis of the microbiome: Advantages of whole genome shotgun versus 16S amplicon sequencing”:

“Our study demonstrates that whole genome shotgun sequencing has multiple advantages compared with the 16S amplicon method including enhanced detection of bacterial species, increased detection of diversity and increased prediction of genes. In addition, increased length, either due to longer reads or the assembly of contigs, improved the accuracy of species detection.”

Full length amplicon vs hypervariable regions

16S/18S/ITS, etc.

The 16S rRNA gene is about 1,500 bases long. Illumina reads are much shorter than this, so amplicon analysis typically involves sequencing 2-3 ‘hypervariable’ 16S regions.
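The read-length limitation can be illustrated with some simple arithmetic. The sketch below estimates how much of the ~1,500-base gene a single merged read pair could span. The read lengths and the minimum overlap are illustrative assumptions, not values taken from the workshop data, and `merged_span` is a hypothetical helper.

```python
# Toy arithmetic: how much of the ~1,500-base 16S gene can one merged read pair span?
GENE_LENGTH = 1500  # approximate 16S rRNA gene length, as stated in the text


def merged_span(read_length: int, min_overlap: int) -> int:
    """Longest amplicon a read pair can cover if the two reads must overlap by min_overlap bases."""
    return 2 * read_length - min_overlap


# Assumed paired-end read lengths and overlap, chosen only for illustration.
for read_length in (150, 250, 300):
    span = merged_span(read_length, min_overlap=20)
    fraction = span / GENE_LENGTH
    print(f"2 x {read_length} bp reads: up to {span} bp merged ({fraction:.0%} of the gene)")
```

Under these assumptions, even the longest read pair covers well under half of the gene, which is why only a few hypervariable regions are targeted.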

https://www.nature.com/articles/s41598-023-30764-z

...

Different groups of bacteria are better represented by specific regions.

image-20240527-053657.png

With the development of third-generation long-read sequencing, such as Nanopore and PacBio, the full length of the 16S gene can be sequenced. This reduces bias and improves taxonomic resolution.

https://www.nature.com/articles/s41598-020-80826-9

image-20240527-054632.png

NOTE: For eukaryotic (e.g. fungi) metagenomic amplicon sequencing, ITS (Internal transcribed spacer) regions are used.

...

...

https://onlinelibrarydata.wiley.com/doi/full/10.1111/1755-0998.13847

https://nanoporetech.com/resource-centre/epi2me-16s-workflow-real-time-identification-bacteria-and-archaea

eresearchqut.net/paulw/public/mahsa_manuscript2/index.html

ASV vs OTU

Taxonomic assignments in nf-core/ampliseq are based on amplicon sequence variants (ASVs), which are inferred with the DADA2 software package and then classified against the SILVA ribosomal RNA sequence database.

DADA2 infers sample sequences exactly and resolves differences of as little as 1 nucleotide.

SILVA provides comprehensive, quality checked and regularly updated datasets of aligned small (16S/18S, SSU) and large subunit (23S/28S, LSU) ribosomal RNA (rRNA) sequences for all three domains of life (Bacteria, Archaea and Eukarya).

Traditionally, operational taxonomic units (OTUs) have been used in 16S amplicon studies. More recently, ASVs have been preferred because of their improved accuracy in identifying taxa, particularly at the genus and species level. In short, OTUs group sequences by similarity clustering, whereas ASVs are exact sequences inferred from the reads after statistical error correction; taxonomy is then assigned by matching them to a reference database (e.g. SILVA or Greengenes) using confidence thresholds.

OTU clustering typically groups sequences at a 97% similarity threshold, so it cannot reliably separate sequences that differ by less than about 3%, whereas the ASV method can distinguish even single base-pair differences. This enables finer resolution of taxa, down to the genus and species level. Note that there is increasing ‘fuzziness’ toward the lower taxonomic levels, because the diversity within some taxa is greater than the diversity between them and other taxa. In other words, even with ASVs, not all taxa can be resolved to lower taxonomic levels, and this depends heavily on the taxonomic group involved.
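The difference in resolution can be sketched with a toy example. The code below is not DADA2 or any real OTU picker: it uses a simple greedy clustering at 97% identity on made-up, equal-length sequences, and plain exact-sequence counting stands in for ASV output (the error-correction step is ignored). All names (`mutate`, `identity`, `greedy_otus`) and the sequences are hypothetical.

```python
from collections import Counter

SWAP = {"A": "C", "C": "G", "G": "T", "T": "A"}


def mutate(seq, positions):
    """Return seq with the base at each given position changed to a different base."""
    bases = list(seq)
    for i in positions:
        bases[i] = SWAP[bases[i]]
    return "".join(bases)


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads, threshold=0.97):
    """Toy OTU picking: each unique sequence joins the first centroid it matches
    at >= threshold identity, otherwise it becomes a new centroid."""
    centroids = []
    counts = Counter()
    # Visit unique sequences from most to least abundant.
    for seq, n in Counter(reads).most_common():
        for centroid in centroids:
            if identity(seq, centroid) >= threshold:
                counts[centroid] += n
                break
        else:
            centroids.append(seq)
            counts[seq] += n
    return counts


# Made-up 100-base sequences: b differs from a by 1 base (99%), c by 10 bases (90%).
variant_a = "ACGT" * 25
variant_b = mutate(variant_a, [50])
variant_c = mutate(variant_a, range(10))
reads = [variant_a] * 5 + [variant_b] * 3 + [variant_c] * 2

otus = greedy_otus(reads)
asvs = Counter(reads)  # exact sequence variants, standing in for denoised ASVs
print(f"OTUs at 97%: {len(otus)}, exact variants: {len(asvs)}")
```

With this toy input, the 97% clustering merges the one-base variant into its neighbour and reports two OTUs, while exact counting keeps all three variants separate.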

https://www.zymoresearch.com/blogs/blog/microbiome-informatics-otu-vs-asv

image-20240227-041310.png

Illumina vs Nanopore sequencing technologies

...

An important difference between second- and third-generation technologies is accuracy. The error rate (i.e. the proportion of incorrectly called bases) of third-generation platforms has been considerably higher than that of second-generation platforms, with a typical error rate of ~0.1% for Illumina sequences and >5% for Nanopore. The Nanopore error rate has improved dramatically in recent years, but its accuracy is still lower than Illumina's. This higher error rate can cause problems, for example in metagenomics when distinguishing species that differ by only a few base pairs.
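To give a feel for what these error rates mean, the sketch below converts them into expected errors over a 1,500-base stretch (roughly the full 16S gene) and into Phred-style quality scores. It assumes errors are independent and uniformly distributed, which is a simplification, and it uses 5% as an illustrative value for Nanopore based on the ">5%" figure above. The function names are hypothetical.

```python
import math


def expected_errors(length: int, error_rate: float) -> float:
    """Expected number of miscalled bases in a stretch of the given length."""
    return length * error_rate


def prob_error_free(length: int, error_rate: float) -> float:
    """Probability that every base is correct, assuming independent errors."""
    return (1 - error_rate) ** length


def phred(error_rate: float) -> float:
    """Phred quality score corresponding to a per-base error probability."""
    return -10 * math.log10(error_rate)


LENGTH = 1500  # approximate full-length 16S gene

# Per-base error rates quoted in the text (5% used as an illustrative Nanopore value).
for platform, rate in (("Illumina", 0.001), ("Nanopore", 0.05)):
    print(
        f"{platform}: ~{expected_errors(LENGTH, rate):.1f} expected errors per {LENGTH} bases, "
        f"P(no errors) = {prob_error_free(LENGTH, rate):.2e}, "
        f"Phred ~ Q{phred(rate):.0f}"
    )
```

When two species differ by only a handful of bases across the gene, an expected error count in the tens can easily exceed the true biological difference.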

...

https://www.mdpi.com/2073-4425/11/9/1105

2.8. Sequence Data Availability

The Illumina and nanopore sequence datasets of the nose swab samples, generated and analysed in the current study, are available in the European Nucleotide Archive (ENA) under accession number PRJEB28612.

https://www.ebi.ac.uk/ena/browser/view/PRJEB28612

...

File: ena-file-download-read_run-PRJEB28612-submitted_ftp-20240312-0340.sh

...

Why Nextflow?

What Nextflow is was covered in session 1 of these workshops (Installing Nextflow).

“scalable and reproducible scientific workflows using software containers.” https://www.nextflow.io/

Bioinformatics workflows are complicated and are becoming more so. Nextflow enables managed, reproducible, curated bioinformatics workflows to be used by researchers who are not bioinformaticians.

Why do we use Nextflow?

  1. Complexity: Analysis workflows are becoming more complex, with more steps (but producing more accurate, publishable results).

  • Hundreds of published workflows

  • Curated, multi-tool analyses

  • Optimised and improved over multiple versions

  • Detailed output with results, tables, figures, data files

...

  1. Curated: Managed workflows are typically assembled and tested by experts in the field. They are often improved over multiple versions.

  1. Reproducibility: Many published studies cannot be reproduced (the ‘reproducibility crisis’). Nextflow supports version control for both the workflow and the many software tools within it.

...

The Nextflow workflows we’ll be running today are:

nf-core/ampliseq for the Illumina data: https://nf-co.re/ampliseq/2.9.0

wf-metagenomics for the Nanopore data: https://github.com/epi2me-labs/wf-metagenomics
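As a rough illustration of how a workflow version can be pinned for reproducibility, the sketch below assembles a `nextflow run` command that requests release 2.9.0 of nf-core/ampliseq via Nextflow's `-r` (revision) option. The `singularity` profile and the `results` output directory are assumptions made for the example, `nextflow_command` is a hypothetical helper, and the pipeline-specific inputs (sample sheet, primers and so on) are deliberately left out. The code only prints the command; it does not run it.

```python
import shlex


def nextflow_command(pipeline: str, revision: str, profile: str, outdir: str) -> list:
    """Build a `nextflow run` command pinned to a specific pipeline revision."""
    return [
        "nextflow", "run", pipeline,
        "-r", revision,        # pin the workflow version
        "-profile", profile,   # assumed execution profile
        "--outdir", outdir,    # assumed output directory
    ]


cmd = nextflow_command("nf-core/ampliseq", "2.9.0", "singularity", "results")
print(shlex.join(cmd))
```

Recording the exact revision alongside the results means the same workflow version can be rerun later on the same data.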