Overview

What is metagenomics?

Metagenomics is the study of the structure and function of sequences isolated and analysed from all the organisms in a bulk sample. Metagenomics is often used to study a specific community of microorganisms, such as those residing on human skin, in the soil or in a water sample.

Metagenomics usually refers to microorganism samples, whereas environmental DNA (eDNA), while using overlapping tools and analysis, refers to other groups of organisms (such as metazoans).

For example, a gut content analysis that examines the community structure of bacteria (the microbiome) via 16S amplicon sequencing would typically be referred to as a metagenomics study. If the gut content were instead analysed to find out what the animal’s diet was (which plants it has eaten, for example) using another amplicon marker (e.g. cytochrome b), it would be an eDNA study.

Amplicon vs shotgun (whole genome) sequencing

While whole genome sequencing provides a comprehensive view of all the genetic variation within a sample, amplicon sequencing targets specific genomic regions (such as the 16S rRNA gene). This targeted approach makes amplicon sequencing more cost-effective than whole genome sequencing.

Shotgun metagenomic sequencing, unlike 16S rRNA sequencing, can read all genomic DNA in a specimen rather than just one portion of a particular gene. Shotgun sequencing can simultaneously identify and profile bacteria, fungi, viruses, and a variety of other microorganisms, which is useful for microbiome research.

Pros and cons of amplicon vs whole genome sequencing:

	Amplicon	Whole genome
Dataset size	Very small	Medium to very large
Computational resources	Small	Medium to very large
Price	Low	Medium to high
Taxonomic resolution	Mostly genus	Species or strain
Functional analysis	Limited	Greater detail
Database curation	Detailed	Minimal
Taxonomic coverage	Specific (e.g. 16S = bacteria and archaea)	All taxa

Analysis of the microbiome: Advantages of whole genome shotgun versus 16S amplicon sequencing

Full length amplicon vs hypervariable regions

https://nanoporetech.com/resource-centre/epi2me-16s-workflow-real-time-identification-bacteria-and-archaea

https://nf-co.re/ampliseq/2.9.0/docs/usage#taxonomic-classification

ASV vs OTU

Illumina vs Nanopore sequencing technologies

This overview is based on this excellent review article:

https://www.nature.com/articles/s41587-021-01108-x

Nanopore is one of the ‘third generation’ sequencing technologies, which produce fewer but longer reads (~500 bp to 100,000 bp) than the more commonly used ‘second generation’ technologies, most typically represented by Illumina fixed-length, short-read sequencing (~50 bp to 400 bp).

Whereas 2nd generation technology uses massively parallel sequencing (i.e. simultaneously sequencing millions of small DNA fragments annealed to a flow cell), 3rd generation sequencing includes several competing technologies that differ substantially in their underlying chemistry and hardware. The two main companies are Pacific Biosciences ('PacBio') and Oxford Nanopore Technologies ('ONT' or ‘Nanopore’).

An overview of 2nd and 3rd gen sequencing technologies can be seen here: https://www.sciencedirect.com/science/article/pii/S0198885921000628

It’s important to note that a significant difference between 2nd and 3rd generation technology is accuracy. The error rate (i.e. the proportion of incorrectly called bases, which is reflected in low sequencing quality scores) of 3rd gen has been considerably higher than that of 2nd gen, with a typical error rate of ~0.1% for Illumina sequences and >5% for Nanopore. The Nanopore error rate has improved dramatically in recent years, but it is still considerably higher than Illumina’s. This higher error rate can cause issues, such as in metagenomics when identifying species that differ by a small number of base pairs.
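Sequencers report per-base confidence as Phred quality scores, where Q = -10 × log10(P) and P is the probability that the base call is wrong. The short Python sketch below shows how the error rates quoted above map onto quality scores: 0.1% corresponds to Q30 and 5% to roughly Q13. The function names are illustrative, and the quality string decoder assumes the common Phred+33 FASTQ encoding.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score Q into a per-base error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert a per-base error probability into a Phred quality score."""
    return -10 * math.log10(p)


def mean_error_rate(quality_string: str, offset: int = 33) -> float:
    """Mean per-base error probability of one FASTQ quality string.

    Assumes Phred+33 encoding (ASCII code minus 33 gives Q).
    """
    probabilities = [
        phred_to_error_probability(ord(char) - offset) for char in quality_string
    ]
    return sum(probabilities) / len(probabilities)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(f"Q{q}: error probability {phred_to_error_probability(q):.4f}")
    print(f"5% error rate is about Q{error_probability_to_phred(0.05):.1f}")
```

Note that the mean error rate averages probabilities rather than Q scores, because Q is on a logarithmic scale.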

Functionally, the longer 3rd gen reads can offset the higher error rate by providing more bases to match against reference sequences. For 16S rRNA sequencing, short-read Illumina sequences typically cover two 16S hypervariable regions, whereas Nanopore sequences the full ~1.5 kilobase 16S gene, which includes all nine hypervariable regions.
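As a rough back-of-the-envelope illustration of this trade-off, the sketch below estimates how many correctly called bases a single read contributes. The error rates (0.1% and 5%) and the 1,500 bp full-length figure come from the text above; the 450 bp short-read amplicon length is an assumption chosen only for illustration, and the calculation ignores where in the read the errors fall.

```python
def expected_correct_bases(read_length: int, error_rate: float) -> float:
    """Expected number of correctly called bases in one read."""
    return read_length * (1.0 - error_rate)


if __name__ == "__main__":
    # Assumed short-read amplicon spanning two hypervariable regions.
    short_read = expected_correct_bases(read_length=450, error_rate=0.001)
    # Full-length 16S read at the higher Nanopore error rate.
    long_read = expected_correct_bases(read_length=1500, error_rate=0.05)
    print(f"Short read: {short_read:.2f} expected correct bases")
    print(f"Long read:  {long_read:.2f} expected correct bases")
```

Even at the higher error rate, the full-length read still carries roughly three times as many correct bases under these assumptions.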

This 2023 paper compared Illumina and Nanopore shotgun sequencing for identifying bacterial strains with little genomic variation between them. Both Illumina and Nanopore were able to correctly identify the bacterial strains, despite the higher error rate of the Nanopore sequences.

Reference paper

The data used in this workshop are from a paper that compared Illumina and Nanopore 16S datasets.

https://www.mdpi.com/2073-4425/11/9/1105

2.8. Sequence Data Availability
The Illumina and nanopore sequence datasets of the nose swab samples, generated and analysed in the current study, are available in the European Nucleotide Archive (ENA) under accession number PRJEB28612

https://www.ebi.ac.uk/ena/browser/view/PRJEB28612
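One way to find the FASTQ files for this accession is the ENA Portal API file report. The Python sketch below only builds the query URL and parses a tab-separated response; it is an illustration under assumptions, not the workshop's download method. The field names (run_accession, fastq_ftp) are standard ENA read_run fields, and the helper function names are hypothetical.

```python
import csv
import io
from urllib.parse import urlencode

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession, fields=("run_accession", "fastq_ftp")):
    """Build an ENA Portal API file report URL for a study accession."""
    query = urlencode(
        {
            "accession": accession,
            "result": "read_run",
            "fields": ",".join(fields),
            "format": "tsv",
        }
    )
    return f"{ENA_FILEREPORT}?{query}"


def parse_filereport(tsv_text):
    """Map each run accession to its list of FASTQ file locations."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    runs = {}
    for row in reader:
        # Paired-end runs list their files separated by ';'.
        runs[row["run_accession"]] = [f for f in row["fastq_ftp"].split(";") if f]
    return runs


if __name__ == "__main__":
    print(filereport_url("PRJEB28612"))
```

The printed URL can be opened in a browser, or its response text can be passed to parse_filereport to list the files for each run.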

Why Nextflow?

What Nextflow is was covered in session 1 of these workshops (Installing Nextflow).

“scalable and reproducible scientific workflows using software containers.” https://www.nextflow.io/

Bioinformatics workflows are complicated, and are becoming more so. Nextflow makes managed, reproducible, curated bioinformatics workflows usable by researchers who are not bioinformaticians.

Why do we use Nextflow?

Complexity: Analysis workflows are becoming more complex, with more steps (but producing more accurate, publishable results).

100s of published workflows
Curated, multi-tool analyses
Optimised and improved over multiple versions
Detailed output with results, tables, figures, data files

Curated: Managed workflows are typically assembled and tested by experts in the field, and are often improved over multiple versions.

nf-core: a central repository for Nextflow pipelines: https://nf-co.re/pipelines
Nanopore has developed Nextflow workflows via epi2me: https://labs.epi2me.io/wfindex/

Reproducibility: Many published studies can’t be reproduced (the ‘reproducibility crisis’). Nextflow supports version control for both the workflow itself and the multitude of software tools within it.

The Nextflow workflows we’ll be running today are:

nf-core/ampliseq for the Illumina data: https://nf-co.re/ampliseq/2.9.0

wf-metagenomics for the Nanopore data: https://github.com/epi2me-labs/wf-metagenomics
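To give a feel for how these two workflows are launched, here is a minimal Python sketch that assembles the nextflow run command lines. The -r option pins the workflow version, which is what underpins the reproducibility point above. Everything specific here is an assumption for illustration: the file and folder names, the docker and standard profiles, and the primer sequences (placeholders, not necessarily those used in the reference paper). Check each workflow's documentation for the parameters that apply to your data.

```python
import shlex


def nextflow_command(pipeline, revision=None, profile=None, params=None):
    """Assemble a 'nextflow run' command as a list of arguments."""
    cmd = ["nextflow", "run", pipeline]
    if revision:
        cmd += ["-r", revision]  # pin the workflow version
    if profile:
        cmd += ["-profile", profile]  # e.g. which container engine to use
    for name, value in (params or {}).items():
        cmd += [f"--{name}", str(value)]  # pipeline parameters use two dashes
    return cmd


if __name__ == "__main__":
    # Illumina data: all parameter values below are placeholders.
    ampliseq = nextflow_command(
        "nf-core/ampliseq",
        revision="2.9.0",
        profile="docker",
        params={
            "input": "samplesheet.tsv",
            "FW_primer": "GTGYCAGCMGCCGCGGTAA",
            "RV_primer": "GGACTACNVGGGTWTCTAAT",
            "outdir": "results_illumina",
        },
    )
    # Nanopore data: the FASTQ folder name is a placeholder.
    wf_metagenomics = nextflow_command(
        "epi2me-labs/wf-metagenomics",
        profile="standard",
        params={"fastq": "nanopore_fastq"},
    )
    print(shlex.join(ampliseq))
    print(shlex.join(wf_metagenomics))
```

Building the command as a list keeps each argument separate, so it can be printed for review or passed to subprocess.run without shell quoting problems.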
