6  Wrap-up, differential abundance, and further reading

The workshop took you from DADA2 output to a set of defensible biological conclusions. The decisions you made along the way were:

Stage                Decision                                     What we chose
-------------------  -------------------------------------------  ----------------------------------------------------------------
Object construction  Which data structure?                        phyloseq (with a forward-pointer to TreeSummarizedExperiment)
Decontamination      How to handle blanks and organelles?         Manual blank threshold + chloroplast/mitochondria removal
Filtering            Which ASVs to keep?                          Singleton removal (taxa_sums > 1)
Alpha diversity      Which indices?                               Observed, Shannon, Pielou’s evenness
Beta diversity       Which distance?                              Bray-Curtis on relative abundance, Aitchison on CLR
Significance test    Which test for community-level differences?  PERMANOVA via adonis2
Composition view     Which taxonomic level?                       Genus, abundance + prevalence threshold via aggregate_rare()

The same pipeline applies to your own data. The values you choose at each step will be different, but the shape of the workflow is the same.
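
As a reminder of what the alpha- and beta-diversity choices in the table compute, here is a minimal base-R sketch on a toy count matrix. The counts are invented, the pseudocount of 1 before the CLR is an assumption made for illustration, and in the workshop itself these quantities came from package functions rather than hand-written code.

```r
# Toy count matrix: 4 samples (rows) x 5 taxa (columns). Values are invented.
counts <- matrix(
  c(120, 30,  0,  5, 45,
     10, 80, 60,  0, 50,
    200,  0,  0,  0,  0,
     40, 40, 40, 40, 40),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("S", 1:4), paste0("ASV", 1:5))
)

# Relative abundance (total-sum scaling): each row sums to 1
rel <- counts / rowSums(counts)

# Alpha diversity: observed richness, Shannon index, Pielou's evenness
observed <- rowSums(counts > 0)
shannon  <- apply(rel, 1, function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
})
# Evenness is undefined when only one taxon is observed
pielou <- ifelse(observed > 1, shannon / log(observed), NA_real_)

# Bray-Curtis on relative abundance: sum|x - y| / sum(x + y)
bray <- function(x, y) sum(abs(x - y)) / sum(x + y)
idx  <- seq_len(nrow(rel))
bc   <- outer(idx, idx, Vectorize(function(i, j) bray(rel[i, ], rel[j, ])))
dimnames(bc) <- list(rownames(rel), rownames(rel))

# CLR per sample (pseudocount of 1 is an assumption to avoid log(0));
# the Aitchison distance is the Euclidean distance between CLR vectors
clr <- t(apply(counts + 1, 1, function(x) log(x) - mean(log(x))))
aitchison <- as.matrix(dist(clr))

round(cbind(observed, shannon, pielou), 3)
round(bc, 3)
round(aitchison, 3)
```

Comparing the two distance matrices on your own data is a quick way to see whether a conclusion depends on the choice of distance.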

7 What we did not cover: differential abundance

Differential abundance (DA) analysis tests which specific taxa differ between groups. It is the natural next step after a PERMANOVA shows that groups differ overall.

We omitted DA from the live session for two reasons:

  1. The dataset has n = 3 per group per time point, which is below the threshold where most DA methods produce well-calibrated results.
  2. The composition and ordination plots already show the dominant biological story (Lactobacillus inoculation drives near-monoculture Lactiplantibacillus dominance, while spontaneous fermentations show a diverse succession). When the visual signal is this strong, DA testing is more confirmatory than exploratory.
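
One way to get a feel for the first reason is to look at a simple rank-based test. This is only an illustration with invented numbers: the DA methods discussed below use model-based tests rather than a plain Wilcoxon test, but they face the same shortage of information when each group has three samples.

```r
# Most extreme outcome possible with n = 3 per group: complete separation.
# The values are invented for illustration.
group_a <- c(1, 2, 3)
group_b <- c(10, 20, 30)

# Exact two-sided Wilcoxon rank-sum test
p_min <- wilcox.test(group_a, group_b, exact = TRUE)$p.value
p_min

# The same number from first principles: of the choose(6, 3) = 20 ways to
# split six values into two groups of three, 2 are this extreme (one per
# direction)
2 / choose(6, 3)
```

Even a perfectly separated taxon cannot go below this p-value with a rank test at n = 3 versus 3, and that is before any multiple-testing correction across taxa.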

For datasets where DA is appropriate, the section below provides a self-study template using MaAsLin3.

7.1 Why MaAsLin3

There are many DA methods for microbiome data. Among them, MaAsLin3 has several properties that make it a sensible default for typical study designs:

  1. It models prevalence and abundance separately, returning two sets of results. A taxon can be associated with a covariate by being more often present in one group, more abundant when present, or both.
  2. It handles continuous and categorical covariates and supports random effects for repeated measurement designs.
  3. It produces both per-taxon results and diagnostic plots out of the box, which is helpful for exploratory work.
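
The idea behind the first point can be sketched in base R with two separate models for one simulated taxon. This is not MaAsLin3's implementation, which adds normalization and further modelling steps; the data, effect sizes, and variable names below are all invented for illustration.

```r
set.seed(1)
n     <- 40
group <- factor(rep(c("A", "B"), each = n / 2))

# Simulated taxon: present more often in group B, and more abundant in
# group B when it is present
present   <- rbinom(n, 1, ifelse(group == "B", 0.9, 0.4))
rel_abund <- present * rlnorm(n, meanlog = ifelse(group == "B", -3, -5),
                              sdlog = 0.5)
dat <- data.frame(group, present, rel_abund)

# Part 1, prevalence: logistic regression on presence/absence
fit_prev <- glm(present ~ group, family = binomial, data = dat)

# Part 2, abundance: linear model on log2 abundance, non-zero samples only
fit_abun <- lm(log2(rel_abund) ~ group, data = subset(dat, present == 1))

# Two separate answers for the same taxon and the same covariate
coef(summary(fit_prev))["groupB", ]
coef(summary(fit_abun))["groupB", ]
```

Keeping the two parts apart means zeros inform the prevalence model only, instead of being folded into the abundance estimate.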

Other methods you should know about:

  1. ALDEx2: uses CLR-transformed values and a Bayesian framework. Strong on compositional principles.
  2. ANCOM-BC2: bias-corrected log-ratio approach with formal control of the false discovery rate.
  3. LinDA: fast linear-model-based method that handles large datasets well.
  4. corncob: beta-binomial regression on counts, models dispersion explicitly per taxon.

A useful comparison of these methods on real datasets is Nearing et al. (2022), Nature Communications [CITATION NEEDED: confirm full citation]. The general lesson from that paper is that different methods can disagree substantially on the same data, and that agreement across methods is a useful proxy for confidence in a finding.
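
If you do run more than one method, agreement can be tallied in a few lines. The sketch below uses placeholder method and genus names, not real results, and the "at least two methods" rule is an arbitrary choice for illustration.

```r
# Hypothetical lists of significant genera from three DA methods
# (placeholder names, not real output)
hits <- list(
  method_a = c("Genus1", "Genus2", "Genus3", "Genus5"),
  method_b = c("Genus1", "Genus3", "Genus4"),
  method_c = c("Genus1", "Genus2", "Genus3")
)

# Count how many methods flagged each genus
votes <- sort(table(unlist(hits)), decreasing = TRUE)
votes

# Keep genera flagged by at least two of the three methods
consensus <- names(votes)[votes >= 2]
consensus
```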

7.2 Self-study DA section: MaAsLin3

Run this section on your own time, not during the workshop.

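library(phyloseq)    # subset_samples(), sample_data()
library(microbiome)  # aggregate_taxa(), abundances(), meta()
library(maaslin3)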

# Inputs to MaAsLin3:
#   - feature table: samples x taxa, raw counts (it does its own normalization)
#   - metadata table: samples x variables
#   - output directory: where to write results and plots

ps_filt <- readRDS("phyloseq_filtered.rds")

# Aggregate to genus to make results more interpretable
ps_genus <- aggregate_taxa(ps_filt, level = "Genus")

# Restrict to the two main fermentation conditions
ps_test <- subset_samples(
  ps_genus,
  Fermentation %in% c("Spontaneous", "Lactobacillus")
)

# Drop unused factor levels so MaAsLin3 does not see empty groups
sample_data(ps_test)$Fermentation <- droplevels(
  factor(sample_data(ps_test)$Fermentation)
)

# Extract feature and metadata tables from the subset
feature_table  <- as.data.frame(t(microbiome::abundances(ps_test)))
metadata_table <- microbiome::meta(ps_test)

# Run MaAsLin3
fit <- maaslin3(
  input_data       = feature_table,
  input_metadata   = metadata_table,
  output           = "maaslin3_output",
  formula          = "~ Fermentation + Time_point",
  reference        = "Time_point,H24",
  normalization    = "TSS",
  transform        = "LOG",
  min_abundance    = 0,
  min_prevalence   = 0.1,
  cores            = 1
)

# The per-taxon results are in fit$fit_data_abundance$results and
# fit$fit_data_prevalence$results. The full results and the significant
# hits are also written as TSV files in the output directory.

Expected hits in this dataset (based on the published results): Lactiplantibacillus much higher in Lactobacillus-inoculated samples, Enterobacter and several other Enterobacteriaceae higher in spontaneous fermentations, and time-related changes within each fermentation type.
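
To check your own run against these expectations, a small helper along the following lines may be useful. It is a sketch under assumptions: top_hits() is a hypothetical helper, not part of MaAsLin3; the file name all_results.tsv and the column names metadata, qval_individual, and coef are what recent MaAsLin3 versions are assumed to write, so confirm them with names(results) first. The q-value cutoff of 0.1 is illustrative.

```r
# Hypothetical helper: filter a MaAsLin3-style results table to the
# significant effects of one variable, largest effects first.
# Assumed columns: metadata, qval_individual, coef.
top_hits <- function(results, variable = "Fermentation", q_cutoff = 0.1) {
  keep <- !is.na(results$qval_individual) &
    results$metadata == variable &
    results$qval_individual < q_cutoff
  hits <- results[keep, , drop = FALSE]
  hits[order(abs(hits$coef), decreasing = TRUE), , drop = FALSE]
}

# Assumed output location, matching output = "maaslin3_output" above
results_file <- file.path("maaslin3_output", "all_results.tsv")
if (file.exists(results_file)) {
  results <- read.delim(results_file, stringsAsFactors = FALSE)
  print(head(top_hits(results), 10))
}
```

Sorting by the absolute coefficient puts the strongest shifts at the top, whichever direction they go.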

8 What is next: TreeSummarizedExperiment and miaverse

The phyloseq object is mature and widely used, but the field is gradually shifting to a more general data structure called TreeSummarizedExperiment (TSE), built on Bioconductor’s SummarizedExperiment framework. The associated tooling lives in the miaverse ecosystem (mia, miaViz, miaTime, miaSim).

Why bother learning it?

  1. TSE handles multiple data types (counts, transformed values, metadata for taxa) in a single object more cleanly than phyloseq.
  2. It integrates with the broader Bioconductor multi-omics ecosystem.
  3. New methods are being developed against TSE rather than phyloseq.

The transition is not urgent. phyloseq remains supported and widely used, and most published code uses it. But if you are starting a long project today, TSE is worth investigating early. The miaverse documentation has a thorough tutorial book at https://microbiome.github.io/OMA/.
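
To get a first look at a TSE without leaving familiar ground, an existing phyloseq object can be converted. The sketch below assumes a recent mia release in which the converter is named convertFromPhyloseq() (older releases used a different name), and it uses the GlobalPatterns example data shipped with phyloseq rather than the workshop object. The guard skips the example if either package is missing.

```r
# Runs only if both packages are installed
if (requireNamespace("mia", quietly = TRUE) &&
    requireNamespace("phyloseq", quietly = TRUE)) {

  # Example phyloseq object shipped with the phyloseq package
  data("GlobalPatterns", package = "phyloseq")

  # Convert to TreeSummarizedExperiment
  # (function name assumed from recent mia versions)
  tse <- mia::convertFromPhyloseq(GlobalPatterns)

  # Counts are stored as a named assay; transformed values are added
  # alongside them in the same object rather than replacing them
  tse <- mia::transformAssay(tse, assay.type = "counts",
                             method = "relabundance")

  print(dim(tse))                               # taxa x samples
  print(SummarizedExperiment::assayNames(tse))  # names of stored assays
}
```

Holding raw and transformed tables side by side in one object is the practical form of the first advantage listed above.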

9 Resources

A short, opinionated list:

  1. Tutorial collections
  2. Other phyloseq extension packages
  3. Methods papers [CITATION NEEDED: confirm full references]
    • Callahan et al. (2016). DADA2: high-resolution sample inference from Illumina amplicon data. Nature Methods.
    • McMurdie & Holmes (2014). Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Computational Biology.
    • Gloor et al. (2017). Microbiome datasets are compositional: and this is not optional. Frontiers in Microbiology.
    • Nearing et al. (2022). Microbiome differential abundance methods produce different results across 38 datasets. Nature Communications.

10 A final question

When you sit down with your own data, ask yourself the same questions this workshop walked through, in the same order:

  1. What does my count table actually contain? Are there sequences I should be removing before any analysis?
  2. What controls do I have, and what do they tell me about contamination?
  3. Which transformation does each downstream step need?
  4. What does each distance assume? Do my conclusions depend on which one I picked?
  5. Which signal is so obvious it does not need a test, and which needs formal statistics to detect?
  6. What is the simplest plot that shows the answer, and what does it hide?

The mechanics will become routine. The questions are the work.