4 Beta diversity, compositionality, and PERMANOVA

Beta diversity describes how communities differ between samples. The workflow has three steps that should always be considered together:

Choose a transformation. What scale should the data be on?
Choose a distance. What does it mean for two samples to be similar?
Choose a method to summarize the distances. Visualization (ordination), formal tests (PERMANOVA), or both.

Each choice encodes assumptions about the data, and different choices can lead to different conclusions. In this block we work through two contrasting paths:

Relative abundance + Bray-Curtis — the standard, straightforward choice. Treats each sample as a community of proportions and measures how different two communities are by the absolute differences in their proportions.
CLR-transformed counts + Aitchison distance — the compositional answer. Acknowledges that 16S data are inherently relative (only ratios between taxa are meaningful) and uses a transformation that makes the data behave like ordinary continuous numbers.

We will visualize both with PCoA and test both with PERMANOVA, then compare the conclusions.

5 Setup

Code

library(phyloseq)
library(microbiome)
library(vegan)
library(dplyr)
library(tibble)
library(ggplot2)
library(patchwork)  # for side-by-side plots

Code

ps_filt <- readRDS("phyloseq_filtered.rds")
ps_rel  <- readRDS("phyloseq_relative.rds")

6 Part 1: Bray-Curtis on relative abundances

6.1 What Bray-Curtis measures

The Bray-Curtis dissimilarity between two samples is the sum of the absolute differences in their taxon abundances, divided by the total abundance across both samples. Values range from 0 (identical) to 1 (no shared taxa). It is the most widely used dissimilarity in microbiome ecology and the default in many tools.

Two properties to keep in mind:

It is not a true metric in the mathematical sense (it does not satisfy the triangle inequality), which has practical consequences for some downstream methods. PCoA handles it well.
It is most informative when computed on relative abundances or on counts that have been scaled to comparable depth. Computing Bray-Curtis on raw counts mixes biological signal with sequencing depth differences.

6.2 Compute the Bray-Curtis distance matrix

Code

dist_bray <- phyloseq::distance(ps_rel, method = "bray")

# Inspect: it is a 'dist' object — pairwise distances between samples
class(dist_bray)
length(dist_bray)  # should be n*(n-1)/2 for n samples

6.3 Ordinate with PCoA

Principal Coordinates Analysis (PCoA) takes a distance matrix and finds a low-dimensional configuration of points whose pairwise Euclidean distances best approximate the input distances. The first two axes typically capture the dominant structure.

Code

ord_bray <- ordinate(ps_rel, method = "PCoA", distance = dist_bray)

# Variance explained by each axis
ord_bray$values$Relative_eig[1:5] %>% round(3)

Code

p_bray <- plot_ordination(
    ps_rel, ord_bray,
    color = "Fermentation",
    shape = "Time_point"
  ) +
  geom_point(size = 3, alpha = 0.8) +
  scale_shape_manual(
    values   = c(H24 = 16, H48 = 17, W1 = 15, W2 = 18),
    na.value = 8,             # asterisk for NA timepoints (references)
    na.translate = TRUE        # show NA in legend
  ) +
  stat_ellipse(aes(group = Fermentation), level = 0.95, linetype = 2) +
  labs(title = "Bray-Curtis (relative abundance)") +
  theme_bw()

p_bray

In this dataset we expect a strong separation between fermentation types along the first axis, with the spontaneous samples showing greater spread (reflecting their continued community succession across time points) and the inoculated samples forming a tight cluster.

7 Part 2: The compositional problem

Microbiome count data are compositional: only the relative proportions of taxa carry information, not the absolute counts. The total number of reads in a sample is a property of the sequencing run, not of the underlying biology.

This has a counterintuitive consequence. Suppose taxon A is genuinely unchanged between two samples but taxon B truly increases tenfold. Because the proportions must always sum to 1, the proportion of A will appear to decrease in the second sample, even though nothing happened to it biologically.

Bray-Curtis on relative abundances does not solve this problem — it just computes differences in the proportions, which inherit the constraint. For most analyses this is acceptable, but it is worth knowing what you are committing to.

The centered log-ratio (CLR) transformation addresses this by transforming the data so that the values represent the log-ratio of each taxon to the geometric mean of all taxa in that sample. Once CLR-transformed, the data are no longer compositional and can be analyzed with standard methods (Euclidean distance, linear models, PCA).

Pseudocounts and zeros. The CLR transformation involves logarithms, which are undefined for zero. The standard solution is to add a small pseudocount before the log. microbiome::transform() does this automatically. More sophisticated zero-replacement methods (e.g. Bayesian-multiplicative replacement in the zCompositions package) are available, but the simple pseudocount approach is fine for most exploratory work.

7.1 Apply the CLR transformation

CLR is applied to the filtered count object, not to the relative abundance object. The transformation needs the raw counts (it internally normalizes them).

Code

ps_clr <- microbiome::transform(ps_filt, "clr")

# Inspect: CLR values are real numbers, positive and negative,
# centered around zero per sample
clr_values <- microbiome::abundances(ps_clr)
summary(as.vector(clr_values))

7.2 Aitchison distance

Aitchison distance is simply Euclidean distance computed on CLR-transformed data. It is a true metric (unlike Bray-Curtis), is symmetric in the way it treats taxa, and is the natural distance for compositional data.

Code

# CLR-transformed OTU matrix: samples x taxa
# microbiome::abundances() always returns taxa as rows; transpose for dist()
clr_matrix <- t(microbiome::abundances(ps_clr))

dist_aitchison <- dist(clr_matrix, method = "euclidean")

7.3 Ordinate and plot

Code

ord_aitch <- ordinate(ps_clr, method = "PCoA", distance = dist_aitchison)

ord_aitch$values$Relative_eig[1:5] %>% round(3)

p_aitch <- plot_ordination(
    ps_clr, ord_aitch,
    color = "Fermentation",
    shape = "Time_point"
  ) +
  geom_point(size = 3, alpha = 0.8) +
  scale_shape_manual(
    values   = c(H24 = 16, H48 = 17, W1 = 15, W2 = 18),
    na.value = 8,             # asterisk for NA timepoints (references)
    na.translate = TRUE        # show NA in legend
  ) +
  stat_ellipse(aes(group = Fermentation), level = 0.95, linetype = 2) +
  labs(title = "Aitchison (CLR-transformed)") +
  theme_bw()

p_aitch

8 Part 3: Compare the two ordinations side by side

Code

p_bray + p_aitch + plot_layout(guides = "collect")

Questions to ask of the comparison:

Do the two methods agree on which samples cluster together? When the biological signal is strong (as in this dataset), yes — both should clearly separate the fermentation types.
Do they agree on the ordering of samples within a group? Often not exactly, because the methods weight rare and abundant taxa differently.
How much variance is captured on the first two axes in each? CLR tends to spread the variance across more axes than Bray-Curtis, which often concentrates variance on PCoA1.
Are there any samples that move noticeably between the two plots? These are samples whose position is sensitive to the compositional vs non-compositional treatment of the data — worth investigating.

9 Part 4: PERMANOVA

PERMANOVA (Permutational Multivariate Analysis of Variance) tests whether the centroids of groups in distance space differ more than expected by chance. It is the standard formal test that goes with a beta diversity ordination.

The test takes any distance matrix and a model formula, permutes the group labels many times to build a null distribution, and reports an F-like statistic, an R² (the proportion of variation explained by each term), and a p-value.

9.1 Subset to the comparable conditions before testing

The full dataset contains spontaneous fermentations, Lactobacillus-inoculated fermentations, plus three commercial product replicates (two of which had much lower sequencing depth than the rest) and a single unfermented brine baseline. The latter samples are useful as visual reference points on the PCoA — that’s why we kept them in the ordinations above — but they cannot be tested formally:

PERMANOVA tests differences between group centroids. A group with n = 1 has no within-group variation, so its “centroid” is simply that one sample’s coordinates.
The interaction term Fermentation * Time_point becomes singular when a level appears at only some timepoints.
The permutation null is degenerate for groups that always have the same single member regardless of label shuffling.

The standard fix is to subset the analysis to the two conditions that have replication across all timepoints — Spontaneous and Lactobacillus — and run PERMANOVA on that subset only.

Code

# Subset the phyloseq object to the two replicated conditions
ps_test <- subset_samples(
  ps_filt,
  Fermentation %in% c("Spontaneous", "Lactobacillus")
)

# Recompute the distance matrices on the subset
dist_bray_test  <- phyloseq::distance(
  microbiome::transform(ps_test, "compositional"),
  method = "bray"
)

ps_test_clr <- microbiome::transform(ps_test, "clr")
clr_test_matrix <- t(microbiome::abundances(ps_test_clr))
dist_aitch_test <- dist(clr_test_matrix, method = "euclidean")

# Metadata for the test
meta_test <- microbiome::meta(ps_test)

9.2 PERMANOVA on Bray-Curtis

Code

set.seed(42)  # PERMANOVA permutes — set a seed for reproducibility

adonis2(
  dist_bray_test ~ Fermentation * Time_point,
  data         = meta_test,
  permutations = 999,
  by="terms"
)

9.3 PERMANOVA on Aitchison

Code

set.seed(42)

adonis2(
  dist_aitch_test ~ Fermentation * Time_point,
  data         = meta_test,
  permutations = 999,
  by="terms"
)

9.4 Interpreting the output

The output table has one row per term in the model. Look at:

R² : proportion of variation in the distance matrix explained by that term. This is the most useful column for interpretation.
p-value (Pr(>F)) : significance of each term given the permutations.

For this dataset, the expected pattern is a large R² for Fermentation, a smaller but non-trivial R² for Time_point, and a small interaction. Both Bray-Curtis and Aitchison should agree on this ordering, although the absolute R² values will differ.

An assumption worth checking: dispersion. PERMANOVA can be sensitive to differences in within-group dispersion (some groups are tighter clusters than others). A significant PERMANOVA result with very different group dispersions can reflect dispersion differences rather than centroid differences. The vegan::betadisper() function tests this. In this dataset the inoculated samples are visibly tighter than the spontaneous samples, so dispersion does differ — but the centroid separation is so large that the PERMANOVA conclusion is robust to it.

Code

# Check dispersion homogeneity for the Fermentation factor,
# on the same subset used for PERMANOVA
disp_bray <- betadisper(
  dist_bray_test,
  group = meta_test$Fermentation
)
anova(disp_bray)
plot(disp_bray)

10 Part 5: Save the CLR-transformed object

The CLR object will not be needed in the remaining blocks, but is worth saving in case you want to reuse it.

Code

saveRDS(ps_clr, file = "phyloseq_clr.rds")

11 Take-home points

The choice of distance encodes assumptions. Bray-Curtis on relative abundances is the standard, familiar choice. Aitchison distance on CLR-transformed data is the principled answer for compositional data.
Run both when in doubt. If they agree, your conclusion is robust to the choice. If they disagree, you have learned something important about your data.
PCoA is the visualization, PERMANOVA is the test. Always report them together. R² is more interpretable than the p-value.
PERMANOVA does not separate centroid differences from dispersion differences. Run betadisper when in doubt, especially when group sizes are unequal.

12 Optional exercises

Compute Jaccard distance (method = "jaccard", binary = TRUE) — a presence/absence-only distance. Plot the PCoA and run PERMANOVA. How does ignoring abundance information change the picture?
Use phyloseq::ordinate() with method = "NMDS" instead of PCoA on the Bray-Curtis distance. NMDS is non-parametric and often interesting to compare to PCoA. Inspect the stress value.
Run PERMANOVA with Time_point first in the formula (Time_point * Fermentation). Does the order change the result? Why or why not? (Hint: adonis2 defaults to sequential by = "terms".)
Aggregate the data to genus level using microbiome::aggregate_taxa(ps_rel, "Genus") and recompute Bray-Curtis. Compare the PCoA to the ASV-level version. When would you prefer one over the other?