Code
library(phyloseq)
library(microbiome)
library(vegan)
library(dplyr)
library(tibble)
library(ggplot2)
library(patchwork) # for side-by-side plotsBeta diversity describes how communities differ between samples. The workflow has three steps that should always be considered together:
Each choice encodes assumptions about the data, and different choices can lead to different conclusions. In this block we work through two contrasting paths:
We will visualize both with PCoA and test both with PERMANOVA, then compare the conclusions.
library(phyloseq)
library(microbiome)
library(vegan)
library(dplyr)
library(tibble)
library(ggplot2)
library(patchwork) # for side-by-side plotsps_filt <- readRDS("phyloseq_filtered.rds")
ps_rel <- readRDS("phyloseq_relative.rds")The Bray-Curtis dissimilarity between two samples is the sum of the absolute differences in their taxon abundances, divided by the total abundance across both samples. Values range from 0 (identical) to 1 (no shared taxa). It is the most widely used dissimilarity in microbiome ecology and the default in many tools.
Two properties to keep in mind:
dist_bray <- phyloseq::distance(ps_rel, method = "bray")
# Inspect: it is a 'dist' object — pairwise distances between samples
class(dist_bray)
length(dist_bray) # should be n*(n-1)/2 for n samplesPrincipal Coordinates Analysis (PCoA) takes a distance matrix and finds a low-dimensional configuration of points whose pairwise Euclidean distances best approximate the input distances. The first two axes typically capture the dominant structure.
ord_bray <- ordinate(ps_rel, method = "PCoA", distance = dist_bray)
# Variance explained by each axis
ord_bray$values$Relative_eig[1:5] %>% round(3)p_bray <- plot_ordination(
ps_rel, ord_bray,
color = "Fermentation",
shape = "Time_point"
) +
geom_point(size = 3, alpha = 0.8) +
scale_shape_manual(
values = c(H24 = 16, H48 = 17, W1 = 15, W2 = 18),
na.value = 8, # asterisk for NA timepoints (references)
na.translate = TRUE # show NA in legend
) +
stat_ellipse(aes(group = Fermentation), level = 0.95, linetype = 2) +
labs(title = "Bray-Curtis (relative abundance)") +
theme_bw()
p_brayIn this dataset we expect a strong separation between fermentation types along the first axis, with the spontaneous samples showing greater spread (reflecting their continued community succession across time points) and the inoculated samples forming a tight cluster.
Microbiome count data are compositional: only the relative proportions of taxa carry information, not the absolute counts. The total number of reads in a sample is a property of the sequencing run, not of the underlying biology.
This has a counterintuitive consequence. Suppose taxon A is genuinely unchanged between two samples but taxon B truly increases tenfold. Because the proportions must always sum to 1, the proportion of A will appear to decrease in the second sample, even though nothing happened to it biologically.
Bray-Curtis on relative abundances does not solve this problem — it just computes differences in the proportions, which inherit the constraint. For most analyses this is acceptable, but it is worth knowing what you are committing to.
The centered log-ratio (CLR) transformation addresses this by transforming the data so that the values represent the log-ratio of each taxon to the geometric mean of all taxa in that sample. Once CLR-transformed, the data are no longer compositional and can be analyzed with standard methods (Euclidean distance, linear models, PCA).
Pseudocounts and zeros. The CLR transformation involves logarithms, which are undefined for zero. The standard solution is to add a small pseudocount before the log.
microbiome::transform()does this automatically. More sophisticated zero-replacement methods (e.g. Bayesian-multiplicative replacement in thezCompositionspackage) are available, but the simple pseudocount approach is fine for most exploratory work.
CLR is applied to the filtered count object, not to the relative abundance object. The transformation needs the raw counts (it internally normalizes them).
ps_clr <- microbiome::transform(ps_filt, "clr")
# Inspect: CLR values are real numbers, positive and negative,
# centered around zero per sample
clr_values <- microbiome::abundances(ps_clr)
summary(as.vector(clr_values))Aitchison distance is simply Euclidean distance computed on CLR-transformed data. It is a true metric (unlike Bray-Curtis), is symmetric in the way it treats taxa, and is the natural distance for compositional data.
# CLR-transformed OTU matrix: samples x taxa
# microbiome::abundances() always returns taxa as rows; transpose for dist()
clr_matrix <- t(microbiome::abundances(ps_clr))
dist_aitchison <- dist(clr_matrix, method = "euclidean")ord_aitch <- ordinate(ps_clr, method = "PCoA", distance = dist_aitchison)
ord_aitch$values$Relative_eig[1:5] %>% round(3)
p_aitch <- plot_ordination(
ps_clr, ord_aitch,
color = "Fermentation",
shape = "Time_point"
) +
geom_point(size = 3, alpha = 0.8) +
scale_shape_manual(
values = c(H24 = 16, H48 = 17, W1 = 15, W2 = 18),
na.value = 8, # asterisk for NA timepoints (references)
na.translate = TRUE # show NA in legend
) +
stat_ellipse(aes(group = Fermentation), level = 0.95, linetype = 2) +
labs(title = "Aitchison (CLR-transformed)") +
theme_bw()
p_aitchp_bray + p_aitch + plot_layout(guides = "collect")Questions to ask of the comparison:
PERMANOVA (Permutational Multivariate Analysis of Variance) tests whether the centroids of groups in distance space differ more than expected by chance. It is the standard formal test that goes with a beta diversity ordination.
The test takes any distance matrix and a model formula, permutes the group labels many times to build a null distribution, and reports an F-like statistic, an R² (the proportion of variation explained by each term), and a p-value.
The full dataset contains spontaneous fermentations, Lactobacillus-inoculated fermentations, plus three commercial product replicates (two of which had much lower sequencing depth than the rest) and a single unfermented brine baseline. The latter samples are useful as visual reference points on the PCoA — that’s why we kept them in the ordinations above — but they cannot be tested formally:
Fermentation * Time_point becomes singular when a level appears at only some timepoints.The standard fix is to subset the analysis to the two conditions that have replication across all timepoints — Spontaneous and Lactobacillus — and run PERMANOVA on that subset only.
# Subset the phyloseq object to the two replicated conditions
ps_test <- subset_samples(
ps_filt,
Fermentation %in% c("Spontaneous", "Lactobacillus")
)
# Recompute the distance matrices on the subset
dist_bray_test <- phyloseq::distance(
microbiome::transform(ps_test, "compositional"),
method = "bray"
)
ps_test_clr <- microbiome::transform(ps_test, "clr")
clr_test_matrix <- t(microbiome::abundances(ps_test_clr))
dist_aitch_test <- dist(clr_test_matrix, method = "euclidean")
# Metadata for the test
meta_test <- microbiome::meta(ps_test)set.seed(42) # PERMANOVA permutes — set a seed for reproducibility
adonis2(
dist_bray_test ~ Fermentation * Time_point,
data = meta_test,
permutations = 999,
by="terms"
)set.seed(42)
adonis2(
dist_aitch_test ~ Fermentation * Time_point,
data = meta_test,
permutations = 999,
by="terms"
)The output table has one row per term in the model. Look at:
For this dataset, the expected pattern is a large R² for Fermentation, a smaller but non-trivial R² for Time_point, and a small interaction. Both Bray-Curtis and Aitchison should agree on this ordering, although the absolute R² values will differ.
An assumption worth checking: dispersion. PERMANOVA can be sensitive to differences in within-group dispersion (some groups are tighter clusters than others). A significant PERMANOVA result with very different group dispersions can reflect dispersion differences rather than centroid differences. The
vegan::betadisper()function tests this. In this dataset the inoculated samples are visibly tighter than the spontaneous samples, so dispersion does differ — but the centroid separation is so large that the PERMANOVA conclusion is robust to it.
# Check dispersion homogeneity for the Fermentation factor,
# on the same subset used for PERMANOVA
disp_bray <- betadisper(
dist_bray_test,
group = meta_test$Fermentation
)
anova(disp_bray)
plot(disp_bray)The CLR object will not be needed in the remaining blocks, but is worth saving in case you want to reuse it.
saveRDS(ps_clr, file = "phyloseq_clr.rds")betadisper when in doubt, especially when group sizes are unequal.method = "jaccard", binary = TRUE) — a presence/absence-only distance. Plot the PCoA and run PERMANOVA. How does ignoring abundance information change the picture?phyloseq::ordinate() with method = "NMDS" instead of PCoA on the Bray-Curtis distance. NMDS is non-parametric and often interesting to compare to PCoA. Inspect the stress value.Time_point first in the formula (Time_point * Fermentation). Does the order change the result? Why or why not? (Hint: adonis2 defaults to sequential by = "terms".)microbiome::aggregate_taxa(ps_rel, "Genus") and recompute Bray-Curtis. Compare the PCoA to the ASV-level version. When would you prefer one over the other?