Alise Ponsero
Alise Ponsero Bioinformatic Data Scientist at the Quadram Institute Bioscience, Norwich.

Compositional Data Analysis for Microbiome Studies

Compositional Data Analysis for Microbiome Studies

Microbiome data is inherently compositional - the counts we obtain from sequencing provide information about the relative abundance of taxa, not their absolute abundance in the original sample. This has two major implications:

  1. The Constant Sum Constraint: Changes in the abundance of one taxon necessarily affect the proportions of others, since proportions must sum to 1 (or 100%). For example, if one bacterial species doubles in absolute abundance while others remain constant, the relative abundances of all other species will appear to decrease.

  2. Loss of Absolute Information: Two samples can have identical relative abundances despite having vastly different total bacterial loads. This means we can’t directly interpret individual taxa abundances or their changes across conditions.

Enter Compositional Data Analysis (CoDa). The key idea is to transform our data using log-ratios, which preserve the relative relationships between components. The most common approach is the Centered Log-Ratio (CLR) transformation, where each component is divided by the geometric mean of all components before taking the logarithm.

Let’s see how this works in practice using the Human Microbiome Project (HMP) dataset.

Getting Started

First, let’s load our required libraries and prepare our data:

1
2
3
4
5
6
7
8
9
10
11
# Load required libraries
library(HMP16SData)  # For accessing HMP data
library(microbiome)  # For microbiome data manipulation
library(tidyverse)   # For data wrangling and visualization
library(compositions) # For compositional data analysis
library(vegan)       # For ecological analyses
library(ggrepel)     # For improved plot labels

# Load and filter HMP data
V35 <- V35()
ps <- as_phyloseq(V35)

Let’s clean up our dataset by:

  1. Removing samples with missing body site information
  2. Ensuring minimum sequencing depth
  3. Selecting specific body sites of interest
  4. Removing low-abundance taxa
1
2
3
4
5
6
ps <- subset_samples(ps, !is.na(HMP_BODY_SUBSITE))
ps <- prune_samples(sample_sums(ps) > 1000, ps)
body_sites <- c("Saliva", "Tongue Dorsum", 
                "Supragingival Plaque", "Palatine Tonsils")
ps <- subset_samples(ps, HMP_BODY_SUBSITE %in% body_sites)
ps <- prune_taxa(taxa_sums(ps) > 1, ps)

Traditional Analysis: Bray-Curtis PCoA

Let’s start with the traditional approach using Bray-Curtis dissimilarity. This method is widely used in ecology and microbiome studies:

1
2
3
4
5
# First, aggregate taxa at family level
ps.gen <- microbiome::aggregate_taxa(ps, level="GENUS")

# Convert to relative abundances
ps.gen.rel <- microbiome::transform(ps.gen, transform = "compositional") 

Now we’ll create our distance matrix and perform Principal Coordinates Analysis (PCoA):

1
2
3
4
5
6
7
# Prepare our data matrix
otu_table <- as.data.frame(t(otu_table(ps.gen.rel)))
metadata <- as.data.frame(sample_data(ps.gen.rel))

# Calculate Bray-Curtis distances and perform PCoA
bc_dist <- vegdist(otu_table, method = "bray")
bc_pcoa <- cmdscale(bc_dist, k = 2, eig = TRUE)

Let’s visualize our results:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
# Prepare data for plotting
pcoa_df <- data.frame(
  PC1 = bc_pcoa$points[,1],
  PC2 = bc_pcoa$points[,2],
  body_site = metadata$HMP_BODY_SUBSITE
)

# Calculate variance explained
bc_var_explained <- (bc_pcoa$eig/sum(bc_pcoa$eig))[1:2] * 100

# Create the plot
bc_plot <- ggplot(pcoa_df, aes(x = PC1, y = PC2, color = body_site)) +
  geom_point(alpha = 0.7, size = 3) +
  theme_minimal() +
  labs(title = "Bray-Curtis PCoA",
       x = sprintf("PCo1 (%.1f%%)", bc_var_explained[1]),
       y = sprintf("PCo2 (%.1f%%)", bc_var_explained[2])) +
  scale_color_brewer(palette = "Set2") +
  theme(legend.title = element_blank())

Compositional Approach: CLR Transform and PCA

Now let’s try the compositional approach. A key challenge in microbiome data is handling zeros - we can’t take the log of zero! Here’s how we handle it:

1
2
3
4
5
6
7
8
# Aggregate at genus level
ps.gen <- microbiome::aggregate_taxa(ps, level="GENUS")

# Handle zeros by adding a small pseudocount
ps.gen.pseudo <- microbiome::transform(ps.gen, transform = "pseudocount")

# Apply CLR transformation
ps.gen.clr <- microbiome::transform(ps.gen.pseudo, transform = "clr") 

With our CLR-transformed data, we can use standard PCA:

1
2
3
4
5
6
7
8
9
10
# Prepare data matrix
otu_table2 <- as.data.frame(t(otu_table(ps.gen.clr)))
pca_result <- prcomp(otu_table2, scale. = FALSE)

# Prepare plotting data
pca_df <- data.frame(
  PC1 = pca_result$x[,1],
  PC2 = pca_result$x[,2],
  body_site = metadata$HMP_BODY_SUBSITE
)

One advantage of the CLR approach is that we can create biplots showing both sample ordination and taxa contributions. Let’s create both visualizations:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
# Sample ordination plot
pca_scores_plot <- ggplot(pca_df, aes(x = PC1, y = PC2, color = body_site)) +
  geom_point(alpha = 0.7, size = 3) +
  theme_minimal() +
  labs(title = "Aitchison PCA - Sample Ordination",
       x = sprintf("PC1 (%.1f%%)", pca_var_explained[1]),
       y = sprintf("PC2 (%.1f%%)", pca_var_explained[2])) +
  scale_color_brewer(palette = "Set2") +
  theme(legend.title = element_blank())

# Prepare taxa contributions
loadings_df <- data.frame(
  Taxa = rownames(pca_result$rotation),
  PC1 = pca_result$rotation[,1],
  PC2 = pca_result$rotation[,2]
) %>%
  mutate(
    magnitude = sqrt(PC1^2 + PC2^2),
    label = str_replace(Taxa, "^([^_]+_[^_]+).*$", "\\1")
  ) %>%
  top_n(15, magnitude)  # Show top 15 contributing taxa

# Taxa contribution plot (biplot)
pca_loadings_plot <- ggplot(loadings_df) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey70") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey70") +
  geom_segment(aes(x = 0, y = 0, xend = PC1, yend = PC2),
               arrow = arrow(length = unit(0.2, "cm")),
               color = "grey50") +
  geom_text_repel(aes(x = PC1, y = PC2, label = label),
                  size = 3,
                  max.overlaps = Inf,
                  box.padding = 0.5) +
  theme_minimal() +
  labs(title = "Aitchison PCA - Taxa Contributions",
       x = "PC1 Loading",
       y = "PC2 Loading") +
  coord_fixed(ratio = 1)

Statistical Testing

We can test for differences between body sites using PERMANOVA:

1
2
3
4
5
6
7
8
# For Bray-Curtis distances
bc_permanova <- adonis2(bc_dist ~ metadata$HMP_BODY_SUBSITE,
                        permutations = 999)

# For CLR-transformed data
clr_dist <- dist(otu_table2, method = "euclidean")
clr_permanova <- adonis2(clr_dist ~ metadata2$HMP_BODY_SUBSITE,
                         permutations = 999)

Key Differences

  1. The CLR transformation preserves the relative relationships between taxa while making the data suitable for standard statistical methods
  2. With CLR-transformed data, we can create biplots that show both sample relationships and taxa contributions
  3. Zero handling is a crucial step in the CLR approach
  4. The CLR approach tends to be more stable when subsetting data or incorporating new taxa

comments powered by Disqus