5  Composition visualisation

Composition plots show what is in each sample. They are the most familiar microbiome visualisation and the one most commonly seen in papers, but they are also the easiest to misuse. In this short block we will:

  1. Aggregate the data to a more interpretable taxonomic level (genus).
  2. Identify the dominant taxa and group the rest as “Other”.
  3. Draw a stacked bar plot.
  4. Discuss what stacked bars do and do not show, and look at one alternative (a heatmap).

6 Setup

library(phyloseq)
library(microbiome)
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
library(patchwork)
ps_rel <- readRDS("phyloseq_relative.rds")
ps_rel

7 Step 1: Aggregate to genus level

ASVs are useful for some analyses (alpha diversity, ordination at fine resolution) but rarely informative for visualisation: there are too many of them, and most readers cannot interpret an ASV identifier. For composition plots, aggregate to a level the audience can read.

microbiome::aggregate_taxa() collapses ASVs to the chosen level by summing their counts (or proportions, in our case).

ps_genus <- aggregate_taxa(ps_rel, level = "Genus")

ntaxa(ps_genus)  # how many distinct genera in the dataset?
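
To make the summing concrete, here is a minimal base-R sketch on a made-up table. The ASV names, genus labels and values are invented for illustration, and rowsum() only stands in for what happens to the abundance table; aggregate_taxa() also handles the taxonomy table, which this sketch ignores.

```r
# Toy relative-abundance table: 4 hypothetical ASVs (rows) x 3 samples (columns).
# All names and values are invented for illustration.
asv_abund <- matrix(
  c(0.50, 0.20, 0.10,
    0.10, 0.30, 0.20,
    0.30, 0.40, 0.60,
    0.10, 0.10, 0.10),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ASV", 1:4), paste0("S", 1:3))
)

# Hypothetical genus assignment of each ASV, in row order
asv_genus <- c("GenusA", "GenusA", "GenusB", "GenusC")

# Sum the rows that share a genus; the sample columns are left untouched
genus_abund <- rowsum(asv_abund, group = asv_genus)
genus_abund

# Aggregation only regroups abundance, so every sample still sums to 1
colSums(genus_abund)
```

The two ASVs assigned to GenusA are merged into one row, so the table shrinks from four rows to three while the per-sample totals stay the same.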

If the number of genera is large (often more than 50), a stacked bar plot will be unreadable without further reduction. That is the next step.

8 Step 2: Group rare genera into “Other”

A common convention is to keep the abundant or prevalent genera and group everything else as “Other”. The choice of threshold is a readability trade-off: too aggressive and the plot is dominated by Other; too permissive and the colour palette becomes uninterpretable.

The microbiome::aggregate_rare() function does this in one call. The two thresholds work together: a taxon is kept if its abundance exceeds the detection threshold in more than the given fraction of samples (the prevalence threshold). Everything else is combined into “Other”.

# Keep genera that reach at least 1% relative abundance in at least
# 10% of samples; group the rest as "Other"
ps_genus_top <- aggregate_rare(
  ps_genus,
  level     = "Genus",
  detection = 0.01,   # 1% relative abundance
  prevalence = 0.10   # in at least 10% of samples
)

ntaxa(ps_genus_top)  # number of named genera + "Other"
taxa_names(ps_genus_top)

The two thresholds answer different questions: detection filters out taxa that are uniformly low everywhere; prevalence filters out taxa that are abundant in only one or two samples. Adjust both depending on how busy your final plot is.
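
The interplay of the two thresholds can be sketched in base R on an invented table. This is an illustration of the rule, not the package's implementation; the genus names, the values and the 25% prevalence cut-off (chosen so that one sample out of five is not enough) are assumptions made for the example.

```r
# Toy genus-level table (rows = genera, columns = samples); values are invented.
genus_abund <- matrix(
  c(0.600, 0.550, 0.700, 0.500, 0.650,   # GenusA: abundant everywhere
    0.300, 0.000, 0.000, 0.000, 0.000,   # GenusB: abundant in one sample only
    0.005, 0.005, 0.005, 0.005, 0.005,   # GenusC: uniformly low
    0.095, 0.445, 0.295, 0.495, 0.345),  # GenusD: variable but always present
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("Genus", LETTERS[1:4]), paste0("S", 1:5))
)

detection      <- 0.01  # 1% relative abundance
prevalence_min <- 0.25  # illustrative cut-off: more than 25% of samples

# Prevalence = fraction of samples in which the genus exceeds the detection level
prev <- rowMeans(genus_abund > detection)
prev

# Keep genera whose prevalence exceeds the cut-off; pool the rest as "Other"
keep <- prev > prevalence_min
collapsed <- rbind(
  genus_abund[keep, , drop = FALSE],
  Other = colSums(genus_abund[!keep, , drop = FALSE])
)
collapsed
```

GenusC never passes the detection level, so its prevalence is zero. GenusB passes it, but only in one sample, so it fails on prevalence. Both end up in “Other”, and each sample still sums to 1.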

An alternative: top-N by abundance. If you prefer a fixed number of genera rather than thresholds, sort the genera by taxa_sums(), keep the top N, and merge the remainder into “Other”. aggregate_rare() is more robust because it adapts to the actual distribution of your data: a 1% threshold is meaningful regardless of how many genera you have.
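
A minimal sketch of the top-N idea in base R, on an invented table. Here rowSums() plays the role that taxa_sums() plays on a phyloseq object; the genus names, values and the choice of N = 2 are assumptions made for the example.

```r
# Toy genus-level table (rows = genera, columns = samples); values are invented.
genus_abund <- matrix(
  c(0.50, 0.40, 0.60,
    0.30, 0.30, 0.20,
    0.15, 0.20, 0.10,
    0.05, 0.10, 0.10),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("Genus", LETTERS[1:4]), paste0("S", 1:3))
)

top_n <- 2  # illustrative choice

# Rank genera by total abundance across samples and take the first N names
totals     <- rowSums(genus_abund)
top_genera <- names(sort(totals, decreasing = TRUE))[seq_len(top_n)]

# Keep the top genera as they are and pool everything else into "Other"
is_top    <- rownames(genus_abund) %in% top_genera
top_table <- rbind(
  genus_abund[is_top, , drop = FALSE],
  Other = colSums(genus_abund[!is_top, , drop = FALSE])
)
top_table
```

Unlike the threshold approach, the size of “Other” here depends entirely on N: the same N can leave a small remainder in one dataset and a large one in another.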

9 Step 3: Build the long-format dataframe for plotting

psmelt() produces a tidy long-format dataframe with one row per sample-taxon combination, with metadata attached. This is the simplest path to a ggplot-ready table.

comp_df <- psmelt(ps_genus_top)
head(comp_df)
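
If the long format is unfamiliar, the following base-R sketch shows the same reshaping on an invented table. It mirrors the OTU, Sample and Abundance columns that psmelt() returns; the values and the Fermentation labels are made up for illustration.

```r
# Toy wide table: rows = taxa, columns = samples (values are invented).
genus_abund <- matrix(
  c(0.7, 0.2, 0.1,    # sample S1
    0.4, 0.5, 0.1),   # sample S2
  nrow = 3,
  dimnames = list(c("GenusA", "GenusB", "Other"), c("S1", "S2"))
)

# Hypothetical sample metadata, one row per sample
sample_info <- data.frame(
  Sample       = c("S1", "S2"),
  Fermentation = c("Spontaneous", "Inoculated")
)

# Wide -> long: one row per taxon-sample combination
long_df <- as.data.frame(
  as.table(genus_abund),
  responseName     = "Abundance",
  stringsAsFactors = FALSE
)
names(long_df)[1:2] <- c("OTU", "Sample")

# Attach the metadata so every row carries its sample's variables
long_df <- merge(long_df, sample_info, by = "Sample")
long_df
```

Three taxa in two samples give six rows, and the metadata values are repeated on every row of the same sample. That repetition is what lets ggplot2 map fill, facets and axes directly from columns.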

10 Step 4: The stacked bar plot

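# Order genera so that "Other" sits at the bottom of each bar.
# ggplot2 stacks fill levels in legend order with the first level on top,
# so "Other" has to be the last level.
genus_levels <- c(
  setdiff(unique(comp_df$OTU), "Other"),
  "Other"
)
comp_df <- comp_df %>%
  mutate(OTU = factor(OTU, levels = genus_levels))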

part1 <- comp_df %>%
  filter(!Fermentation %in% c("Commercial", "Control")) %>% 
  ggplot(aes(x = Replicate, y = Abundance, fill = OTU)) +
  geom_col() +
  facet_grid(Time_point ~ Fermentation,
             scales = "free_x", space = "free_x") +
  labs(x = NULL, y = "Relative abundance", fill = "Genus") +
  theme_bw() +
  theme(
    axis.text.x  = element_text(angle = 90, hjust = 1, vjust = 0.5),
    legend.position = "right"
  )

part2 <- comp_df %>%
  filter(Fermentation %in% c("Commercial", "Control")) %>% 
  ggplot(aes(x = Replicate, y = Abundance, fill = OTU)) +
  geom_col() +
  facet_grid(. ~ Fermentation,
             scales = "free_x", space = "free_x") +
  labs(x = NULL, y = "Relative abundance", fill = "Genus") +
  theme_bw() +
  theme(
    axis.text.x  = element_text(angle = 90, hjust = 1, vjust = 0.5),
    legend.position = "right"
  )

# Combine the two panels, share one legend, and place it below the plots
part1 + part2 +
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom")

Why we use OTU as the fill column. psmelt() always names the taxon column OTU, regardless of what taxonomic level the object is aggregated to. After aggregate_rare() to genus level, the values in this column are genus names plus “Other”. The relabelling to “Genus” in the legend (fill = "Genus") makes the plot readable.

In this dataset the Lactobacillus-inoculated panel should be dominated by a single Lactiplantibacillus band at every time point. The spontaneous panel should show a more diverse community at H24 that progressively narrows over time, and the controls should look quite different from both.

11 Step 5: What stacked bars do and do not show

Stacked bar plots have three important blind spots:

  1. They make it hard to compare individual taxa across samples. The eye reads the position of the colour band, but the band’s vertical position depends on what is below it — so the same taxon at the same abundance can appear at different heights in different samples.
  2. They flatten low-abundance taxa. Anything below a few percent becomes a thin sliver indistinguishable from noise in the plot, even if it is biologically important.
  3. They do not show variation within a group. With several replicates per condition, a stacked bar per replicate is just visual repetition; a stacked bar showing group means hides the variance entirely.

The plot above sidesteps the third by showing all replicates rather than averaging. The first two are inherent to the format. When you need to compare a specific taxon across samples, a different visualisation is more honest.
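
The first blind spot is easy to reproduce with a few numbers. The sketch below uses two invented samples in which one genus has exactly the same abundance, and computes where its band would be drawn; the names and values are assumptions made for the example.

```r
# Two hypothetical samples; GenusB has the same abundance (0.20) in both.
# Rows are listed from the bottom of the bar to the top.
comp <- matrix(
  c(0.10, 0.20, 0.70,    # sample S1
    0.60, 0.20, 0.20),   # sample S2
  nrow = 3,
  dimnames = list(c("Other", "GenusB", "GenusA"), c("S1", "S2"))
)

# The upper edge of each band is the running total up the bar;
# the lower edge is the upper edge minus the band's own height.
upper <- apply(comp, 2, cumsum)
dimnames(upper) <- dimnames(comp)
lower <- upper - comp

# Where does the GenusB band sit in each sample?
rbind(lower = lower["GenusB", ], upper = upper["GenusB", ])
```

The band is equally thick in both samples, but in S1 it starts at 0.10 and in S2 at 0.60. A reader comparing the two bars has to judge thickness while ignoring position, which is exactly what the eye is bad at.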

12 Step 6: A heatmap as an alternative

A heatmap shows the same data with each taxon-by-sample combination as a coloured cell. It makes it much easier to compare individual taxa across samples — the taxon stays in the same row throughout — at the cost of looking less “intuitive” to a non-specialist audience.

ggplot(
    comp_df %>% filter(OTU != "Other"),
    aes(x = Sample, y = OTU, fill = Abundance)
  ) +
  geom_tile() +
  facet_grid(~ Time_point,
             scales = "free_x", space = "free_x") +
  scale_fill_viridis_c(
    trans = "sqrt",
    name  = "Relative\nabundance"
  ) +
  labs(x = NULL, y = NULL) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

Two practical notes on the heatmap:

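  1. The square-root scale on the colour axis matters. Microbiome abundances are heavily right-skewed: a few taxa dominate, most are rare. On a linear colour scale almost every cell gets nearly the same colour. Square-root or log scales give more of the colour range to low abundances, so differences among rare taxa stay visible. A square-root scale also copes with zeros, which a log scale cannot do without a pseudocount.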
  2. For a publication-quality heatmap with row/column dendrograms and annotation tracks, the ComplexHeatmap package is a widely used choice. The geom_tile approach above is enough for an exploratory look.
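
The effect of the square-root scale can be checked numerically. The sketch below takes a few invented abundances and computes how far along the colour bar each one lands (0 = lowest colour, 1 = highest) on a linear and on a square-root scale, assuming the scale limits are the data range. The helper rescale01() is defined here for illustration and is not part of ggplot2.

```r
# A few invented relative abundances, from absent to dominant
abund <- c(0, 0.001, 0.01, 0.1, 0.9)

# Hypothetical helper: rescale values to the 0-1 range of the colour bar
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

linear_pos <- rescale01(abund)        # position on a linear colour scale
sqrt_pos   <- rescale01(sqrt(abund))  # position on a square-root colour scale

round(data.frame(abund, linear_pos, sqrt_pos), 3)

# A log scale would need a pseudocount: the zero has no finite logarithm
log10(abund)
```

With these numbers, a taxon at 10% sits about a ninth of the way along the linear scale but a third of the way along the square-root scale, so it is much easier to tell apart from a taxon at 1% or 0.1%.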

13 Take-home points

  1. Aggregate to a readable taxonomic level before plotting composition. ASV-level stacked bars are unreadable.
  2. Use microbiome::aggregate_rare() with abundance and prevalence thresholds to define what counts as “Other”. The two thresholds filter different things: taxa that are low everywhere, and taxa that are abundant in only a few samples.
  3. Stacked bars are familiar but flatten variation. When the question is “did this specific taxon change?”, a heatmap or a per-taxon point plot is more honest.
  4. Match the plot to the question. Composition plots answer “what is in this sample”; ordinations answer “how do samples differ overall”; differential abundance methods answer “which taxa drive the difference”.

14 Optional exercises

  1. Generate the composition plot at family level instead of genus. Which level tells the clearer story for this dataset?
  2. Try different aggregate_rare() thresholds. What happens with detection = 0.05, prevalence = 0.20 (stricter)? What about detection = 0.001, prevalence = 0.05 (more permissive)?
  3. Plot relative abundance of Lactiplantibacillus alone across all samples as a faceted point plot (one panel per fermentation type, x-axis Time_point, y-axis abundance). Compare what this view shows that the stacked bar does not.