Code
library(phyloseq)
library(microbiome)
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
library(patchwork)Composition plots show what is in each sample. They are the most familiar microbiome visualization and the one most commonly seen in papers, but they are also the easiest to misuse. In this short block we will:
library(phyloseq)
library(microbiome)
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
library(patchwork)ps_rel <- readRDS("phyloseq_relative.rds")
ps_relASVs are useful for some analyses (alpha diversity, ordination at fine resolution) but rarely informative for visualization — there are too many of them and most readers cannot interpret an ASV identifier. For composition plots, aggregate to a level the audience can read.
microbiome::aggregate_taxa() collapses ASVs to the chosen level by summing their counts (or proportions, in our case).
ps_genus <- aggregate_taxa(ps_rel, level = "Genus")
ntaxa(ps_genus) # how many distinct genera in the dataset?If the number of genera is large (often more than 50), a stacked bar plot will be unreadable without further reduction. That is the next step.
A common convention is to keep the abundant or prevalent genera and group everything else as “Other”. The choice of threshold is a readability trade-off: too aggressive and the plot is dominated by Other; too permissive and the colour palette becomes uninterpretable.
The microbiome::aggregate_rare() function does this in one call. It keeps any taxon that exceeds both a minimum abundance threshold and a minimum prevalence threshold, and combines the rest into “Other”.
# Keep genera that reach at least 1% relative abundance in at least
# 10% of samples; group the rest as "Other"
ps_genus_top <- aggregate_rare(
ps_genus,
level = "Genus",
detection = 0.01, # 1% relative abundance
prevalence = 0.10 # in at least 10% of samples
)
ntaxa(ps_genus_top) # number of named genera + "Other"
taxa_names(ps_genus_top)The two thresholds answer different questions: detection filters out taxa that are uniformly low everywhere; prevalence filters out taxa that are abundant in only one or two samples. Adjust both depending on how busy your final plot is.
An alternative: top-N by abundance. If you prefer a fixed number of genera rather than thresholds, sort by
taxa_sums()and keep the top N, then re-aggregate.aggregate_rare()is more robust because it adapts to the actual distribution of your data, a 1% threshold is meaningful regardless of how many genera you have.
psmelt() produces a tidy long-format dataframe with one row per sample-taxon combination, with metadata attached. This is the simplest path to a ggplot-ready table.
comp_df <- psmelt(ps_genus_top)
head(comp_df)# Order genera so that "Other" sits at the bottom of each bar
genus_levels <- c(
"Other",
setdiff(unique(comp_df$OTU), "Other")
)
comp_df <- comp_df %>%
mutate(OTU = factor(OTU, levels = genus_levels))
part1 <- comp_df %>%
filter(!Fermentation %in% c("Commercial", "Control")) %>%
ggplot(aes(x = Replicate, y = Abundance, fill = OTU)) +
geom_col() +
facet_grid(Time_point ~ Fermentation,
scales = "free_x", space = "free_x") +
labs(x = NULL, y = "Relative abundance", fill = "Genus") +
theme_bw() +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
legend.position = "right"
)
part2 <- comp_df %>%
filter(Fermentation %in% c("Commercial", "Control")) %>%
ggplot(aes(x = Replicate, y = Abundance, fill = OTU)) +
geom_col() +
facet_grid(. ~ Fermentation,
scales = "free_x", space = "free_x") +
labs(x = NULL, y = "Relative abundance", fill = "Genus") +
theme_bw() +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
legend.position = "right"
)
part1 + part2 +
plot_layout(guides='collect') &
theme(legend.position='bottom')Why we use
OTUas the fill column.psmelt()always names the taxon columnOTU, regardless of what taxonomic level the object is aggregated to. Afteraggregate_rare()to genus level, the values in this column are genus names plus “Other”. The relabelling to “Genus” in the legend (fill = "Genus") makes the plot readable.
In this dataset the Lactobacillus-inoculated panel should be dominated by a single Lactiplantibacillus bar at every time point, the spontaneous panel should show a more diverse community at H24 that progressively narrows across time, and the controls should look quite different from both.
Stacked bar plots have three important blind spots:
The plot above sidesteps the third by showing all replicates rather than averaging. The first two are inherent to the format. When you need to compare a specific taxon across samples, a different visualization is more honest.
A heatmap shows the same data with each taxon-by-sample combination as a coloured cell. It makes it much easier to compare individual taxa across samples — the taxon stays in the same row throughout — at the cost of looking less “intuitive” to a non-specialist audience.
ggplot(
comp_df %>% filter(OTU != "Other"),
aes(x = Sample, y = OTU, fill = Abundance)
) +
geom_tile() +
facet_grid(~ Time_point,
scales = "free_x", space = "free_x") +
scale_fill_viridis_c(
trans = "sqrt",
name = "Relative\nabundance"
) +
labs(x = NULL, y = NULL) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))Two practical notes on the heatmap:
ComplexHeatmap package is the standard tool. The geom_tile approach above is enough for an exploratory look.microbiome::aggregate_rare() with abundance and prevalence thresholds to define what counts as “Other”. The two thresholds filter different things: low-everywhere taxa, and abundant-but-rare taxa.aggregate_rare() thresholds. What happens with detection = 0.05, prevalence = 0.20 (stricter)? What about detection = 0.001, prevalence = 0.05 (more permissive)?