2  Filtering rare ASVs and abundance transformations

In Block 1 we cleaned the phyloseq object by removing non-target sequences and contaminants. In this block we make two further decisions that prepare the data for diversity and composition analysis:

  1. Filtering rare ASVs : remove ASVs that contribute mainly noise. Two common approaches exist (singleton removal and prevalence filtering); the choice depends on the dataset and the downstream analyses.
  2. Abundance transformation : convert raw counts to relative abundances. This is the simplest transformation and the one needed for composition plots and Bray-Curtis distances. Other transformations (notably the centered log-ratio, CLR) will be introduced in Block 4 when we discuss compositionality and Aitchison distances.

A short note on what we are not doing: rarefaction. Rarefying samples to even sequencing depth is common in the literature, but for our analyses the methods we use either accept relative abundances directly (Bray-Curtis) or transform counts compositionally (Aitchison via CLR). We will not rarefy. The optional exercise in Block 3 lets you check whether this choice changes the alpha diversity conclusions on this dataset.

3 Setup

Code
library(phyloseq)
library(microbiome)
library(dplyr)
library(tibble)
library(ggplot2)

4 Load the cleaned phyloseq object from Block 1

Code
ps <- readRDS("phyloseq_clean.rds")
ps

5 Step 1: Filtering rare ASVs

5.1 Why filter rare ASVs at all?

After DADA2 and basic decontamination, a typical 16S dataset still contains ASVs that are present at very low abundance and in very few samples. These have several origins:

  1. Genuine rare community members at very low abundance.
  2. Cross-contamination between samples on the sequencing run (well-to-well leakage, index hopping).
  3. Surviving sequencing errors that were close enough to a real sequence to escape the DADA2 error model but distant enough to be called as distinct ASVs.

Some of these we want to keep (genuine rare members), some we want to remove (errors and noise). The challenge is that we cannot distinguish them from a single sequencing run — we have to pick a heuristic and apply it consistently.

5.2 Two common filtering strategies

There are two filters that come up repeatedly in the literature, and they make different trade-offs.

Singleton removal — the conservative option

Remove only ASVs with the lowest possible read support: those with exactly one read across the entire dataset, or some similarly low total. Singletons are very likely to be sequencing errors that escaped DADA2’s chimera filter, and removing them costs almost nothing biologically. This is the default we use in this workshop.

Prevalence filtering — the more aggressive option

Remove ASVs that are present in fewer than a threshold number of samples (e.g. fewer than 2 samples, or fewer than 10% of samples). This removes more ASVs and is appropriate when:

  1. You are doing differential abundance testing — rare-ASV DA tests inflate multiple-testing burden without contributing power.
  2. The dataset is large enough that “present in ≥10% of samples” still leaves many ASVs per group.
  3. The downstream method is destabilized by sparsity.

For small, heterogeneous datasets like ours (28 samples, with biological succession producing many transient ASVs), prevalence filtering can remove genuine biological variation. An ASV that appears in only one timepoint of one fermentation type may still be a real bloom, not noise.

5.3 Visualize prevalence vs abundance before filtering

A prevalence-vs-abundance plot is the clearest diagnostic. For each ASV we compute:

  1. Prevalence: the number of samples in which it has at least one read.
  2. Mean abundance: how abundant it is in the samples where it does occur.
Code
prev_df <- data.frame(
  ASV        = taxa_names(ps),
  Prevalence = microbiome::prevalence(ps, detection = 0, sort = FALSE, count = TRUE),
  TotalReads = taxa_sums(ps),
  Phylum     = as.character(tax_table(ps)[, "Phylum"])
) %>%
  mutate(
    PrevalenceFrac = Prevalence / nsamples(ps)
  )

ggplot(prev_df, aes(x = TotalReads, y = PrevalenceFrac, colour = Phylum)) +
  geom_point(alpha = 0.6) +
  scale_x_log10() +
  geom_vline(xintercept = 1.5, linetype = "solid",
             colour = "#bf4f2f") +
  geom_hline(yintercept = 0.10, linetype = "dashed",
             colour = "grey40") +
  labs(
    x = "Total reads across all samples (log scale)",
    y = "Prevalence (fraction of samples)"
  ) +
  theme_minimal()

The solid vertical line marks the singleton threshold (taxa_sums > 1) that we apply below. The dashed horizontal line marks where a 10% prevalence filter would cut, for reference. Most ASVs sit at very low prevalence — that’s characteristic of 16S data: a long tail of rare detections.

5.4 Apply singleton removal

For this workshop we use a singleton filter: remove any ASV whose total read count across the entire dataset is 1. Anything detected at least twice — even if those two reads are in the same sample — is kept.

Code
n_before <- ntaxa(ps)

ps_filt <- prune_taxa(taxa_sums(ps) > 1, ps)

n_after <- ntaxa(ps_filt)
message(sprintf(
  "Singleton filter: %d -> %d ASVs (%.1f%% retained)",
  n_before, n_after, 100 * n_after / n_before
))

# How many reads did we lose?
reads_before <- sum(otu_table(ps))
reads_after  <- sum(otu_table(ps_filt))
message(sprintf(
  "Reads retained: %.4f%%",
  100 * reads_after / reads_before
))

This filter typically removes a noticeable number of ASVs (singletons are common in 16S data) but retains essentially all of the read count. The ASVs we are removing are at the very bottom of the abundance distribution.

What if your data is destined for differential abundance testing? A stricter prevalence-based filter is appropriate. The most common choice is to require an ASV to be present in at least 10% of samples (in our case, 3 samples) before being tested. Apply this for the DA step only, not for diversity or composition analysis. The optional exercise at the end of this block walks through this alternative.

5.5 Document the filter you chose

Whichever filter you use, record the choice in your analysis notebook. The exact threshold matters less than the fact that you can defend it.

6 Step 2: Abundance transformation

6.1 Why transform?

Raw read counts cannot be compared directly between samples because sequencing depth varies. A sample with 50,000 reads will have higher counts for every ASV than a sample with 10,000 reads, regardless of any biological difference. We need to put the samples on a comparable scale before computing distances or making composition plots.

The simplest transformation is relative abundance (also called “compositional” or “total-sum scaling”): divide each count by the sample’s total, so every sample sums to 1. This is the right transformation when the question is “what fraction of this community is taxon X”. It is the standard input for stacked bar plots, Bray-Curtis distances, and most visualization.

Relative abundance is not the right transformation for every question. In particular, it does not address the compositional nature of the data: because the values must sum to 1, an apparent increase in one taxon could be caused by a real increase in that taxon, a real decrease in another, or both. The CLR transformation (introduced in Block 4) addresses this. For now, we just need relative abundances for the upcoming alpha diversity and composition work.

6.2 Other transformations available in microbiome::transform()

The microbiome package wraps several transformations in a single function. The most useful are:

  1. "compositional" — relative abundance, sums to 1
  2. "clr" — centered log-ratio; the compositional answer (Block 4)
  3. "hellinger" — square root of relative abundance; useful for some ordinations
  4. "log10" — log10(x) with a small pseudocount
  5. "Z" — z-score per taxon

We will use "compositional" here and "clr" in Block 4. The others are good to know about for your own work.

6.3 Apply the relative abundance transform

Code
ps_rel <- microbiome::transform(ps_filt, "compositional")

# Sanity check: every sample should now sum to 1
sample_sums(ps_rel) %>% summary()

The summary should show all values close to 1.0. Small deviations from exactly 1 are floating-point noise and not a problem.

7 Step 3: Save the filtered and transformed objects

We now have two phyloseq objects to carry forward:

  1. ps_filt — filtered counts, used wherever raw counts are needed (alpha diversity, anything that models count data directly)
  2. ps_rel — filtered relative abundances, used for composition plots and Bray-Curtis ordination
Code
saveRDS(ps_filt, file = "phyloseq_filtered.rds")
saveRDS(ps_rel,  file = "phyloseq_relative.rds")

8 Take-home points

  1. Filtering rare ASVs is necessary, but the right threshold depends on the dataset and the downstream analysis. Singleton removal is the conservative default; prevalence filtering is more aggressive and appropriate for differential abundance testing.
  2. In small heterogeneous datasets, aggressive prevalence filters can remove biologically real variation. Singleton removal preserves the tail of rare-but-detected taxa.
  3. Counts must be normalized before comparison across samples. Relative abundance is the simplest and most familiar choice, appropriate for visualization and Bray-Curtis distances.
  4. Different downstream methods need different inputs. Keep both the filtered count object and the relative-abundance object available; we will add a CLR-transformed version in Block 4.
  5. Rarefaction is not needed for any method we will use here. The optional exercise in Block 3 lets you check whether it changes the alpha diversity conclusions for this dataset.

9 Optional exercises

  1. Try a 10% prevalence filter as you would for differential abundance. Apply prune_taxa(microbiome::prevalence(ps_filt, detection = 0, sort = FALSE, count = TRUE) >= ceiling(0.1 * nsamples(ps_filt)), ps_filt) and compare. How many ASVs survive? Which dominant taxa are unchanged, and which rare ones disappear?
  2. Apply the Hellinger transformation to ps_filt. Inspect the resulting values: what range are they in, and how does it differ from raw relative abundances?