1 From DADA2 outputs to a clean phyloseq object

In this first block we will take the raw outputs of DADA2 and construct a phyloseq object that is ready for downstream analysis. The work is split into three logical steps:

Construction — assemble the count table, taxonomy table, and sample metadata into a single phyloseq object.
Quality assessment and cleaning — investigate the taxonomic annotation, remove non-target sequences (chloroplasts, mitochondria, poorly classified ASVs), and filter contaminants identified from the extraction blank.
Saving — write the cleaned object to disk as an .rds file so that subsequent blocks can start from a clean, reproducible state.

The dataset we will use comes from a study of Romanian-style cucumber fermentations comparing spontaneous fermentation with inoculation by Lactiplantibacillus plantarum IBB082, with sampling at four time points.

2 Setup

Code

library(phyloseq)
library(microbiome)
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
library(readr)

3 Step 1: Loading the DADA2 outputs

The DADA2 pipeline produced three files for this workshop:

metabarcoding_dada2_asvtax.csv : combined ASV table containing both the taxonomy (Kingdom through Genus) and the count matrix (one column per sample). Rows are ASVs.
metabarcoding_dada2_QC.csv : read-tracking statistics from each step of the DADA2 pipeline (input, filtered, denoised, merged, non-chimeric). Rows are samples.
metabarcoding_metadata_2026.tsv : sample metadata. This is the ENA submission sheet, so it has many columns; we will keep only what we need.

Note. The structure of the ASV file — taxonomy and counts in one wide table — is one common format. Other DADA2 workflows produce a separate matrix and a separate taxonomy table. The first thing to do with any new dataset is read the file and figure out what each column represents.

Code

# Combined taxonomy + count table from DADA2.
# The first column has the ASV identifier (row name).
asvtax <- read_csv("Inputs/metabarcoding_dada2_asvtax.csv") %>%
  rename(ASV_ID = `...1`)

# DADA2 read-tracking statistics
stats <- read_csv("Inputs/metabarcoding_dada2_QC.csv") %>%
  rename(sample_alias = `...1`)

# Sample metadata (ENA submission sheet)
metadata <- read_tsv("Inputs/metabarcoding_metadata_2026.tsv")

3.1 Inspect the inputs before doing anything else

It is good practice to check the dimensions and structure of each table before combining them. Mismatched sample names between the count table and the metadata are the single most common reason phyloseq object construction fails.

Code

# How many ASVs and how wide is the table?
dim(asvtax)
colnames(asvtax)[1:10]   # first few columns: ASV_ID + 6 taxonomy columns + samples

# How many samples in the QC and metadata files?
dim(stats)
dim(metadata)

# Are the sample names consistent across files?
sample_cols <- setdiff(colnames(asvtax),
                       c("ASV_ID", "Kingdom", "Phylum", "Class", "Order",
                         "Family", "Genus"))
all(sample_cols %in% metadata$sample_alias)
all(sample_cols %in% stats$sample_alias)

3.2 Split the combined file into a count matrix and a taxonomy matrix

Code

# Taxonomy: ASV_ID + 6 taxonomic ranks
taxa <- asvtax %>%
  select(ASV_ID, Kingdom, Phylum, Class, Order, Family, Genus) %>%
  column_to_rownames("ASV_ID")

# Count matrix: ASV_ID + sample columns. We want samples as rows for
# phyloseq, so we transpose after extracting.
counts <- asvtax %>%
  select(-Kingdom, -Phylum, -Class, -Order, -Family, -Genus) %>%
  column_to_rownames("ASV_ID")  # rows = ASVs

# Transpose so rows are samples
seqtab <- t(as.matrix(counts))
dim(seqtab)        # samples x ASVs

3.3 Build a clean metadata table with the variables we need

The metadata file has 25 columns of ENA submission information. For our analysis we need only a few. We will also derive two variables — Fermentation and Time_point — from the structured sample_alias field (e.g. Lactobacillus_H24_R1 → Fermentation = “Lactobacillus”, Time_point = “H24”, Replicate = “R1”).

Code

metadata_clean <- metadata %>%
  select(sample_alias, sample_description) %>%
  separate(sample_alias,
           into  = c("Fermentation", "Time_point", "Replicate"),
           sep   = "_",
           remove = FALSE,
           fill  = "right") %>%
  # Fix the non-standard names: for Commercial_R* and Control_*
  mutate(
    Replicate  = case_when(
      Fermentation == "Commercial"               ~ Time_point,
      Fermentation == "Control"                  ~ NA_character_,
      TRUE                                       ~ Replicate
    ),
    Time_point = case_when(
      Fermentation %in% c("Commercial", "Control") ~ NA_character_,
      TRUE                                         ~ Time_point
    )
  ) %>%
  left_join(stats, by = "sample_alias") %>%
  column_to_rownames("sample_alias")

head(metadata_clean)

What separate() does. tidyr::separate() splits a single column into several based on a delimiter. Here we split the structured sample_alias (e.g. Lactobacillus_H24_R1) on the underscore, into Fermentation, Time_point, and Replicate. The Control_extraction and Control_fermentation samples have only two parts so the Replicate column is NA for them.

4 Step 2: Building the phyloseq object

A phyloseq object combines up to four components: an OTU/ASV table, a taxonomy table, sample metadata, and (optionally) a phylogenetic tree. We will build it from the first three; this dataset does not include a tree, which means we will not be able to compute phylogenetic distances such as UniFrac. Bray-Curtis and Aitchison distances (covered in Block 4) do not require a tree.

Code

# Convert each component to the phyloseq class it expects.
otu  <- otu_table(seqtab, taxa_are_rows = FALSE)
tax  <- tax_table(as.matrix(taxa))
samp <- sample_data(metadata_clean)

# Combine into a single phyloseq object.
ps <- phyloseq(otu, tax, samp)
ps

4.1 Initial inspection of the phyloseq object

Code

# How many ASVs and samples?
ntaxa(ps)
nsamples(ps)

# Distribution of read depth across samples
sample_sums(ps) %>% summary()

# Plot the read depth — useful for spotting failed samples
microbiome::meta(ps) %>%
  rownames_to_column("sample_alias") %>%
  rename("Reads"="nonchim") %>%
  ggplot(aes(x = reorder(sample_alias, Reads), y = Reads)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Total reads") +
  theme_minimal()

This plot makes it immediately obvious which samples have very different sequencing depths. We expect to see at least one extraction blank with a much lower read count than the biological samples, and a couple of commercial control samples that came in with much lower input.

5 Step 3: Quality assessment and cleaning

We now apply a series of conservative filters. Each filter is justified by a specific quality concern:

ASVs unclassified at high taxonomic ranks are likely sequencing artefacts or chimeras that escaped the DADA2 chimera filter, or off-target amplification.
Chloroplast and mitochondrial 16S sequences are abundant in plant- and animal-derived samples (cucumbers, in our case) but are not bacterial community members.
Low-biomass samples such as extraction blanks pick up contaminating DNA from kits and the lab environment. ASVs that appear at non-trivial read counts in the blanks should be considered suspect in the biological samples.

The order matters: we do the taxonomy-based filters first, then the contamination filter, so that the contamination assessment happens on a cleaner ASV set.

5.1 3a. Assess the quality of the taxonomic annotation

Before filtering, it is worth looking at how complete the taxonomic assignments are at each rank. A high proportion of NA at the Kingdom or Phylum level usually indicates either a database mismatch (wrong reference, wrong region) or sequencing noise.

Code

tax_df <- as.data.frame(tax_table(ps))

# Proportion of ASVs unclassified (NA) at each rank
tax_completeness <- sapply(tax_df, function(x) mean(is.na(x) | x == ""))
tax_completeness

# Same as a plot
data.frame(
  Rank = factor(names(tax_completeness), levels = names(tax_completeness)),
  PercentUnclassified = 100 * tax_completeness
) %>%
  ggplot(aes(x = Rank, y = PercentUnclassified)) +
  geom_col() +
  labs(y = "% of ASVs unclassified", x = NULL) +
  theme_minimal()

What we are looking for: classification rates should be near 100% at Kingdom and Phylum, drop progressively at lower ranks, and end up somewhere in the 70–90% range at Genus for a well-sequenced 16S dataset. Numbers far below this (e.g. only 50% classified at Phylum) suggest the data has problems.

5.2 3b. Remove ASVs unclassified at the Order level

Following the approach used in the original study, we remove ASVs that the classifier could not assign to an Order. These are usually short or spurious sequences that survived chimera removal. Keeping them would add noise to all downstream analyses without contributing biological signal.

Code

n_before <- ntaxa(ps)

ps <- subset_taxa(ps, !is.na(Order))

n_after <- ntaxa(ps)
sprintf(
  "Removed %d ASVs unclassified at Order level (%d -> %d)",
  n_before - n_after, n_before, n_after
)

Optional exercise. Try the same filter at the Phylum level instead of Order. How many additional ASVs survive? Does that change the conclusions later? There is no single right answer — the threshold depends on the trade-off between including ambiguous data and discarding real signal.

5.3 3c. Remove chloroplast and mitochondrial sequences

The 16S rRNA gene is also present in chloroplasts and mitochondria, and the universal primers used in 16S sequencing amplify them readily from any plant or animal material in the sample. For a cucumber fermentation, plant DNA is unavoidable. SILVA classifies these as:

Chloroplasts: Order == "Chloroplast"
Mitochondria: Family == "Mitochondria"

Code

# Pull the taxonomy as a data frame
tax_df2 <- as.data.frame(tax_table(ps))

# Identify chloroplast and mitochondrial ASVs by name
chloro_asvs <- tax_df %>% filter(Order=="Chloroplast")%>% rownames()
mito_asvs   <- tax_df %>% filter(Family=="Mitochondria")%>% rownames()

length(chloro_asvs)
length(mito_asvs)

# What proportion of total reads do they represent?
sum(otu_table(ps)[, chloro_asvs]) / sum(otu_table(ps))
sum(otu_table(ps)[, mito_asvs])   / sum(otu_table(ps))

# Remove them
n_before <- ntaxa(ps)
ps <- prune_taxa(!taxa_names(ps) %in% c(chloro_asvs, mito_asvs), ps)
n_after <- ntaxa(ps)

sprintf(
  "Removed %d chloroplast/mitochondrial ASVs (%d -> %d)",
  n_before - n_after, n_before, n_after
)

The fraction of reads lost to organellar 16S can be substantial in plant matrix samples — sometimes more than half of all reads. If you see that, it is not necessarily a problem (it just means the cucumber DNA was abundant), but it is essential to remove these sequences before any diversity analysis. Otherwise a sample with more cucumber DNA will look artificially different from one with less.

5.4 3d. Identify and remove contaminants using the extraction blank

The extraction blank is processed alongside the samples but contains no biological material. Any ASV with a meaningful number of reads in the blank was introduced during DNA extraction or PCR, and should be treated with suspicion in the biological samples.

Why we are using a manual threshold rather than the decontam package. decontam provides statistical methods (prevalence-based and frequency-based) to identify contaminants, and is the recommended tool when blanks have sufficient read depth. In this dataset the extraction blank had a low total read count (~1,400 reads), which limits the statistical power of decontam’s methods. A conservative manual exclusion of ASVs that are nevertheless prevalent in the blank is a defensible alternative in this case. For datasets with deeper blanks or more complex contamination patterns, decontam is a better choice.

Code

# Identify the blank sample. The metadata Fermentation field is
# "Control" for both Control_extraction and Control_fermentation, so
# we filter on the row name (sample_alias).
blank_id <- "Control_extraction"

# Extract the blank's read counts as a named numeric vector
blank_counts <- as.numeric(otu_table(ps)[blank_id, ])
names(blank_counts) <- taxa_names(ps)

# How many ASVs have non-zero counts in the blank?
sum(blank_counts > 0)

# Inspect the most abundant ASVs in the blank
sort(blank_counts[blank_counts > 0], decreasing = TRUE) %>% head(20)

# Look at the taxonomy of those ASVs
top_blank_asvs <- names(sort(blank_counts, decreasing = TRUE))[1:20]
tax_table(ps)[top_blank_asvs, ]

This is often illuminating. Common kit and lab contaminants include Cutibacterium, Ralstonia, Corynebacterium, Sphingomonas and a handful of other taxa that are ubiquitous in DNA extraction kits. If you see these dominating the blank, the result is consistent with reagent-derived contamination.

Code

# Threshold: ASVs with more than 100 reads in the blank are flagged as
# contaminants. This is a conservative, dataset-specific choice — see
# below.
contaminant_threshold <- 100
contaminants <- names(blank_counts)[blank_counts > contaminant_threshold]
length(contaminants)
contaminants

# Remove the flagged ASVs from the entire dataset
n_before <- ntaxa(ps)
ps <- subset_taxa(ps, !taxa_names(ps) %in% contaminants)
n_after <- ntaxa(ps)

sprintf(
  "Removed %d contaminant ASVs based on blank (%d -> %d)",
  n_before - n_after, n_before, n_after
)

About the threshold. The choice of 100 reads is not magic. It is a conservative cut-off appropriate to this dataset, where the blank had a low total read depth and any ASV exceeding 100 reads in it is genuinely suspect. With a deeper blank, a relative threshold (e.g. ASVs whose read count in the blank exceeds X% of their total across all samples) is often more appropriate. Whatever you choose, document it.

5.5 3e. Remove the blank and the unfermented control from the analysis

The blank has served its purpose for contaminant flagging and should now be removed. The unfermented brine control (Control_fermentation) contains the natural starting microbiota of the raw materials — it is a useful reference but is not part of the fermentation comparison.

Code

# Remove the extraction blank
ps <- subset_samples(ps, sample_names(ps) != "Control_extraction")

# Drop the two commercial samples that failed sequencing
ps <- subset_samples(ps, !sample_names(ps) %in% c("Commercial_R1", "Commercial_R3"))

# After removing samples, some ASVs may now have zero reads everywhere.
ps <- prune_taxa(taxa_sums(ps) > 0, ps)

ps

Note on the surviving reference samples. Two of the three commercial replicates (R1 and R3) failed sequencing and were dropped above on QC grounds. The third replicate (Commercial_R2) and the unfermented brine baseline (Control_fermentation) are kept in the object — but with a specific role. They are reference samples, not part of the fermentation comparison: the commercial sample shows what an industrial product looks like, and the unfermented control shows the starting community before fermentation begins. We will use them as visual reference points on the ordinations and composition plots, but we will exclude them from any formal statistical test: with n = 1 there is no within-group variance for a centroid-based test to read.

6 Step 4: Final summary and save

A short summary of what came out of the cleaning is helpful to record before moving on.

Code

ps

# Sample composition
table(sample_data(ps)$Fermentation, sample_data(ps)$Time_point, useNA = "ifany")

# Read-depth distribution after cleaning
sample_sums(ps) %>% summary()

# Phylum-level breakdown of remaining ASVs
table(tax_table(ps)[, "Phylum"], useNA = "ifany")

Code

# Save the cleaned phyloseq object for use in subsequent blocks.
saveRDS(ps, file = "phyloseq_clean.rds")

In the next block we will load this .rds, apply singleton filtering and abundance transformations.

7 Take-home points

The DADA2 output is not yet biological data. Several rounds of filtering (taxonomic, organellar, contamination) should happen before any diversity statistic is computed.
The order of filters is deliberate: taxonomy-based first (because it touches the most ASVs), organellar second, then blank-based contamination on the cleaner set.
Blanks are an investment, not overhead. They tell you which ASVs to distrust, which is information that no statistical method can recover later from biological samples alone.
Document every threshold you choose and the reasoning behind it. The thresholds appropriate for this dataset will not be the right ones for yours.

8 Optional exercises

Try a stricter taxonomic filter (require classification at Family level rather than Order). How many ASVs are dropped?