Code
library(phyloseq)
library(microbiome)
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
library(readr)In this first block we will take the raw outputs of DADA2 and construct a phyloseq object that is ready for downstream analysis. The work is split into three logical steps:
phyloseq object..rds file so that subsequent blocks can start from a clean, reproducible state.The dataset we will use comes from a study of Romanian-style cucumber fermentations comparing spontaneous fermentation with inoculation by Lactiplantibacillus plantarum IBB082, with sampling at four time points.
library(phyloseq)
library(microbiome)
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
library(readr)The DADA2 pipeline produced three files for this workshop:
metabarcoding_dada2_asvtax.csv : combined ASV table containing both the taxonomy (Kingdom through Genus) and the count matrix (one column per sample). Rows are ASVs.metabarcoding_dada2_QC.csv : read-tracking statistics from each step of the DADA2 pipeline (input, filtered, denoised, merged, non-chimeric). Rows are samples.metabarcoding_metadata_2026.tsv : sample metadata. This is the ENA submission sheet, so it has many columns; we will keep only what we need.Note. The structure of the ASV file — taxonomy and counts in one wide table — is one common format. Other DADA2 workflows produce a separate matrix and a separate taxonomy table. The first thing to do with any new dataset is read the file and figure out what each column represents.
# Combined taxonomy + count table from DADA2.
# The first column has the ASV identifier (row name).
asvtax <- read_csv("Inputs/metabarcoding_dada2_asvtax.csv") %>%
rename(ASV_ID = `...1`)
# DADA2 read-tracking statistics
stats <- read_csv("Inputs/metabarcoding_dada2_QC.csv") %>%
rename(sample_alias = `...1`)
# Sample metadata (ENA submission sheet)
metadata <- read_tsv("Inputs/metabarcoding_metadata_2026.tsv")It is good practice to check the dimensions and structure of each table before combining them. Mismatched sample names between the count table and the metadata are the single most common reason phyloseq object construction fails.
# How many ASVs and how wide is the table?
dim(asvtax)
colnames(asvtax)[1:10] # first few columns: ASV_ID + 6 taxonomy columns + samples
# How many samples in the QC and metadata files?
dim(stats)
dim(metadata)
# Are the sample names consistent across files?
sample_cols <- setdiff(colnames(asvtax),
c("ASV_ID", "Kingdom", "Phylum", "Class", "Order",
"Family", "Genus"))
all(sample_cols %in% metadata$sample_alias)
all(sample_cols %in% stats$sample_alias)# Taxonomy: ASV_ID + 6 taxonomic ranks
taxa <- asvtax %>%
select(ASV_ID, Kingdom, Phylum, Class, Order, Family, Genus) %>%
column_to_rownames("ASV_ID")
# Count matrix: ASV_ID + sample columns. We want samples as rows for
# phyloseq, so we transpose after extracting.
counts <- asvtax %>%
select(-Kingdom, -Phylum, -Class, -Order, -Family, -Genus) %>%
column_to_rownames("ASV_ID") # rows = ASVs
# Transpose so rows are samples
seqtab <- t(as.matrix(counts))
dim(seqtab) # samples x ASVsThe metadata file has 25 columns of ENA submission information. For our analysis we need only a few. We will also derive two variables — Fermentation and Time_point — from the structured sample_alias field (e.g. Lactobacillus_H24_R1 → Fermentation = “Lactobacillus”, Time_point = “H24”, Replicate = “R1”).
metadata_clean <- metadata %>%
select(sample_alias, sample_description) %>%
separate(sample_alias,
into = c("Fermentation", "Time_point", "Replicate"),
sep = "_",
remove = FALSE,
fill = "right") %>%
# Fix the non-standard names: for Commercial_R* and Control_*
mutate(
Replicate = case_when(
Fermentation == "Commercial" ~ Time_point,
Fermentation == "Control" ~ NA_character_,
TRUE ~ Replicate
),
Time_point = case_when(
Fermentation %in% c("Commercial", "Control") ~ NA_character_,
TRUE ~ Time_point
)
) %>%
left_join(stats, by = "sample_alias") %>%
column_to_rownames("sample_alias")
head(metadata_clean)What
separate()does.tidyr::separate()splits a single column into several based on a delimiter. Here we split the structuredsample_alias(e.g.Lactobacillus_H24_R1) on the underscore, intoFermentation,Time_point, andReplicate. TheControl_extractionandControl_fermentationsamples have only two parts so theReplicatecolumn isNAfor them.
A phyloseq object combines up to four components: an OTU/ASV table, a taxonomy table, sample metadata, and (optionally) a phylogenetic tree. We will build it from the first three; this dataset does not include a tree, which means we will not be able to compute phylogenetic distances such as UniFrac. Bray-Curtis and Aitchison distances (covered in Block 4) do not require a tree.
# Convert each component to the phyloseq class it expects.
otu <- otu_table(seqtab, taxa_are_rows = FALSE)
tax <- tax_table(as.matrix(taxa))
samp <- sample_data(metadata_clean)
# Combine into a single phyloseq object.
ps <- phyloseq(otu, tax, samp)
ps# How many ASVs and samples?
ntaxa(ps)
nsamples(ps)
# Distribution of read depth across samples
sample_sums(ps) %>% summary()
# Plot the read depth — useful for spotting failed samples
microbiome::meta(ps) %>%
rownames_to_column("sample_alias") %>%
rename("Reads"="nonchim") %>%
ggplot(aes(x = reorder(sample_alias, Reads), y = Reads)) +
geom_col() +
coord_flip() +
labs(x = NULL, y = "Total reads") +
theme_minimal()This plot makes it immediately obvious which samples have very different sequencing depths. We expect to see at least one extraction blank with a much lower read count than the biological samples, and a couple of commercial control samples that came in with much lower input.
We now apply a series of conservative filters. Each filter is justified by a specific quality concern:
The order matters: we do the taxonomy-based filters first, then the contamination filter, so that the contamination assessment happens on a cleaner ASV set.
Before filtering, it is worth looking at how complete the taxonomic assignments are at each rank. A high proportion of NA at the Kingdom or Phylum level usually indicates either a database mismatch (wrong reference, wrong region) or sequencing noise.
tax_df <- as.data.frame(tax_table(ps))
# Proportion of ASVs unclassified (NA) at each rank
tax_completeness <- sapply(tax_df, function(x) mean(is.na(x) | x == ""))
tax_completeness
# Same as a plot
data.frame(
Rank = factor(names(tax_completeness), levels = names(tax_completeness)),
PercentUnclassified = 100 * tax_completeness
) %>%
ggplot(aes(x = Rank, y = PercentUnclassified)) +
geom_col() +
labs(y = "% of ASVs unclassified", x = NULL) +
theme_minimal()What we are looking for: classification rates should be near 100% at Kingdom and Phylum, drop progressively at lower ranks, and end up somewhere in the 70–90% range at Genus for a well-sequenced 16S dataset. Numbers far below this (e.g. only 50% classified at Phylum) suggest the data has problems.
Following the approach used in the original study, we remove ASVs that the classifier could not assign to an Order. These are usually short or spurious sequences that survived chimera removal. Keeping them would add noise to all downstream analyses without contributing biological signal.
n_before <- ntaxa(ps)
ps <- subset_taxa(ps, !is.na(Order))
n_after <- ntaxa(ps)
sprintf(
"Removed %d ASVs unclassified at Order level (%d -> %d)",
n_before - n_after, n_before, n_after
)Optional exercise. Try the same filter at the Phylum level instead of Order. How many additional ASVs survive? Does that change the conclusions later? There is no single right answer — the threshold depends on the trade-off between including ambiguous data and discarding real signal.
The 16S rRNA gene is also present in chloroplasts and mitochondria, and the universal primers used in 16S sequencing amplify them readily from any plant or animal material in the sample. For a cucumber fermentation, plant DNA is unavoidable. SILVA classifies these as:
Order == "Chloroplast"Family == "Mitochondria"# Pull the taxonomy as a data frame
tax_df2 <- as.data.frame(tax_table(ps))
# Identify chloroplast and mitochondrial ASVs by name
chloro_asvs <- tax_df %>% filter(Order=="Chloroplast")%>% rownames()
mito_asvs <- tax_df %>% filter(Family=="Mitochondria")%>% rownames()
length(chloro_asvs)
length(mito_asvs)
# What proportion of total reads do they represent?
sum(otu_table(ps)[, chloro_asvs]) / sum(otu_table(ps))
sum(otu_table(ps)[, mito_asvs]) / sum(otu_table(ps))
# Remove them
n_before <- ntaxa(ps)
ps <- prune_taxa(!taxa_names(ps) %in% c(chloro_asvs, mito_asvs), ps)
n_after <- ntaxa(ps)
sprintf(
"Removed %d chloroplast/mitochondrial ASVs (%d -> %d)",
n_before - n_after, n_before, n_after
)The fraction of reads lost to organellar 16S can be substantial in plant matrix samples — sometimes more than half of all reads. If you see that, it is not necessarily a problem (it just means the cucumber DNA was abundant), but it is essential to remove these sequences before any diversity analysis. Otherwise a sample with more cucumber DNA will look artificially different from one with less.
The extraction blank is processed alongside the samples but contains no biological material. Any ASV with a meaningful number of reads in the blank was introduced during DNA extraction or PCR, and should be treated with suspicion in the biological samples.
Why we are using a manual threshold rather than the
decontampackage.decontamprovides statistical methods (prevalence-based and frequency-based) to identify contaminants, and is the recommended tool when blanks have sufficient read depth. In this dataset the extraction blank had a low total read count (~1,400 reads), which limits the statistical power ofdecontam’s methods. A conservative manual exclusion of ASVs that are nevertheless prevalent in the blank is a defensible alternative in this case. For datasets with deeper blanks or more complex contamination patterns,decontamis a better choice.
# Identify the blank sample. The metadata Fermentation field is
# "Control" for both Control_extraction and Control_fermentation, so
# we filter on the row name (sample_alias).
blank_id <- "Control_extraction"
# Extract the blank's read counts as a named numeric vector
blank_counts <- as.numeric(otu_table(ps)[blank_id, ])
names(blank_counts) <- taxa_names(ps)
# How many ASVs have non-zero counts in the blank?
sum(blank_counts > 0)
# Inspect the most abundant ASVs in the blank
sort(blank_counts[blank_counts > 0], decreasing = TRUE) %>% head(20)
# Look at the taxonomy of those ASVs
top_blank_asvs <- names(sort(blank_counts, decreasing = TRUE))[1:20]
tax_table(ps)[top_blank_asvs, ]This is often illuminating. Common kit and lab contaminants include Cutibacterium, Ralstonia, Corynebacterium, Sphingomonas and a handful of other taxa that are ubiquitous in DNA extraction kits. If you see these dominating the blank, the result is consistent with reagent-derived contamination.
# Threshold: ASVs with more than 100 reads in the blank are flagged as
# contaminants. This is a conservative, dataset-specific choice — see
# below.
contaminant_threshold <- 100
contaminants <- names(blank_counts)[blank_counts > contaminant_threshold]
length(contaminants)
contaminants
# Remove the flagged ASVs from the entire dataset
n_before <- ntaxa(ps)
ps <- subset_taxa(ps, !taxa_names(ps) %in% contaminants)
n_after <- ntaxa(ps)
sprintf(
"Removed %d contaminant ASVs based on blank (%d -> %d)",
n_before - n_after, n_before, n_after
)About the threshold. The choice of 100 reads is not magic. It is a conservative cut-off appropriate to this dataset, where the blank had a low total read depth and any ASV exceeding 100 reads in it is genuinely suspect. With a deeper blank, a relative threshold (e.g. ASVs whose read count in the blank exceeds X% of their total across all samples) is often more appropriate. Whatever you choose, document it.
The blank has served its purpose for contaminant flagging and should now be removed. The unfermented brine control (Control_fermentation) contains the natural starting microbiota of the raw materials — it is a useful reference but is not part of the fermentation comparison.
# Remove the extraction blank
ps <- subset_samples(ps, sample_names(ps) != "Control_extraction")
# Drop the two commercial samples that failed sequencing
ps <- subset_samples(ps, !sample_names(ps) %in% c("Commercial_R1", "Commercial_R3"))
# After removing samples, some ASVs may now have zero reads everywhere.
ps <- prune_taxa(taxa_sums(ps) > 0, ps)
psNote on the surviving reference samples. Two of the three commercial replicates (R1 and R3) failed sequencing and were dropped above on QC grounds. The third replicate (Commercial_R2) and the unfermented brine baseline (Control_fermentation) are kept in the object — but with a specific role. They are reference samples, not part of the fermentation comparison: the commercial sample shows what an industrial product looks like, and the unfermented control shows the starting community before fermentation begins. We will use them as visual reference points on the ordinations and composition plots, but we will exclude them from any formal statistical test: with n = 1 there is no within-group variance for a centroid-based test to read.
A short summary of what came out of the cleaning is helpful to record before moving on.
ps
# Sample composition
table(sample_data(ps)$Fermentation, sample_data(ps)$Time_point, useNA = "ifany")
# Read-depth distribution after cleaning
sample_sums(ps) %>% summary()
# Phylum-level breakdown of remaining ASVs
table(tax_table(ps)[, "Phylum"], useNA = "ifany")# Save the cleaned phyloseq object for use in subsequent blocks.
saveRDS(ps, file = "phyloseq_clean.rds")In the next block we will load this .rds, apply singleton filtering and abundance transformations.