5 Denoising, Merging and Chimera Removal

5.1 Denoising with DADA2

The core DADA2 algorithm uses the error model learned in the previous chapter to distinguish true biological sequences from sequences that arose through PCR or sequencing errors. It works by asking: given the error model, is this slightly different sequence more likely to be a genuine variant, or an error-derived copy of a more abundant sequence?

The result is a set of Amplicon Sequence Variants (ASVs): exact sequences with no artificial clustering.

dadaFs_file <- file.path(path_precomp, "dadaFs.rds")
dadaRs_file <- file.path(path_precomp, "dadaRs.rds")

if (!file.exists(dadaFs_file)) {
  message("Denoising forward reads...")
  dadaFs <- dada(filtFs, err = errF, multithread = TRUE)
  saveRDS(dadaFs, dadaFs_file)
} else {
  dadaFs <- readRDS(dadaFs_file)
}

Sample 1 - 13226 reads in 4843 unique sequences.
Sample 2 - 178641 reads in 22804 unique sequences.
Sample 3 - 12109 reads in 4696 unique sequences.
Sample 4 - 1419 reads in 581 unique sequences.
Sample 5 - 34239 reads in 11798 unique sequences.
Sample 6 - 193289 reads in 26735 unique sequences.
Sample 7 - 221122 reads in 21753 unique sequences.
Sample 8 - 150467 reads in 15443 unique sequences.
Sample 9 - 189922 reads in 24078 unique sequences.
Sample 10 - 227640 reads in 22363 unique sequences.
Sample 11 - 210626 reads in 20738 unique sequences.
Sample 12 - 191040 reads in 24153 unique sequences.
Sample 13 - 233210 reads in 29369 unique sequences.
Sample 14 - 107509 reads in 25626 unique sequences.
Sample 15 - 219755 reads in 23609 unique sequences.
Sample 16 - 213873 reads in 22116 unique sequences.
Sample 17 - 230240 reads in 29372 unique sequences.
Sample 18 - 209712 reads in 42650 unique sequences.
Sample 19 - 241630 reads in 45010 unique sequences.
Sample 20 - 209072 reads in 36916 unique sequences.
Sample 21 - 253245 reads in 32445 unique sequences.
Sample 22 - 243119 reads in 43825 unique sequences.
Sample 23 - 238579 reads in 37930 unique sequences.
Sample 24 - 217397 reads in 41176 unique sequences.
Sample 25 - 218668 reads in 39233 unique sequences.
Sample 26 - 239236 reads in 43927 unique sequences.
Sample 27 - 237723 reads in 42642 unique sequences.
Sample 28 - 208982 reads in 43512 unique sequences.
Sample 29 - 256480 reads in 44096 unique sequences.

if (!file.exists(dadaRs_file)) {
  message("Denoising reverse reads...")
  dadaRs <- dada(filtRs, err = errR, multithread = TRUE)
  saveRDS(dadaRs, dadaRs_file)
} else {
  dadaRs <- readRDS(dadaRs_file)
}

Sample 1 - 13226 reads in 6970 unique sequences.
Sample 2 - 178641 reads in 28216 unique sequences.
Sample 3 - 12109 reads in 6149 unique sequences.
Sample 4 - 1419 reads in 729 unique sequences.
Sample 5 - 34239 reads in 15345 unique sequences.
Sample 6 - 193289 reads in 32391 unique sequences.
Sample 7 - 221122 reads in 29710 unique sequences.
Sample 8 - 150467 reads in 22351 unique sequences.
Sample 9 - 189922 reads in 30935 unique sequences.
Sample 10 - 227640 reads in 29549 unique sequences.
Sample 11 - 210626 reads in 28342 unique sequences.
Sample 12 - 191040 reads in 33939 unique sequences.
Sample 13 - 233210 reads in 35191 unique sequences.
Sample 14 - 107509 reads in 18547 unique sequences.
Sample 15 - 219755 reads in 31439 unique sequences.
Sample 16 - 213873 reads in 29943 unique sequences.
Sample 17 - 230240 reads in 36340 unique sequences.
Sample 18 - 209712 reads in 53263 unique sequences.
Sample 19 - 241630 reads in 55632 unique sequences.
Sample 20 - 209072 reads in 47285 unique sequences.
Sample 21 - 253245 reads in 41514 unique sequences.
Sample 22 - 243119 reads in 59760 unique sequences.
Sample 23 - 238579 reads in 52820 unique sequences.
Sample 24 - 217397 reads in 51954 unique sequences.
Sample 25 - 218668 reads in 49783 unique sequences.
Sample 26 - 239236 reads in 54978 unique sequences.
Sample 27 - 237723 reads in 49322 unique sequences.
Sample 28 - 208982 reads in 52167 unique sequences.
Sample 29 - 256480 reads in 55448 unique sequences.

5.1.1 What does the dada2 output tell us?

# Inspect the first sample's denoising result
dadaFs[[1]]

dada-class: object describing DADA2 denoising results
90 sequence variants were inferred from 4843 input unique sequences.
Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

The output reports how many unique sequences were identified in the sample and how many reads support each one. The ratio of reads to unique sequences gives an indication of how much error correction was performed.

5.2 Merging paired-end reads

After denoising, forward and reverse reads are merged to reconstruct the full amplicon sequence. Merging requires a minimum overlap region between the denoised forward and reverse reads — by default, at least 12 bp.

mergers_file <- file.path(path_precomp, "mergers.rds")

if (!file.exists(mergers_file)) {
  message("Merging paired reads...")
  mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose = TRUE)
  saveRDS(mergers, mergers_file)
} else {
  mergers <- readRDS(mergers_file)
}

Merging failure

If a large proportion of reads fail to merge (visible in the tracking table later), the most common causes are:

Insufficient overlap — increase truncLen for one or both reads, or reduce it if the reads are too short
Primer residues — verify that primers were fully removed before this analysis
Wrong amplicon size — the reads must cover the amplicon with some overlap

5.3 Build the sequence table

makeSequenceTable() produces a sample × ASV matrix of read counts. This is the central data object in DADA2.

seqtab <- makeSequenceTable(mergers)
cat("Dimensions (samples × ASVs):", dim(seqtab), "\n")

Dimensions (samples × ASVs): 29 18934

5.3.1 Sequence length distribution

The expected amplicon length for the V3–V4 region is approximately 400–470 bp. Any sequences outside this range are likely artefacts (non-specific amplification or chimeras) and will often be removed in subsequent steps.

seq_lengths <- table(nchar(getSequences(seqtab)))
print(seq_lengths)


  260   261   385   387   388   394   398   399   400   401   402   403   404 
   62     2     1    12     1     1     1     1     2     5   132     8    75 
  405   406   407   408   409   410   411   412   413   414   416   421   422 
    5    12   112     7     7     5     6     3     4     3     2    12   102 
  425   426   427   428   443   444 
   13   652 15440  1785     2   459

# Visualise
barplot(seq_lengths,
        xlab = "Sequence length (bp)",
        ylab = "Number of ASVs",
        main = "ASV length distribution (before chimera removal)",
        las  = 2)

5.4 Chimera removal

Chimeric sequences are PCR artefacts created when an incomplete extension product from one cycle acts as a primer in a subsequent cycle, fusing the 5′ end of one template with the 3′ end of another. DADA2 identifies chimeras using the removeBimeraDenovo() function, which checks whether each sequence can be reconstructed from two more-abundant “parent” sequences.

seqtab_file <- file.path(path_precomp, "seqtab_nochim.rds")

if (!file.exists(seqtab_file)) {
  message("Removing chimeras...")
  seqtab.nochim <- removeBimeraDenovo(seqtab,
                                       method      = "consensus",
                                       multithread = TRUE,
                                       verbose     = TRUE)
  saveRDS(seqtab.nochim, seqtab_file)
} else {
  seqtab.nochim <- readRDS(seqtab_file)
}

cat("ASVs before chimera removal:", ncol(seqtab), "\n")

ASVs before chimera removal: 18934

cat("ASVs after  chimera removal:", ncol(seqtab.nochim), "\n")

ASVs after  chimera removal: 1091

pct_retained <- round(100 * sum(seqtab.nochim) / sum(seqtab), 1)
cat("Proportion of reads retained after chimera removal:", pct_retained, "%\n")

Proportion of reads retained after chimera removal: 89.8 %

Interpreting chimera removal

It is normal for a large number of unique sequences to be flagged as chimeric — often 10–30 % of ASVs. However, chimeric reads usually represent a small proportion of total reads (< 10 %), because chimeras are typically rare relative to their parent sequences. If you lose > 20 % of reads here, double-check that primers were fully trimmed.

seq_lengths_nc <- table(nchar(getSequences(seqtab.nochim)))
print(seq_lengths_nc)


260 261 385 387 388 394 399 400 401 402 403 404 405 406 407 408 409 410 411 412 
 21   1   1  12   1   1   1   2   5 113   8  20   5   4  86   7   7   4   3   2 
413 414 416 421 422 425 426 427 428 444 
  1   3   2   9  80   3  57 482  79  71

barplot(seq_lengths_nc,
        xlab = "Sequence length (bp)",
        ylab = "Number of ASVs",
        main = "ASV length distribution (after chimera removal)",
        las  = 2)

5.5 Tracking reads through the pipeline

A key quality-control step is to track how many reads survive each stage of the pipeline. Large losses at any single step point to a specific problem.

getN <- function(x) sum(getUniques(x))

track <- cbind(
  out,
  denoisedF = sapply(dadaFs, getN),
  denoisedR = sapply(dadaRs, getN),
  merged    = sapply(mergers, getN),
  nonchim   = rowSums(seqtab.nochim)
)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR",
                     "merged", "nonchim")
rownames(track) <- sample.names

write.csv(track, "dada2_QC.csv")
track

                      input filtered denoisedF denoisedR merged nonchim
Commercial_R1         16778    13226     13183     13200  13113   12499
Commercial_R2        218777   178641    177733    178459 176766  176479
Commercial_R3         14993    12109     12049     12098  11922   11670
Control_extraction     2345     1419      1411      1414   1403    1403
Control_fermentation  42709    34239     34114     34105  33515   33162
Lactobacillus_H24_R1 236996   193289    192552    192803 190009  178597
Lactobacillus_H24_R2 271744   221122    220906    220980 220054  217351
Lactobacillus_H24_R3 185402   150467    150220    150328 149791  148672
Lactobacillus_H48_R1 232087   189922    189536    189665 187248  177255
Lactobacillus_H48_R2 277388   227640    227197    227476 226648  222104
Lactobacillus_H48_R3 257945   210626    210217    210457 209083  205846
Lactobacillus_W1_R1  237242   191040    190633    190702 189362  185573
Lactobacillus_W1_R2  283367   233210    232746    232883 230755  224196
Lactobacillus_W1_R3  135637   107509    107352    107386 106982  105252
Lactobacillus_W2_R1  270194   219755    219374    219573 218558  217417
Lactobacillus_W2_R2  262219   213873    213132    213708 212577  209841
Lactobacillus_W2_R3  280448   230240    229271    229942 227359  218116
Spontaneous_H24_R1   260410   209712    208384    208726 197139  162260
Spontaneous_H24_R2   297828   241630    240341    240755 230290  169513
Spontaneous_H24_R3   258467   209072    208224    208561 202745  166417
Spontaneous_H48_R1   311391   253245    252497    252931 229750  203601
Spontaneous_H48_R2   302367   243119    241992    242408 232963  186846
Spontaneous_H48_R3   295072   238579    237584    238146 230008  193478
Spontaneous_W1_R1    266958   217397    216521    216709 206641  165163
Spontaneous_W1_R2    271112   218668    218010    217981 207723  180948
Spontaneous_W1_R3    295075   239236    238162    238617 230440  187702
Spontaneous_W2_R1    291669   237723    236868    237129 225123  190377
Spontaneous_W2_R2    259494   208982    208097    208521 200762  151955
Spontaneous_W2_R3    314666   256480    255722    255872 248160  210209

5.5.1 Visualise read tracking

step_labels <- c(
  input     = "Raw Input",
  filtered  = "Quality Filtered",
  denoisedF = "Denoised (Fwd)",
  denoisedR = "Denoised (Rev)",
  merged    = "Merged Pairs",
  nonchim   = "Non-chimeric"
)

track_long <- as.data.frame(track) %>%
  rownames_to_column("Sample") %>%
  pivot_longer(-Sample, names_to = "Step", values_to = "Reads") %>%
  mutate(Step = factor(Step, levels = names(step_labels)))

ggplot(track_long, aes(x = Step, y = Reads, fill = Step)) +
  geom_boxplot(outlier.size = 1) +
  scale_x_discrete(labels = step_labels) +
  scale_fill_viridis_d() +
  labs(x = "Pipeline Step", y = "Number of Reads",
       title = "Read counts through the DADA2 pipeline") +
  theme_minimal() +
  theme(axis.text.x  = element_text(angle = 45, hjust = 1),
        legend.position = "none")

What to look for:

A gradual, smooth decline across steps is expected and healthy
A sudden large drop at the filtering step → truncation lengths may be too aggressive
A large drop at merging → check overlap between reads; check primers are removed
A large drop at chimera removal → investigate whether primers are truly absent