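In this section, you’ll learn to assess bin quality using two complementary tools: CheckM2, which estimates completeness and contamination, and Tiara, which classifies sequences by domain of life.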
These tools answer different questions and reveal different types of contamination. Together, they provide a comprehensive view of bin quality.
The workshop material is available to download from Figshare.
Not all bins are created equal. Some represent nearly complete genomes while others are fragments or chimeras mixing multiple organisms. Quality assessment is critical before using MAGs for downstream analyses.
Completeness estimates what percentage of the genome is present, measured by checking for expected single-copy marker genes. A set of genes that should appear exactly once in every genome is used as a reference. If 95 out of 100 expected markers are found, the MAG is ~95% complete.
Contamination estimates whether multiple genomes are mixed together, measured by checking for duplicated single-copy marker genes. If marker genes appear more than once, it suggests sequences from multiple organisms were binned together. For example, finding the same ribosomal protein gene three times indicates potential contamination from three organisms.
Important: These are estimates based on universal marker genes, not perfect measurements. Novel organisms may lack expected markers (underestimating completeness), and true gene duplications can look like contamination.
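As a rough illustration of the arithmetic described above, the sketch below computes both estimates from a table of marker-gene copy numbers. The input format (marker name, copies found), the marker names and the function name `marker_stats` are hypothetical. The contamination formula (extra copies divided by the number of expected markers) is the simple marker-counting approach, and CheckM2's machine learning models do not reduce to this calculation.

```shell
#!/usr/bin/env bash
# Hypothetical input: one line per expected single-copy marker,
# "<marker_name> <copies_found_in_bin>".
marker_stats() {
  awk '
    { total++                              # one expected marker per line
      if ($2 >= 1) present++               # marker found at least once
      if ($2 > 1)  extra += $2 - 1 }       # copies beyond the first
    END {
      if (total == 0) { print "no markers"; exit 1 }
      printf "completeness=%.1f contamination=%.1f\n",
             100 * present / total, 100 * extra / total
    }'
}

# Toy data: 4 expected markers, one missing, one found three times.
printf '%s\n' "rpsB 1" "rplC 3" "rpoB 1" "gyrA 0" | marker_stats
# prints: completeness=75.0 contamination=50.0
```

Three of the four markers are present (75% complete), and the two surplus copies of one marker give 2/4 = 50% contamination.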
The community has established standards for MAG quality (Bowers et al. 2017):
| Quality Tier | Completeness | Contamination | Additional Requirements |
|---|---|---|---|
| High | >90% | <5% | 23S, 16S, 5S rRNA; ≥18 tRNAs |
| Medium | ≥50% | <10% | - |
| Low | <50% | <10% | - |
Most downstream analyses require at least medium-quality MAGs. High-quality MAGs are rare but extremely valuable as they represent near-complete genomes comparable to isolate genomes.
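The thresholds in the table can be turned into a small helper, sketched here under two assumptions: the function name `mag_tier` and the label "Below standard" (for bins with 10% contamination or more, which the table does not name) are made up for illustration, and the rRNA/tRNA requirements of the High tier are not checked.

```shell
#!/usr/bin/env bash
# Assign a quality tier from completeness and contamination (both in percent).
# Assumption: the rRNA/tRNA requirements for the High tier are checked elsewhere,
# so "High" here only means the two numeric thresholds are met.
mag_tier() {
  awk -v comp="$1" -v cont="$2" 'BEGIN {
    if      (cont >= 10)             print "Below standard"
    else if (comp > 90 && cont < 5)  print "High"
    else if (comp >= 50)             print "Medium"
    else                             print "Low"
  }'
}

mag_tier 95.2 1.3    # High
mag_tier 72.0 6.5    # Medium
mag_tier 35.0 2.0    # Low
mag_tier 88.0 14.0   # Below standard
```

Note that a bin with >90% completeness but 5-10% contamination falls into the Medium tier, because it misses the High contamination limit.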
CheckM2 uses machine learning to estimate MAG quality rapidly and accurately. It supports bacteria and archaea (separate marker sets), is faster than the original CheckM (~10x speedup), and works well even for novel organisms divergent from reference databases. CheckM2 will be used in this workshop to assess bin quality.
Metagenomic samples often contain sequences from multiple domains of life. Coffee fermentation includes bacteria (our target), yeasts (eukaryotic, biologically relevant), and plant DNA (eukaryotic, contamination from coffee beans).
CheckM2 is designed for bacteria and archaea. If bins contain eukaryotic sequences (yeasts or plant DNA), CheckM2 may report spurious results or high contamination. Tiara helps identify which bins are truly bacterial versus eukaryotic or mixed, explaining why some bins have unexpected CheckM2 results.
Tiara is a deep learning tool that classifies DNA sequences into: bacteria, archaea, eukarya (eukaryotes), organelle (mitochondria, plastids), and unknown (ambiguous sequences). It’s fast (seconds per bin), works on assembled contigs, and helps identify contamination sources.
Understanding sequence composition reveals different contamination types:
Eukaryotic contamination - Yeast or plant sequences mixed with bacteria. CheckM2 will be confused (looking for bacterial markers in eukaryotic sequences).
Bacterial-bacterial mixing - Multiple bacterial species binned together. CheckM2 detects this as high contamination (duplicated bacterial markers).
The key insight: CheckM2 contamination and Tiara “Mixed” classification represent different types of contamination. You need both tools to understand bin quality fully.
Make sure you have:
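# Replace {your_name} with your own user name
export WSUSER=/shared/team/users/{your_name}/
mkdir -p "$WSUSER"
# Copy the session's material to your location
cp -r /shared/team/2025_training/week5/tutorial/Session2_binning "$WSUSER"/
# Work from your own directory so the relative paths used below resolve
cd "$WSUSER"
ls -lh Session2_binning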
conda activate /shared/team/conda/aliseponsero.mmb-dtp/tiara
# Verify installation
tiara -h
tiara -i INPUT.fa -o OUTPUT.txt --threads N
Key parameters:
-i or --input - Input FASTA file (your bin)
-o or --output - Output classification file
--threads - Number of threads to use
-m or --min_len - Minimum contig length (default: 3000)
Let’s start with a bin that should be predominantly bacterial.
# Run Tiara
tiara -i Session2_binning/Example_bins/T16_ERR2231569_SemiBin_29.fa -o tiara_bacterial.txt --threads 2
Let’s now run on a predominantly eukaryotic bin:
# Run Tiara
tiara -i Session2_binning/Example_bins/T24_ERR2231569_SemiBin_1.fa -o tiara_eukaryotic.txt --threads 2
Sometimes binning algorithms are unable to produce clean bins. Let’s run Tiara on a bin that mixes sequences from different domains:
# Run Tiara
tiara -i Session2_binning/Example_bins/T24_ERR2231570_SemiBin_133.fa -o tiara_mixed.txt --threads 2
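The three runs above follow the same pattern, so they can be wrapped in a loop over every bin in a directory. This is a minimal sketch: the function name `run_tiara_on_bins`, the `_tiara.txt` output suffix and the `tiara_results` directory are hypothetical choices, and only the `-i`, `-o` and `--threads` options already used in this section are passed to Tiara.

```shell
#!/usr/bin/env bash
# Run Tiara on every FASTA bin in a directory, writing one report per bin.
# Usage: run_tiara_on_bins <bin_dir> <out_dir> [threads]
run_tiara_on_bins() {
  local bin_dir=$1 out_dir=$2 threads=${3:-2}
  local bin name
  mkdir -p "$out_dir"
  for bin in "$bin_dir"/*.fa; do
    [ -e "$bin" ] || continue              # skip if the glob matched nothing
    name=$(basename "$bin" .fa)            # bin name without the extension
    tiara -i "$bin" -o "$out_dir/${name}_tiara.txt" --threads "$threads"
  done
}

# Example call (hypothetical output directory name):
# run_tiara_on_bins Session2_binning/Example_bins tiara_results 2
```

Keeping one report per bin, named after the bin, makes it straightforward to match Tiara results to CheckM2 results later.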
Tiara produces a tab-delimited file:
sequence_id classification probability
contig_001 bacteria 0.98
contig_002 bacteria 0.95
contig_003 eukarya 0.87
...
Columns:
sequence_id - Contig name
classification - Predicted class (bacteria/archaea/eukarya/organelle/unknown)
probability - Classification confidence (0-1)
Link to R Markdown: Tiara and CheckM2 Exploration
In this R session, you’ll explore the Tiara and CheckM2 results to answer the following questions:
Q1: How many bins are predominantly bacterial (>80% bacteria)?
Q2: How many bins have significant eukaryotic content (>20% eukarya)?
Q3: Do eukaryotic bins have high CheckM2 contamination?
Q4: Do “Mixed” bins (bacteria + eukarya) have high CheckM2 contamination?
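Outside R, the thresholds in Q1 and Q2 can be checked for a single bin with a short awk summary. The sketch assumes the simplified three-column, tab-delimited layout shown earlier (real Tiara reports may use different column names), counts contigs rather than base pairs, and uses the hypothetical function name `tiara_summary`.

```shell
#!/usr/bin/env bash
# Summarise one Tiara report as percentages of contigs per class.
# Assumes a header line, then: sequence_id <TAB> classification <TAB> probability.
tiara_summary() {
  awk -F'\t' '
    NR == 1 { next }                        # skip the header line
    NF >= 2 { n++; count[$2]++ }            # tally contigs per class
    END {
      if (n == 0) { print "no contigs classified"; exit 1 }
      bac = 100 * count["bacteria"] / n
      euk = 100 * count["eukarya"] / n
      printf "contigs=%d bacteria=%.1f%% eukarya=%.1f%% ", n, bac, euk
      printf "predominantly_bacterial=%s ", (bac > 80 ? "yes" : "no")
      printf "eukaryotic_content=%s\n", (euk > 20 ? "yes" : "no")
    }' "$1"
}

# Toy report with two bacterial contigs and one eukaryotic contig.
toy=$(mktemp)
printf '%s\t%s\t%s\n' sequence_id classification probability \
  contig_001 bacteria 0.98 contig_002 bacteria 0.95 contig_003 eukarya 0.87 > "$toy"
tiara_summary "$toy"
rm -f "$toy"
```

Weighting by contig length instead of contig count may be more appropriate when a bin contains a few long contigs and many short ones.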
CheckM2 and Tiara detect different types of contamination:
CheckM2 contamination - Detects bacterial-bacterial mixing (duplicated bacterial markers)
Tiara “Mixed” classification - Detects bacterial-eukaryotic mixing (bacteria + yeast/plant DNA)
Important finding: Mixed bins (bacteria + eukarya) often have LOW CheckM2 contamination because CheckM2 only analyzes the bacterial portion and ignores eukaryotic sequences!
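One way to surface such bins is to cross-reference the two results. The sketch below assumes a hypothetical two-column table of per-bin Tiara labels (bin, label, for instance exported from the R session) and a CheckM2-style table with a header line and the bin name, completeness and contamination in the first three columns. Both layouts, the function name `flag_hidden_mixing` and the default 10% cutoff are assumptions for illustration.

```shell
#!/usr/bin/env bash
# List bins that Tiara labels "Mixed" but whose contamination estimate is low.
# Usage: flag_hidden_mixing <tiara_labels.tsv> <checkm2_table.tsv> [max_contamination]
flag_hidden_mixing() {
  awk -F'\t' -v max_cont="${3:-10}" '
    NR == FNR { label[$1] = $2; next }       # first file: remember each bin label
    FNR == 1  { next }                       # second file: skip the header line
    label[$1] == "Mixed" && $3 < max_cont {  # mixed by Tiara, clean by contamination
      printf "%s\tcontamination=%s\tTiara=Mixed\n", $1, $3
    }' "$1" "$2"
}

# Example call (hypothetical file names):
# flag_hidden_mixing tiara_labels.tsv quality_report.tsv 10
```

Bins reported by this helper look clean by contamination alone but still carry eukaryotic sequences, which is exactly the case described above.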
– Tiara: https://github.com/ibe-uw/tiara
– CheckM2: https://github.com/chklovski/CheckM2