Part 1: Binning Quality Control


Overview

In this section, you’ll learn to assess bin quality using two complementary tools:

  • Tiara - Classifies sequences as bacterial, archaeal, or eukaryotic
  • CheckM2 - Assesses completeness and contamination of bacterial/archaeal bins

These tools answer different questions and reveal different types of contamination. Together, they provide a comprehensive view of bin quality.

The workshop material is available to download from Figshare.


Quality Assessment

Not all bins are created equal. Some represent nearly complete genomes while others are fragments or chimeras mixing multiple organisms. Quality assessment is critical before using MAGs for downstream analyses.

Completeness and Contamination

Completeness estimates what percentage of the genome is present, measured by checking for expected single-copy marker genes. A set of genes that should appear exactly once in every genome is used as a reference. If 95 out of 100 expected markers are found, the MAG is ~95% complete.

Contamination estimates whether multiple genomes are mixed together, measured by checking for duplicated single-copy marker genes. If marker genes appear more than once, it suggests sequences from multiple organisms were binned together. For example, finding the same ribosomal protein gene three times suggests that sequences from up to three organisms may be present in the bin.

Important: These are estimates based on universal marker genes, not perfect measurements. Novel organisms may lack expected markers (underestimating completeness), and true gene duplications can look like contamination.
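
To make the arithmetic concrete, here is a minimal sketch that works on a toy marker table. The table layout (marker name, then copy count) and the function name marker_stats are assumptions made for illustration. The formula is a deliberate simplification of the idea described above, not the calculation any particular tool performs.

```bash
# Toy illustration only. Input: one line per expected single-copy marker,
# formatted as "<marker_name> <copies_found_in_bin>" (hypothetical layout).
marker_stats() {
  awk '
    { total++ }
    $2 >= 1 { found++ }            # marker present at least once
    $2 >  1 { extra += $2 - 1 }    # each copy beyond the first counts as contamination
    END {
      if (total == 0) { print "no markers"; exit }
      printf "completeness=%.1f contamination=%.1f\n", 100 * found / total, 100 * extra / total
    }
  ' "$1"
}

# Example with 4 made-up markers: one missing, one duplicated.
demo_markers="$(mktemp)"
printf 'm1 1\nm2 0\nm3 2\nm4 1\n' > "$demo_markers"
marker_stats "$demo_markers"   # prints: completeness=75.0 contamination=25.0
rm -f "$demo_markers"
```

Three of the four markers are present (75% complete), and one extra copy across four markers gives 25% contamination.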

MIMAG Quality Standards

The community has established standards for MAG quality (Bowers et al. 2017):

Quality Tier   Completeness   Contamination   Additional Requirements
High           >90%           <5%             23S, 16S, and 5S rRNA genes; ≥18 tRNAs
Medium         ≥50%           <10%            -
Low            <50%           <10%            -

Most downstream analyses require at least medium-quality MAGs. High-quality MAGs are rare but extremely valuable as they represent near-complete genomes comparable to isolate genomes.
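
The thresholds in the table can be sketched as a small shell function. The function name mimag_tier and the labels it prints are hypothetical. It does not check the rRNA and tRNA requirements, so the top label is only a candidate for the High tier. The table does not define a tier for bins with 10% contamination or more, so those are reported separately (an assumption of this sketch).

```bash
# Assign a MIMAG-style tier from completeness and contamination percentages.
# rRNA/tRNA requirements are NOT checked here, hence "high_candidate".
mimag_tier() {
  awk -v comp="$1" -v cont="$2" 'BEGIN {
    if (cont >= 10)                  print "outside_tiers"   # not covered by the table
    else if (comp > 90 && cont < 5)  print "high_candidate"
    else if (comp >= 50)             print "medium"
    else                             print "low"
  }'
}

mimag_tier 95.2 1.3   # prints: high_candidate
mimag_tier 72 8       # prints: medium
```

A bin with 95% completeness but 7% contamination falls into the medium tier, because the High tier requires contamination below 5%.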

CheckM2: Quality Assessment Tool

CheckM2 uses machine learning to estimate MAG quality rapidly and accurately. It supports both bacteria and archaea, is faster than the original CheckM (~10x speedup), and works well even for novel organisms divergent from reference databases. CheckM2 will be used in this workshop to assess bin quality.
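
Once a quality report exists, you will often want to list only the bins that reach at least medium quality. The sketch below assumes a tab-separated report whose header row contains columns named Name, Completeness, and Contamination. Check the header of your own file, since these names are an assumption here, as is the function name filter_medium_or_better.

```bash
# Print the names of bins with completeness >= 50 and contamination < 10.
# Columns are located by header name, so their order does not matter.
filter_medium_or_better() {
  awk -F'\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("Name" in col) || !("Completeness" in col) || !("Contamination" in col)) {
        print "expected Name, Completeness and Contamination columns" > "/dev/stderr"
        exit 1
      }
      next
    }
    $col["Completeness"] >= 50 && $col["Contamination"] < 10 { print $col["Name"] }
  ' "$1"
}
```

Call it as filter_medium_or_better path/to/report.tsv, replacing the path with the location of your report.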


Sequence Classification with Tiara

Metagenomic samples often contain sequences from multiple domains of life. Coffee fermentation includes bacteria (our target), yeasts (eukaryotic, biologically relevant), and plant DNA (eukaryotic, contamination from coffee beans).

CheckM2 is designed for bacteria and archaea. If bins contain eukaryotic sequences (yeasts or plant DNA), CheckM2 may report spurious results or high contamination. Tiara helps identify which bins are truly bacterial versus eukaryotic or mixed, explaining why some bins have unexpected CheckM2 results.

Tiara is a deep learning tool that classifies DNA sequences into: bacteria, archaea, eukarya (eukaryotes), organelle (mitochondria, plastids), and unknown (ambiguous sequences). It’s fast (seconds per bin), works on assembled contigs, and helps identify contamination sources.

Types of Contamination

Understanding sequence composition reveals different contamination types:

Eukaryotic contamination - Yeast or plant sequences mixed with bacteria. CheckM2 estimates become unreliable here because it looks for bacterial markers in eukaryotic sequences.

Bacterial-bacterial mixing - Multiple bacterial species binned together. CheckM2 detects this as high contamination (duplicated bacterial markers).

The key insight: CheckM2 contamination and Tiara “Mixed” classification represent different types of contamination. You need both tools to understand bin quality fully.


Hands-On: Running Tiara

Prerequisites

Make sure you have:

  • Access to the CLIMB shared project
  • Data symlinked / copied to your workspace
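# Set your personal workspace (replace {your_name} with your user name)
export WSUSER=/shared/team/users/{your_name}
mkdir -p "$WSUSER"

# Copy the session's material to your location
cp -r /shared/team/2025_training/week5/tutorial/Session2_binning "$WSUSER"/

# Work from your workspace so the relative paths used below resolve
cd "$WSUSER"
ls -lh Session2_binning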
  • Tiara conda environment available
conda activate /shared/team/conda/aliseponsero.mmb-dtp/tiara
  • RStudio access on the notebook

Check Installation

# Verify installation
tiara -h

Understanding Tiara Usage

tiara -i INPUT.fa -o OUTPUT.txt --threads N

Key parameters:

  • -i or --input - Input FASTA file (your bin)
  • -o or --output - Output classification file
  • -t or --threads - Number of threads to use
  • -m or --min_len - Minimum contig length; shorter sequences are skipped (default: 3000)
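
Because contigs shorter than the minimum length are not classified, it can help to check how many contigs in a bin would pass a given cutoff before running Tiara. This is a minimal sketch using awk. The function name count_contigs_min_len is hypothetical, and the cutoff is passed explicitly rather than read from Tiara.

```bash
# Count contigs in a FASTA file that are at least a given length.
# $1 = FASTA file, $2 = minimum length in bp
count_contigs_min_len() {
  awk -v min_len="$2" '
    function tally() { if (seen) { total++; if (len >= min_len) kept++ } }
    /^>/ { tally(); seen = 1; len = 0; next }     # header: close previous record
    { gsub(/[ \t\r]/, ""); len += length($0) }    # sequence line: add its length
    END { tally(); printf "%d of %d contigs are >= %d bp\n", kept, total, min_len }
  ' "$1"
}
```

For example, count_contigs_min_len my_bin.fa 1000 reports how many contigs in my_bin.fa are at least 1000 bp long.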

Running Tiara on Example Bins

Let’s start with a bin that should be predominantly bacterial.

# Run Tiara
tiara -i Session2_binning/Example_bins/T16_ERR2231569_SemiBin_29.fa -o tiara_bacterial.txt --threads 2

Let’s now run Tiara on a predominantly eukaryotic bin:

# Run Tiara
tiara -i Session2_binning/Example_bins/T24_ERR2231569_SemiBin_1.fa -o tiara_eukaryotic.txt --threads 2

Sometimes binning algorithms are unable to generate good bins. Let’s run Tiara on a bin that mixes sequences from different domains:

# Run Tiara
tiara -i Session2_binning/Example_bins/T24_ERR2231570_SemiBin_133.fa -o tiara_mixed.txt --threads 2
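
The three commands above follow the same pattern, so they can be generalized to every bin in a folder. The sketch below is one possible way to do that. The function name run_tiara_batch, the DRY_RUN switch, and the output naming scheme (bin name plus _tiara.txt) are assumptions made for illustration. By default it only prints the commands, so you can inspect them before running anything.

```bash
# Print (dry run, the default) or execute one Tiara call per .fa file.
# $1 = folder with bins, $2 = output folder, $3 = threads (default 2)
run_tiara_batch() {
  local bin_dir="$1" out_dir="$2" threads="${3:-2}"
  local fasta name
  mkdir -p "$out_dir"
  for fasta in "$bin_dir"/*.fa; do
    [ -e "$fasta" ] || continue                 # folder contains no .fa files
    name="$(basename "$fasta" .fa)"
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "tiara -i $fasta -o $out_dir/${name}_tiara.txt --threads $threads"
    else
      tiara -i "$fasta" -o "$out_dir/${name}_tiara.txt" --threads "$threads"
    fi
  done
}
```

Running run_tiara_batch Session2_binning/Example_bins tiara_out prints the planned commands. Prefix the call with DRY_RUN=0 to execute them once the Tiara environment is active.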

Tiara Output Format

Tiara produces a tab-delimited file:

sequence_id classification probability
contig_001  bacteria 0.98
contig_002  bacteria 0.95
contig_003  eukarya  0.87
...

Columns:

  • sequence_id - Contig name
  • classification - Predicted class (bacteria/archaea/eukarya/organelle/unknown)
  • probability - Classification confidence (0-1)
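
A quick way to summarize one Tiara output file is to count contigs per class. This sketch assumes a tab-delimited file with one header line and the predicted class in the second column. The function name tiara_class_fractions is hypothetical. Percentages are based on contig counts, not on contig length, because the output file does not contain lengths.

```bash
# Summarize a Tiara output file: class, number of contigs, percent of contigs.
# $1 = Tiara output file (tab-delimited, header on the first line)
tiara_class_fractions() {
  awk -F'\t' '
    NR > 1 && NF >= 2 { n[$2]++; total++ }      # skip header, count each class
    END { for (c in n) printf "%s\t%d\t%.1f\n", c, n[c], 100 * n[c] / total }
  ' "$1" | sort
}
```

For example, tiara_class_fractions tiara_mixed.txt prints one line per class found in the mixed bin.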

Exploring Tiara and CheckM2 Results in R

Open RStudio and Load Data

Link to R Markdown: Tiara and CheckM2 Exploration

In this R session, you’ll:

  1. Load pre-computed Tiara results for all bins
  2. Visualize bin composition (bacteria vs eukarya)
  3. Classify bins as “Bacterial”, “Archaeal”, “Eukaryotic”, or “Mixed”
  4. Load CheckM2 quality metrics
  5. Merge Tiara and CheckM2 data
  6. Investigate: Does eukaryotic content explain CheckM2 contamination?
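
Step 3 of the list above can also be sketched on the command line, which is handy for a quick check before opening RStudio. The rule used here is an assumption, not necessarily the one in the R Markdown: a bin gets a domain label when more than 80% of its contigs belong to that domain, and is called “Mixed” otherwise. The function name label_bin is hypothetical, and the class is read from the second column of a tab-delimited Tiara output file.

```bash
# Label a bin from its Tiara output using a contig-count dominance cutoff.
# $1 = Tiara output file, $2 = cutoff in percent (default 80, an assumption)
label_bin() {
  awk -F'\t' -v cutoff="${2:-80}" '
    NR > 1 && NF >= 2 { n[$2]++; total++ }
    END {
      if (total == 0) { print "Empty"; exit }
      label = "Mixed"
      if (100 * n["bacteria"] / total > cutoff)      label = "Bacterial"
      else if (100 * n["archaea"] / total > cutoff)  label = "Archaeal"
      else if (100 * n["eukarya"] / total > cutoff)  label = "Eukaryotic"
      print label
    }
  ' "$1"
}
```

A bin with exactly 80% bacterial contigs is labelled “Mixed” under this rule, because the comparison is strictly greater than the cutoff.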

Key Questions to Explore

Q1: How many bins are predominantly bacterial (>80% bacteria)?

Q2: How many bins have significant eukaryotic content (>20% eukarya)?

Q3: Do eukaryotic bins have high CheckM2 contamination?

Q4: Do “Mixed” bins (bacteria + eukarya) have high CheckM2 contamination?
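
Q3 and Q4 come down to joining the two result sets by bin name and comparing contamination across Tiara labels. The sketch below shows one way to do this with awk. It assumes two header-less, tab-separated files that you prepare yourself: one with bin name and Tiara label, one with bin name and CheckM2 contamination. The file layouts and the function name mean_contamination_by_label are hypothetical.

```bash
# Mean CheckM2 contamination per Tiara label.
# $1 = TSV "bin<TAB>tiara_label", $2 = TSV "bin<TAB>contamination" (no headers)
mean_contamination_by_label() {
  awk -F'\t' '
    NR == FNR { label[$1] = $2; next }           # first file: remember each label
    ($1 in label) {                              # second file: bins present in both
      sum[label[$1]] += $2
      n[label[$1]]++
    }
    END { for (l in n) printf "%s\t%d\t%.2f\n", l, n[l], sum[l] / n[l] }
  ' "$1" "$2" | sort
}
```

The output has one line per label with the number of bins and their mean contamination. Bins that appear in only one of the two files are ignored.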

The Key Insight

CheckM2 and Tiara detect different types of contamination:

CheckM2 contamination - Detects bacterial-bacterial mixing (duplicated bacterial markers)

Tiara “Mixed” classification - Detects bacterial-eukaryotic mixing (bacteria + yeast/plant DNA)

Important finding: Mixed bins (bacteria + eukarya) often have LOW CheckM2 contamination because CheckM2 only analyzes the bacterial portion and ignores eukaryotic sequences!


Resources

  • Tiara: https://github.com/ibe-uw/tiara
  • CheckM2: https://github.com/chklovski/CheckM2
  • MIMAG Standards: Bowers et al. (2017) Nature Biotechnology