Workshop Overview: MAGs and Binning

Workshop Information

This hands-on workshop introduces metagenome-assembled genomes (MAGs) and the binning process used to recover individual genomes from complex metagenomic samples. You’ll learn to assess bin quality, classify sequences, dereplicate redundant bins, and assign taxonomy using GTDB-Tk. We’ll work with coffee bean fermentation samples to explore how different assembly strategies affect MAG recovery.

What you’ll learn:

Understand what MAGs are and why we generate them
Classify sequences as bacterial, archaeal, or eukaryotic using Tiara
Assess MAG quality using CheckM2 and MIMAG standards
Dereplicate bins to obtain non-redundant MAG sets
Assign taxonomy using GTDB-Tk and understand classification confidence
Compare individual assembly vs co-assembly strategies
Recognize MAG limitations and when to use alternative approaches

Workshop structure: Theory presentations alternating with live demonstrations and hands-on data exploration in R, followed by independent comparative analysis.

Dataset: Coffee bean fermentation samples from Ecuador showing microbial succession from early colonizers through lactic acid bacteria dominance.

The workshop material is available for download on Figshare.

What are MAGs?

Traditional microbiology requires isolating and culturing organisms individually, but most microbes cannot be cultured in the lab. Metagenomics extracts DNA directly from environmental samples and sequences everything together. But how do we recover individual genomes from this mixed soup of DNA?

Metagenome-Assembled Genomes (MAGs) are draft genomes reconstructed computationally from metagenomic data without culturing. The process involves assembling short DNA reads into longer contigs, then grouping (binning) contigs that likely came from the same organism. The result is a collection of genome bins representing individual species or strains in the community.

Why generate MAGs? MAGs provide access to the genomic content of unculturable organisms, enabling discovery of novel species and metabolic capabilities. They allow genome-resolved metagenomics, moving beyond “who is there?” to “what can they do?” and “how do they differ?” MAGs are used to study novel organisms (including entire new phyla), functional potential (biosynthetic gene clusters, metabolic pathways), strain-level diversity, and horizontal gene transfer.

The process overview:

Assembly - Assemble short reads into longer contigs (done before workshop)
Binning - Group contigs from the same organism using composition and coverage signals
Quality assessment - Evaluate completeness and contamination
Dereplication - Remove redundant bins from multiple assemblies
Taxonomy - Assign taxonomic classification to MAGs

For our workshop, we’re using pre-computed bins from coffee fermentation samples to understand how fermentation microbiomes can be characterized at the genome level.

Assembly Strategies

Before binning, metagenomic reads must be assembled into contigs. Two main strategies exist, each with different trade-offs.

Individual Assembly

Each sample is assembled separately. This preserves sample-specific variation, including strain differences across timepoints or conditions. Individual assembly is better for capturing temporal dynamics and strain-level diversity, but may struggle with low-abundance organisms that have insufficient coverage in single samples.

Co-assembly

All samples are combined and assembled together. This increases overall coverage, making it easier to assemble low-abundance organisms that appear across multiple samples. However, co-assembly collapses strain variation, creating consensus sequences that represent the “average” genome across samples. This can lead to chimeric assemblies if closely related strains differ between samples.
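To make the coverage argument concrete, here is a minimal sketch of the arithmetic. It assumes mean coverage can be approximated as (number of reads × read length) / genome size. The genome size, read length, and read counts are hypothetical values chosen for illustration, not numbers from the coffee dataset.

```python
def mean_coverage(n_reads, read_length, genome_size):
    """Approximate mean depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


# Hypothetical low-abundance organism (all values are illustrative assumptions).
GENOME_SIZE = 2_000_000   # bp
READ_LENGTH = 150         # bp
reads_per_sample = [20_000, 35_000, 25_000, 40_000]  # reads from this organism

# Individual assembly: each sample only sees its own reads.
per_sample = [mean_coverage(n, READ_LENGTH, GENOME_SIZE) for n in reads_per_sample]

# Co-assembly: reads from all samples are pooled before assembly.
pooled = mean_coverage(sum(reads_per_sample), READ_LENGTH, GENOME_SIZE)

for i, cov in enumerate(per_sample, start=1):
    print(f"sample {i}: {cov:.2f}x")
print(f"co-assembly: {pooled:.2f}x")
```

The pooled depth is simply the sum of the per-sample depths, which is why an organism that is too shallow to assemble in any single sample can become assemblable once samples are combined.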

Which to choose? The answer depends on your question. For studies focused on strain dynamics or temporal changes, individual assembly preserves important variation. For recovering rare organisms or maximizing diversity discovery, co-assembly provides better sensitivity. The best practice is often to use both approaches and dereplicate the results.

For this workshop: We’ll work primarily with individual assembly bins to understand the core workflow, then explore how co-assembly results differ during the independent analysis section.

Binning: Grouping Contigs into Genomes

Binning algorithms group contigs that likely came from the same organism. This is challenging because we don’t know which organism each contig came from, and related organisms share similar sequences.

Two Main Signals

1. Composition (Tetranucleotide Frequency)

Different organisms have different genomic “signatures” based on their DNA composition patterns. Specifically, the frequency of 4-nucleotide sequences (tetranucleotides like ATCG, GCTA, etc.) is relatively consistent within a genome but varies between species. This reflects evolutionary history, GC content, and codon usage patterns. Binning algorithms calculate these frequencies for each contig and group contigs with similar patterns.
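The composition signal can be sketched in a few lines of Python. This is a simplified illustration, not how SemiBin2 or any other binner computes its features: real tools typically also merge reverse-complement k-mers and work on contigs that are thousands of bases long. The toy "genomes" and contig names below are hypothetical.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(sequence):
    """Return the relative frequency of each 4-mer in a DNA sequence."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + 4] for i in range(len(sequence) - 3))
    # Only count windows made of A, C, G, T (skips ambiguous bases such as N).
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


def euclidean_distance(a, b):
    """Distance between two frequency profiles; smaller means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Two toy genomes with very different composition (hypothetical sequences).
genome_1 = "ATATATTTAAATATATTAAT" * 10   # AT-rich
genome_2 = "GCGCGGCCGCGGGCCCGCGC" * 10   # GC-rich

contig_a = genome_1[:100]     # two different stretches of genome_1
contig_b = genome_1[7:107]
contig_c = genome_2[:100]     # one stretch of genome_2

freq_a = tetranucleotide_frequencies(contig_a)
freq_b = tetranucleotide_frequencies(contig_b)
freq_c = tetranucleotide_frequencies(contig_c)

d_ab = euclidean_distance(freq_a, freq_b)
d_ac = euclidean_distance(freq_a, freq_c)
print(f"same genome:      {d_ab:.3f}")
print(f"different genome: {d_ac:.3f}")
```

Contigs cut from the same toy genome end up with nearly identical profiles, while the contig from the other genome sits far away. Binners exploit the same idea at much larger scale.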

2. Coverage (Abundance)

If multiple samples are available, organisms present at different abundances across samples create differential coverage patterns. Contigs from the same organism should have correlated coverage across samples. For example, if an organism increases from 1% to 10% abundance between timepoints, all its contigs should show that 10-fold increase. This signal is particularly powerful for separating closely related organisms.
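As a rough illustration of the coverage signal, the sketch below compares per-sample coverage profiles using a Pearson correlation. The contig names and depth values are hypothetical, and actual binners use more elaborate statistical models than a plain correlation.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical mean read depth of three contigs across four timepoints.
coverage = {
    "contig_1": [2.0, 5.0, 12.0, 20.0],   # increases over time
    "contig_2": [2.2, 4.8, 11.5, 21.0],   # tracks contig_1 closely
    "contig_3": [18.0, 11.0, 4.0, 1.5],   # decreases over time
}

r_same = pearson(coverage["contig_1"], coverage["contig_2"])
r_diff = pearson(coverage["contig_1"], coverage["contig_3"])
print(f"contig_1 vs contig_2: r = {r_same:.3f}")
print(f"contig_1 vs contig_3: r = {r_diff:.3f}")
```

Here contig_1 and contig_2 rise together and are strongly correlated, so they are plausible members of the same bin. contig_3 follows the opposite trend and most likely belongs to a different organism.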

Why Binning Works (and When It Fails)

Binning works well when organisms have distinct compositional signatures and differential coverage patterns. It succeeds for moderate-to-high abundance organisms (as a rough guide, >1% of the community) that have sufficient coverage for assembly, and for organisms that are genomically distinct from their neighbors.

Binning struggles with closely related strains (nearly identical composition), low-abundance organisms (insufficient coverage for good statistics), organisms with unusual genomic features (many repeats, mobile elements), and highly fragmented assemblies (short contigs have less signal).
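One way to see why short contigs carry less compositional signal is to count how many 4-base windows a contig offers relative to the 256 possible tetranucleotides. The sketch below is simple arithmetic on hypothetical contig lengths; it is not a minimum-length rule used by any particular tool.

```python
def windows_per_kmer(contig_length, k=4):
    """Average number of k-mer windows available per possible k-mer."""
    n_windows = max(contig_length - k + 1, 0)   # sliding windows in the contig
    n_possible = 4 ** k                         # 256 for tetranucleotides
    return n_windows / n_possible


# Hypothetical contig lengths in base pairs.
for length in (1_000, 2_500, 10_000, 50_000):
    ratio = windows_per_kmer(length)
    print(f"{length:>6} bp: {ratio:6.1f} windows per tetranucleotide")
```

A 1,000 bp contig provides fewer than four observations per tetranucleotide on average, so its frequency profile is noisy. A 50,000 bp contig provides close to two hundred, which gives a much more stable estimate.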

Modern Binning Tools

SemiBin2 (used for this workshop) employs deep learning with pre-trained models for different environments. It can use single-sample or multi-sample mode, provides fast and accurate binning, and includes models for human gut, ocean, soil, and other environments.

Other tools include MetaBAT2 (uses coverage and composition, widely used baseline), CONCOCT (combines composition with coverage across samples, good for multi-sample data), and MaxBin2 (combines composition and coverage, using single-copy marker genes to estimate the number of genomes and seed the bins). Best practice often involves running multiple binners and using tools like DAS Tool to combine results.
