Workshop Overview: Taxonomic Profiling of Metagenomic Data


Workshop Information

This hands-on workshop introduces taxonomic profiling methods for metagenomic sequencing data. You’ll learn to profile microbial communities using MetaPhlAn4, Kraken2, and Bracken, then explore results interactively with Pavian.

We’ll work with coffee bean fermentation samples to discover how microbial communities change over time.

What you’ll learn:

Understand marker gene vs k-mer based profiling approaches
Run and compare MetaPhlAn4, Kraken2, and Bracken
Interpret taxonomic profiles and choose appropriate methods
Visualize community composition with Pavian

Workshop structure: Theory and live demonstrations, followed by hands-on exploration of coffee fermentation microbiomes using Pavian, and group discussion.

Dataset: Six timepoints from coffee bean fermentation showing microbial succession from diverse communities to lactic acid bacteria dominance.

The workshop material is available for download on Figshare.

What is Metagenomic Taxonomic Profiling?

Taxonomic profiling answers a fundamental question in microbiology: “who is there, and how much of each?” Traditional microbiology relies on culturing organisms individually, but most microbes cannot be cultured with standard laboratory methods. Metagenomics instead extracts DNA directly from environmental samples and sequences everything together, allowing us to study entire communities, including organisms that have not been cultured.

Taxonomic profiling takes millions of raw sequencing reads and determines which organisms they came from and their relative abundances. The output is a list showing, for example, that 35% of the community is Leuconostoc mesenteroides, 16% is Lactiplantibacillus plantarum, and so on. This compositional data reveals community structure and can be used to compare samples, track changes over time, or identify organisms of interest.
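To make the idea of a relative abundance profile concrete, here is a minimal Python sketch that turns per-taxon read counts into percentages. The counts are hypothetical and chosen only for illustration, and the function name is our own. Real profilers also correct for factors such as genome or marker length, which this sketch ignores.

```python
def relative_abundance(read_counts):
    """Convert per-taxon read counts into percentages of all classified reads."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no classified reads to normalize")
    return {taxon: 100.0 * n / total for taxon, n in read_counts.items()}


# Hypothetical read counts, not taken from the workshop dataset.
counts = {
    "Leuconostoc mesenteroides": 350,
    "Lactiplantibacillus plantarum": 160,
    "other taxa": 490,
}

profile = relative_abundance(counts)

# Print taxa from most to least abundant.
for taxon, pct in sorted(profile.items(), key=lambda kv: -kv[1]):
    print(f"{taxon}\t{pct:.1f}%")
```

Because the values are proportions of the classified reads, they always sum to 100%, whatever the sequencing depth.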

Why profile metagenomes? Profiling is typically the first step in metagenomic analysis. It’s used in microbiome research (human gut, soil, ocean), environmental monitoring (wastewater treatment, bioreactors), food and agriculture (fermentation monitoring, food safety), and clinical applications (pathogen detection, disease associations). For our workshop, we’re using profiling to understand how microbial communities change during coffee bean fermentation.

The challenge: Millions of short DNA reads must be accurately assigned to organisms from a reference database containing thousands of species. Related organisms share similar sequences, novel organisms aren’t in databases, and the analysis must be computationally feasible. Different profiling tools make different trade-offs between speed, sensitivity, and specificity.

Profiling Approaches and Tools

Two main strategies exist for taxonomic profiling. Marker gene-based approaches like MetaPhlAn4 focus on carefully selected unique sequences that identify specific organisms, similar to using fingerprints for identification. This provides high confidence in detected organisms but may miss taxa not covered by markers. K-mer based approaches like Kraken2 match short DNA sequences against comprehensive databases, providing fast and sensitive detection but with more potential false positives.

MetaPhlAn4: Marker Gene Profiling

MetaPhlAn4 uses a database of clade-specific marker genes to profile communities. It only analyzes reads mapping to these unique markers, making it fast and highly specific. The conservative approach means fewer false positives but often leaves a large proportion of reads unclassified, especially in understudied environments.
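The marker-based idea can be sketched in a few lines of Python. This is a toy illustration, not MetaPhlAn4's actual implementation: it assumes we already know how many reads mapped to each marker, uses a plain mean of marker coverage as the clade estimate, and assumes a fixed read length. The clade names, marker lengths, and read counts are all hypothetical.

```python
from statistics import mean

READ_LENGTH = 150  # assumed read length in bases


def clade_coverage(marker_hits, read_length=READ_LENGTH):
    """Average per-base coverage across one clade's markers.

    marker_hits is a list of (reads_mapped, marker_length_in_bases) pairs.
    """
    return mean(reads * read_length / length for reads, length in marker_hits)


def marker_profile(hits_by_clade):
    """Turn per-clade marker coverage into relative abundances (percent)."""
    coverage = {clade: clade_coverage(hits) for clade, hits in hits_by_clade.items()}
    total = sum(coverage.values())
    if total == 0:
        raise ValueError("no marker coverage to normalize")
    return {clade: 100.0 * c / total for clade, c in coverage.items()}


# Hypothetical marker hits: (reads mapped, marker length).
hits = {
    "species_A": [(40, 1000), (60, 1500), (20, 500)],
    "species_B": [(10, 1000), (15, 1500)],
}

marker_abundance = marker_profile(hits)
print(marker_abundance)
```

Reads that do not map to any marker never enter the calculation, which is why a marker-based profile can leave many reads unaccounted for.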

Kraken2: K-mer Classification

Kraken2 breaks reads into k-mers and matches them against a large database using exact matching. It’s extremely fast, processing millions of reads per minute, and highly sensitive. However, raw Kraken2 reports read counts rather than normalized abundances, and many reads are classified only to higher taxonomic levels like genus or family rather than species.
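A toy Python sketch of exact k-mer matching may help here. It is a deliberate simplification and not Kraken2's algorithm: the reference sequences are made up, k is tiny, and k-mers shared between taxa are simply skipped, whereas Kraken2 resolves shared k-mers through the taxonomy (lowest common ancestor).

```python
from collections import Counter

K = 5  # toy k-mer length; real databases use much longer k-mers


def kmers(seq, k=K):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(references, k=K):
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = {}
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify(read, index, k=K):
    """Vote with exact k-mer matches that are unique to one taxon.

    Returns the taxon with the most votes, or None if no k-mer matched.
    """
    votes = Counter()
    for kmer in kmers(read, k):
        taxa = index.get(kmer, ())
        if len(taxa) == 1:
            votes[next(iter(taxa))] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Made-up reference sequences for two hypothetical taxa.
references = {"taxon_A": "ACGTACGGTCA", "taxon_B": "TTGACCGTAGC"}
index = build_index(references)

for read in ["GTACGGT", "GACCGTA", "GGGGGGG"]:
    print(read, "->", classify(read, index))
```

The last read shares no k-mer with either reference, so it stays unclassified, which mirrors what happens to reads from organisms missing from the database.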

Bracken: Abundance Re-estimation

Bracken refines Kraken2 results by redistributing reads classified at higher taxonomic levels down to species level using Bayesian inference. It calculates the probability that genus-level reads actually came from specific species based on expected k-mer distributions. This process filters out poorly supported assignments and produces more accurate species-level abundance estimates, combining Kraken2’s sensitivity with improved specificity.
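The redistribution step can be illustrated with a simplified Bayes-rule sketch in Python. This is not Bracken's code: the function name, the read counts, and the probabilities are hypothetical. Here `p_genus_given_species[s]` stands for the assumed fraction of reads from species `s` that a classifier would only place at genus level, and the species-level read counts act as the prior.

```python
def redistribute(genus_reads, species_reads, p_genus_given_species):
    """Reassign genus-level reads to species in proportion to
    P(genus-level | species) * reads already assigned to that species."""
    weights = {
        s: p_genus_given_species[s] * n for s, n in species_reads.items()
    }
    total = sum(weights.values())
    if total == 0:
        # Nothing to base the redistribution on; leave counts unchanged.
        return {s: float(n) for s, n in species_reads.items()}
    return {
        s: n + genus_reads * weights[s] / total
        for s, n in species_reads.items()
    }


# Hypothetical inputs for one genus with two species.
species_reads = {"sp1": 300, "sp2": 100}
p_genus_given_species = {"sp1": 0.2, "sp2": 0.4}
genus_reads = 200

estimates = redistribute(genus_reads, species_reads, p_genus_given_species)
print(estimates)
```

In this toy case sp1 receives 60% of the genus-level reads and sp2 receives 40%, so the total number of reads is preserved while every read ends up at species level.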

Pavian: Interactive Visualization

Pavian is an R Shiny application providing interactive visualization of taxonomic profiles through Sankey diagrams, sample comparisons, and data tables. It accepts Kraken2, Bracken, and MetaPhlAn4 outputs, making it easy to explore patterns without coding. You’ll use Pavian to discover temporal dynamics in the coffee fermentation samples.

Choosing the Right Tool

The choice between methods depends on your research question. MetaPhlAn4 is ideal when you need high confidence in detected organisms and are working with well-characterized environments. Kraken2 with Bracken provides comprehensive coverage for diverse or novel environments where sensitivity matters. When uncertain, run both approaches and compare: agreement between methods increases confidence, while disagreements may reveal interesting biology or technical artifacts worth investigating.
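Comparing two profiles can start with something as simple as checking which taxa each method detects. The Python sketch below assumes each profile is a dictionary of taxon name to relative abundance in percent. The profiles, the function name, and the 0.1% detection threshold are hypothetical choices for illustration, and real comparisons also need taxon names harmonized between tools.

```python
def compare_profiles(a, b, min_abundance=0.1):
    """Split taxa into those detected by both profiles or by only one.

    A taxon counts as detected if its abundance is at least min_abundance (%).
    """
    detected_a = {t for t, pct in a.items() if pct >= min_abundance}
    detected_b = {t for t, pct in b.items() if pct >= min_abundance}
    return {
        "both": sorted(detected_a & detected_b),
        "only_a": sorted(detected_a - detected_b),
        "only_b": sorted(detected_b - detected_a),
    }


# Hypothetical profiles from a marker-based and a k-mer-based run.
marker_based = {"species_A": 60.0, "species_B": 39.95, "species_C": 0.05}
kmer_based = {"species_A": 55.0, "species_B": 30.0, "species_D": 15.0}

result = compare_profiles(marker_based, kmer_based)
print(result)
```

Taxa in the "both" group are the ones to trust most, while the "only" groups are the starting points for a closer look.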

Key concepts to remember: All profilers report relative abundances (percentages summing to 100%), not absolute cell counts. Unclassified reads represent organisms not in reference databases or sequences that can’t be confidently assigned. Different tools make different trade-offs: MetaPhlAn4 prioritizes precision (few false positives), Kraken2 prioritizes recall (detecting most organisms), and Bracken balances both.

Ready to start profiling? Let’s explore coffee fermentation microbiomes! ☕🔬

Next submodule:

MetaPhlAn4