MetaPhlAn4: Marker Gene-Based Taxonomic Profiling


The Workshop material is available to download in Figshare

What is MetaPhlAn4?

MetaPhlAn4 (Metagenomic Phylogenetic Analysis) is a computational tool for profiling the taxonomic composition of microbial communities from metagenomic shotgun sequencing data.

Key Features:

  • Uses ~1.1 million unique clade-specific marker genes
  • Prioritizes specificity over sensitivity
  • Provides relative abundance estimates (%)
  • Reports taxonomy from kingdom to strain level

How Does MetaPhlAn4 Work?

  1. Marker Gene Database (ChocoPhlAn)
    • Contains uniquely identifying marker genes for ~100,000 reference genomes
    • Each marker is specific to a particular clade (species, genus, family, etc.)
    • Markers are carefully selected to avoid cross-mapping
  2. Alignment Step
    • Reads are mapped to the marker gene database using Bowtie2
    • Only reads that map to marker genes are used for profiling
    • Unmapped reads are considered “unclassified”
  3. Abundance Calculation
    • Relative abundances calculated from marker gene coverage
    • Higher coverage = higher abundance
    • Results normalized to sum to 100%

Advantages

High specificity - Low false positive rate
Fast - Only maps to markers, not full genomes
Relative abundances - Ready for downstream statistical analyses

Limitations

⚠️ High unclassified fraction - Especially in novel/understudied environments
⚠️ Database-dependent - Can only detect organisms in the database
⚠️ Lower sensitivity - May miss rare or novel taxa

When to Use MetaPhlAn4?

  • When you need high confidence in detected organisms
  • For comparative studies across samples
  • When working with well-characterized environments (human gut, etc.)

Hands-On: Running and Exploring MetaPhlAn4

Prerequisites

Make sure you have:

  • Access to the climb shared project
  • Data symlinked / copied to your workspace
export WSUSER=/shared/team/users/{your_name}/
mkdir -p $WSUSER

# Symlinks to raw/ and tax_profiles/
ln -s /shared/team/2025_training/week5/coffee/raw $WSUSER/

# Copy the session's material to your location
cp -r /shared/team/2025_training/week5/tutorial/Session1_profiling $WSUSER/

ls -lh
  • MetaPhlAn4 conda environment available
conda activate /shared/team/conda/aliseponsero.mmb-dtp/metaphlan
  • RStudio access on the notebook

Part 1: Understanding the Database

Database Organization

MetaPhlAn4 requires a pre-built database of marker genes. For this workshop, the database is located at:

/shared/team/2025_training/week5/databases/metaphlan/

Key files in the database:

  • *.bt2l - Bowtie2 index files (for read mapping)
  • *.pkl - Marker gene database (pickled Python object)
  • *_species.txt - List of species in the database

To explore:

ls -lh /shared/team/2025_training/week5/databases/metaphlan/

Part 2: Running MetaPhlAn4

Step 1: Activate the Environment

conda activate /shared/team/conda/aliseponsero.mmb-dtp/metaphlan

Step 2: Check Installation

metaphlan --version

Step 3: View Help

metaphlan --help

Key parameters to understand:

  • Input files (paired-end: comma-separated, no space!)
  • --input_type fastq - Specifies input format
  • --db_dir - Path to database directory
  • --offline - Don’t check for database updates
  • --nproc - Number of processors to use
  • -o or --output_file - Path to output file
  • --mapout - Save Bowtie2 alignment (for re-analysis)
  • --skip_unclassified_estimation - Turn off unclassified estimation (default: ON)

Step 4: Construct the Command

For paired-end reads, MetaPhlAn4 requires input files separated by a comma with NO SPACE.

Template:

metaphlan [FORWARD],[REVERSE] \
    --input_type fastq \
    --db_dir [DATABASE_PATH] \
    --offline \
    --nproc [N] \
    -o [OUTPUT] \
    --mapout [MAPPING_OUTPUT]

⏱️ Runtime: 5-10 minutes for the subset sample –> We will use precomputed results

💡 Tip: The --mapout file can be reused to re-run MetaPhlAn4 with different parameters without re-mapping reads!


Part 3: Understanding MetaPhlAn4 Output

Output Format

Open the precomputed folder for Metaphlan outputs you copied in your local personal folder (e.g. T0_ERR2231567.profile.txt).

MetaPhlAn4 produces a tab-delimited file with two columns:

clade_name	relative_abundance
k__Bacteria|p__Firmicutes|c__Bacilli|...|s__Leuconostoc_mesenteroides	15.42
k__Bacteria|p__Firmicutes|c__Bacilli|...|s__Lactiplantibacillus_plantarum	8.73
...

Column 1: Taxonomic lineage

  • Uses | as separator
  • Taxonomic prefixes:
    • k__ = Kingdom
    • p__ = Phylum
    • c__ = Class
    • o__ = Order
    • f__ = Family
    • g__ = Genus
    • s__ = Species
    • t__ = Strain (clade)

Column 2: Relative abundance (%)

  • Values sum to 100% (for classified reads)
  • All taxonomic levels are reported

Optional: UNCLASSIFIED entry

  • If unclassified estimation is enabled, an additional line shows the percentage of unclassified reads (e.g T0_ERR2231567.unclprofile.txt)
  • Format: UNCLASSIFIED 63.4 (63.4% of reads unclassified)

Part 4: Exploring Outputs in RStudio

Open RStudio on your notebook and follow along!

We will need to first install the packages we will use in this session

install.packages("tidyverse")

if (!require(remotes)) { install.packages("remotes") }
remotes::install_github("fbreitwieser/pavian")

Load Required Libraries

library(tidyverse)

Read the MetaPhlAn4 Profile

# Read a precomputed profile
MYDIR="/shared/team/users/{your_name}/"

profile_no_uncl <- read_tsv(paste(MYDIR, "Session1_profiling/Metaphlan/T0_ERR2231567.profile.txt", sep="/"), 
                    skip = 4,
                    show_col_types = FALSE) %>%
        rename("clade_name"=`#clade_name`)

Compare Profiles With/Without Unclassified Estimation

# Read profile WITH unclassified
profile_uncl <- read_tsv(paste(MYDIR, "Session1_profiling/Metaphlan/T0_ERR2231567.unclprofile.txt", sep="/"), 
                         skip = 4,
                    show_col_types = FALSE) %>%
        rename("clade_name"=`#clade_name`)

# Check for UNCLASSIFIED entry
profile_uncl %>%
  filter(clade_name=="UNCLASSIFIED")

# Compare first few rows
profile_uncl %>% head(5)
profile_no_uncl %>% head(5)

Question: What does a high unclassified percentage tell us about the sample?

Extract and Count Species

# Extract only species-level entries (has "s__" but not "t__")
species <- profile_no_uncl %>%
  filter(str_detect(clade_name, "s__"),
         !str_detect(clade_name, "t__"))

# Count species
n_species <- species %>% nrow()
cat("Number of species detected:", n_species, "\n")

Show top 10 most abundant species:

# Sort by abundance in descending order and show top 10
species_sorted <- species %>%
  arrange(desc(.[[3]]))  # Sort by rel abundance

cat("\nTop 10 most abundant species:\n")
species_sorted %>% head(10)

# Extract just the species names for easier reading
species_clean <- species_sorted %>%
  mutate(species_name = str_extract(clade_name, "[^|]+$")) %>%  # Extract after last |
  select(species_name, relative_abundance)  # Rename abundance column

# Show top 10
species_clean %>% 
  head(10) %>%
  print()

# group all species with relative abundance below 5% as "Other"
species_grouped <-  species_clean %>%
    mutate(species_name=ifelse(relative_abundance<5, "Other", species_name)) %>%
    group_by(species_name) %>%
    summarize(relative_abundance=sum(relative_abundance))


Quick visualization :

# Bar plot of top species using ggplot2
species_clean %>%
  head(10) %>%
  mutate(species_name = str_trunc(species_name, 40)) %>%  # Truncate long names
  mutate(species_name = fct_reorder(species_name, relative_abundance)) %>%  # Order by abundance
  ggplot(aes(x = species_name, y = relative_abundance)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 Species in ERR2231567",
       x = "Species",
       y = "Relative Abundance (%)") +
  theme_minimal()

species_grouped %>%
  mutate(sample="T0") %>%
  mutate(species_name = str_trunc(species_name, 40)) %>%  # Truncate long names
  ggplot(aes(x = sample, y = relative_abundance, fill=species_name)) +
  geom_bar(stat="identity") +
  coord_flip() +
  labs(title = "Species composition at T0",
       x = "Species",
       y = "Relative Abundance (%)") +
  theme_minimal()

Additional Resources

📖 MetaPhlAn4 Documentation: https://github.com/biobakery/MetaPhlAn
💬 BioBakery Forum: https://forum.biobakery.org/c/Microbial-community-profiling/MetaPhlAn/
📊 MetaPhlAn4 Paper: Blanco-Míguez et al. (2023) Nature Biotechnology



Previous submodule:
Next submodule: