Exploring Functional Profiles with HUMAnN3


Overview

In this section, you’ll explore pre-computed HUMAnN3 results from a coffee bean fermentation time series. You’ll learn to:

  • Understand HUMAnN3 output file structures
  • Explore pathway abundance and coverage
  • Visualize functional changes over fermentation time
  • Interpret stratified outputs to see species contributions
  • Investigate specific pathways of biological interest

The workshop material is available to download from Figshare.


Understanding HUMAnN3 Output Files

HUMAnN3 produces three main output files per sample. Each file provides different but complementary information about the functional profile.

Gene Families File

The genefamilies.tsv file contains abundances of gene families (UniRef90 IDs) detected in your sample.

Units: RPK (Reads Per Kilobase) - normalized by gene length to account for the fact that longer genes naturally get more reads.

Format:

# Gene Family              RPK_Abundance
UniRef90_A0A123           45.2
UniRef90_A0A123|g__Lactobacillus.s__Lactobacillus_plantarum   40.1
UniRef90_A0A456           15.8
UNMAPPED                  2500.0

Stratification: Lines without | are unstratified (community total). Lines with | show which species contribute to that gene family.

UNMAPPED: Reads that didn’t match any gene in the database. This is normal - metagenomic samples contain novel sequences not in reference databases.
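The stratification rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of HUMAnN3: the function name split_strata is hypothetical, and the input is the toy table from the format example, rewritten with tab separators as in a real .tsv file.

```python
import csv
import io

# Toy table from the format example above (real files are tab-separated).
EXAMPLE = (
    "# Gene Family\tRPK_Abundance\n"
    "UniRef90_A0A123\t45.2\n"
    "UniRef90_A0A123|g__Lactobacillus.s__Lactobacillus_plantarum\t40.1\n"
    "UniRef90_A0A456\t15.8\n"
    "UNMAPPED\t2500.0\n"
)


def split_strata(handle):
    """Split a HUMAnN-style table into (unstratified, stratified) dicts.

    Hypothetical helper: unstratified maps feature -> value, and
    stratified maps (feature, taxon) -> value.
    """
    unstratified = {}
    stratified = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header line
        feature, value = row[0], float(row[1])
        if "|" in feature:
            name, taxon = feature.split("|", 1)
            stratified[(name, taxon)] = value
        else:
            unstratified[feature] = value
    return unstratified, stratified


totals, by_species = split_strata(io.StringIO(EXAMPLE))
print(totals)
print(by_species)
```

To read a real file, pass an open file handle instead of the io.StringIO object.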

Pathway Abundance File

The pathabundance.tsv file contains abundances of metabolic pathways from the MetaCyc database.

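Units: HUMAnN3 writes pathway abundances in RPK-derived units by default. The example below assumes the table has been renormalized to CPM (Copies Per Million) with humann_renorm_table, which adjusts for sequencing depth so that samples can be compared fairly.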

How it’s calculated: HUMAnN3 maps gene families to reactions, and reactions to pathways, using MetaCyc definitions. If a pathway requires genes A, B, and C, its abundance is derived from the abundances of those genes rather than from reads assigned to the pathway directly.

Format:

# Pathway                 CPM_Abundance
PWY-5484                  125.5
PWY-5484|g__Lactobacillus 100.2
PWY-5484|g__Leuconostoc   25.3
UNINTEGRATED              450.0

UNINTEGRATED: Genes detected but not part of any known pathway. This is normal - not all genes belong to characterized pathways.

Pathway Coverage File

The pathcoverage.tsv file shows what proportion of each pathway is present.

Units: Coverage (0-1) representing the fraction of pathway steps detected.

Purpose: Distinguishes complete pathways from incomplete ones. High abundance doesn’t always mean a complete pathway!

Format:

# Pathway         Coverage
PWY-5484          0.95
PWY-6731          0.60
PWY-7332          0.30

Interpretation:

  • Coverage > 0.8: Pathway likely complete and functional
  • Coverage 0.5-0.8: Most components present
  • Coverage < 0.5: Incomplete pathway, missing key steps

Important insight: A pathway with high abundance but low coverage may reflect just a few highly abundant genes from an incomplete pathway. Always check coverage!
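The check described above can be sketched as follows. This is an illustration under stated assumptions: the coverage values come from the format example, the abundance of PWY-5484 comes from the pathway abundance example, the other two abundances are hypothetical, and the 100.0 abundance cutoff is an arbitrary choice rather than a HUMAnN3 default.

```python
# Coverage values from the format example above.
coverage = {"PWY-5484": 0.95, "PWY-6731": 0.60, "PWY-7332": 0.30}

# PWY-5484 is taken from the pathway abundance example; the other two
# abundances are HYPOTHETICAL, added only so the check has something to flag.
abundance = {"PWY-5484": 125.5, "PWY-6731": 40.0, "PWY-7332": 200.0}


def coverage_class(value):
    """Map a coverage score (0-1) to the three bands used in this tutorial."""
    if value > 0.8:
        return "likely complete"
    if value >= 0.5:
        return "most components present"
    return "incomplete"


def abundant_but_incomplete(abundance, coverage, min_abundance=100.0):
    """Return pathways with abundance >= min_abundance and coverage < 0.5."""
    return sorted(
        pathway
        for pathway, value in abundance.items()
        if value >= min_abundance and coverage.get(pathway, 0.0) < 0.5
    )


classes = {pathway: coverage_class(c) for pathway, c in coverage.items()}
flagged = abundant_but_incomplete(abundance, coverage)
print(classes)
print("Check these pathways:", flagged)
```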


Stratified vs Unstratified Outputs

One of HUMAnN3’s most powerful features is stratification - revealing which species contribute to each function.

Understanding the Format

Unstratified lines have no | symbol and represent community totals:

PWY-5484    125.5

This shows the total abundance of this pathway across all organisms.

Stratified lines contain | separating pathway from species:

PWY-5484|g__Lactobacillus.s__Lactobacillus_plantarum    100.2
PWY-5484|g__Leuconostoc.s__Leuconostoc_mesenteroides    25.3

This shows which organisms carry this pathway and how much each contributes.
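As a small worked example, each organism's share of the pathway can be computed from the numbers above. This is only a sketch of the arithmetic. In real pathway tables the stratified rows may also include an unclassified stratum and need not sum exactly to the community total.

```python
# Values copied from the example lines above.
community_total = 125.5
stratified = {
    "g__Lactobacillus.s__Lactobacillus_plantarum": 100.2,
    "g__Leuconostoc.s__Leuconostoc_mesenteroides": 25.3,
}

# Fraction of the community total attributed to each organism.
shares = {taxon: value / community_total for taxon, value in stratified.items()}

# Print the largest contributor first.
for taxon, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{taxon}\t{share:.0%}")
```

With these values, Lactobacillus plantarum accounts for roughly 80% of the pathway and Leuconostoc mesenteroides for roughly 20%.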

Why Stratification Matters

Stratification reveals important ecological patterns:

Functional redundancy - Multiple species carrying the same function provides resilience. If one species declines, others maintain the function.

Functional specialization - A function concentrated in one species creates vulnerability. Loss of that species means loss of function.

Temporal dynamics - During fermentation or succession, dominant contributors may change even if total pathway abundance stays constant.

Mechanistic understanding - Knowing which organisms perform which functions enables targeted manipulation or prediction of community behavior.
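One way to put a number on redundancy versus specialization is the effective number of contributors (the inverse Simpson index of the per-species shares). This metric is an assumption chosen for illustration. It is not something HUMAnN3 reports, and the function name is hypothetical.

```python
def effective_contributors(values):
    """Inverse Simpson index of per-species contributions to one function.

    Returns 1.0 when a single species carries the function and N when
    N species contribute equally. Returns 0.0 for an all-zero input.
    """
    total = sum(values)
    if total <= 0:
        return 0.0
    shares = [value / total for value in values]
    return 1.0 / sum(share * share for share in shares)


# Stratified values from the PWY-5484 example: one dominant contributor.
print(effective_contributors([100.2, 25.3]))

# Two equal contributors: fully redundant between two species.
print(effective_contributors([50.0, 50.0]))
```

Values close to 1 indicate specialization, while larger values indicate that the function is spread across several species.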

Example: Lactate Fermentation in Coffee

During coffee fermentation, lactic acid bacteria produce lactate, contributing to flavor. Stratification might reveal:

Early fermentation (T0-T16): Leuconostoc species dominate lactate production
Late fermentation (T24-T48): Lactobacillus species take over

The total lactate fermentation pathway abundance might stay constant, but the contributors change. This is only visible through stratification!
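The pattern described above can be illustrated with a toy table. All numbers below are made up to mirror the narrative. They are not taken from the workshop dataset, and the genus-level keys are simplified labels.

```python
# HYPOTHETICAL stratified abundances for one lactate-related pathway.
# Invented for illustration only; not results from the coffee dataset.
timepoints = {
    "T0": {"Leuconostoc": 90.0, "Lactobacillus": 10.0},
    "T16": {"Leuconostoc": 70.0, "Lactobacillus": 30.0},
    "T24": {"Leuconostoc": 35.0, "Lactobacillus": 65.0},
    "T48": {"Leuconostoc": 15.0, "Lactobacillus": 85.0},
}

summary = {}
for timepoint, strata in timepoints.items():
    total = sum(strata.values())            # sum over contributors
    dominant = max(strata, key=strata.get)  # largest contributor
    summary[timepoint] = (total, dominant)
    print(f"{timepoint}\ttotal={total:.1f}\tdominant={dominant}")
```

In this toy table the total is the same at every time point, so an unstratified view would show no change, while the dominant contributor switches between T16 and T24.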


Hands-On: Exploring Coffee Fermentation Functional Profiles

Prerequisites

Make sure you have:

  • Access to the CLIMB shared project
  • RStudio access on the notebook
  • Data symlinked / copied to your workspace:

export WSUSER=/shared/team/users/{your_name}/
cd $WSUSER

# Link to precomputed HUMAnN3 results
ln -s /shared/team/2025_training/week5/tutorial/Session3_Functional/Single_sample .

ls -lh func_precomputed/humann3/

The Dataset

Coffee bean fermentation time series from Ecuador:

  • T0 (0 hours): Initial colonization, diverse community
  • T16 (16 hours): Early fermentation, Leuconostoc emergence
  • T24 (24 hours): Active fermentation, lactic acid bacteria dominance
  • T48 (48 hours): Late fermentation, mature community

Biological question: How does microbial metabolism shift during fermentation? Which pathways increase or decrease? Do different species contribute at different stages?


Exploring HUMAnN3 Results in R

Open RStudio and Load the Script

Link to R Markdown: HUMAnN3 Pathway Exploration

Copy the script to your workspace:

cd $WSUSER
cp /shared/team/2025_training/week5/tutorial/Session3_Functional/Single_sample/Session3_humann3.Rmd .

Open RStudio and open the Session3_humann3.Rmd file.


Normalization Best Practices

Different analyses require different normalizations. Here’s a guide:

Gene Families

Raw output: RPK (reads per kilobase) - normalized by gene length only

For comparison across samples: Convert to CPM

humann_renorm_table --input genefamilies.tsv --output genefamilies_cpm.tsv --units cpm

For compositional analysis: Convert to relative abundance

humann_renorm_table --input genefamilies.tsv --output genefamilies_relab.tsv --units relab
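The arithmetic behind these two conversions can be sketched in Python. This illustrates the idea only and does not reimplement humann_renorm_table: the function name and the include_special parameter are hypothetical. The real tool has its own options for handling special features such as UNMAPPED and stratified rows (see humann_renorm_table --help).

```python
SPECIAL = {"UNMAPPED", "UNINTEGRATED", "UNGROUPED"}


def renormalize(values, units="cpm", include_special=True):
    """Rescale unstratified abundances so they sum to 1e6 (cpm) or 1 (relab)."""
    scale = {"cpm": 1_000_000.0, "relab": 1.0}[units]
    kept = {
        feature: value
        for feature, value in values.items()
        if include_special or feature not in SPECIAL
    }
    total = sum(kept.values())
    if total == 0:
        return {feature: 0.0 for feature in kept}
    return {feature: value / total * scale for feature, value in kept.items()}


# Unstratified RPK values from the gene families format example.
rpk = {"UniRef90_A0A123": 45.2, "UniRef90_A0A456": 15.8, "UNMAPPED": 2500.0}

cpm = renormalize(rpk, units="cpm")
relab = renormalize(rpk, units="relab", include_special=False)
print(cpm)
print(relab)
```

Whether UNMAPPED is included in the total changes every normalized value, so record that choice along with the units.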

Pathway Abundances

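Raw output: RPK-derived units by default, like gene families. Convert to CPM (copies per million) with humann_renorm_table before comparing samples, unless the tables you were given are already normalized.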

For compositional analysis: Convert to relative abundance if needed

For visualization: Often log-transform for better dynamic range (add pseudocount first)
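A minimal sketch of the log transform follows. The pseudocount of 1.0 is an assumption: choose a value that is small relative to your data. The pathway ID PWY-ABSENT is a hypothetical placeholder for a pathway that was not detected.

```python
import math


def log10_with_pseudocount(values, pseudocount=1.0):
    """Return log10(value + pseudocount) for every feature.

    The pseudocount keeps zero abundances finite, since log10(0) is undefined.
    """
    return {
        feature: math.log10(value + pseudocount)
        for feature, value in values.items()
    }


# PWY-5484 comes from the format example; PWY-ABSENT is HYPOTHETICAL.
abundances = {"PWY-5484": 125.5, "PWY-ABSENT": 0.0}
logged = log10_with_pseudocount(abundances)
print(logged)
```

With a pseudocount of 1.0, an abundance of zero maps to exactly 0 on the log scale.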

General Rules

Comparing same feature across samples: CPM normalization

Comparing different features within a sample: Relative abundance (sums to 1)

Statistical testing: Depends on the tool - DESeq2 expects raw counts, while ALDEx2 works on CLR-transformed data

Always document: State clearly which normalization you used in methods and figure legends!


Functional Database Regrouping

HUMAnN3 outputs use UniRef90 IDs by default, but you can regroup to other functional classification systems.

Available Regrouping Options

GO (Gene Ontology) - Biological process, molecular function, cellular component categories. Useful for broad functional categories.

EC (Enzyme Commission) - Enzyme classification numbers. Good for enzyme-centric analyses.

KEGG (Kyoto Encyclopedia of Genes and Genomes) - KEGG Orthology groups and KEGG pathways. Widely used, good for integration with other databases.

Pfam - Protein family classifications. Useful for domain-level analyses.

eggNOG - Evolutionary genealogy classifications. Good for phylogenetic context.

How to Regroup

# Regroup to EC numbers (level-4 EC)
humann_regroup_table --input genefamilies.tsv --output genefamilies_ec.tsv --groups uniref90_level4ec

# Regroup to Gene Ontology
humann_regroup_table --input genefamilies.tsv --output genefamilies_go.tsv --groups uniref90_go

# Regroup to KEGG
humann_regroup_table --input genefamilies.tsv --output genefamilies_ko.tsv --groups uniref90_ko
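Conceptually, regrouping sums the abundances of all gene families that map to the same functional group. The sketch below shows that idea with a made-up mapping. The group IDs GROUP_A and GROUP_B, the family UniRef90_NOMAP, and the function name are hypothetical, and the real mapping files are supplied with HUMAnN3 rather than written by hand.

```python
from collections import defaultdict


def regroup(abundances, mapping):
    """Sum gene-family abundances per functional group.

    mapping: gene family -> list of group IDs. Families without a
    mapping are collected under "UNGROUPED".
    """
    grouped = defaultdict(float)
    for family, value in abundances.items():
        groups = mapping.get(family)
        if not groups:
            grouped["UNGROUPED"] += value
            continue
        for group in groups:
            # A family that maps to several groups counts toward each one.
            grouped[group] += value
    return dict(grouped)


# First two families are from the format example; the third is HYPOTHETICAL.
families = {"UniRef90_A0A123": 45.2, "UniRef90_A0A456": 15.8, "UniRef90_NOMAP": 3.0}

# HYPOTHETICAL mapping, for illustration only.
mapping = {
    "UniRef90_A0A123": ["GROUP_A"],
    "UniRef90_A0A456": ["GROUP_A", "GROUP_B"],
}

regrouped = regroup(families, mapping)
print(regrouped)
```

Because one family can belong to several groups, the regrouped totals do not have to match the sum of the original table.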

When to Use Each

Use GO for high-level functional categories and comparing broad functional classes.

Use EC for enzyme-focused questions and connecting to metabolomics data.

Use KEGG for pathway integration and connecting to other omics data.

Use multiple if you want different perspectives on the same data - they’re complementary!


Resources

HUMAnN3 Documentation: https://huttenhower.sph.harvard.edu/humann
HUMAnN3 Tutorial: https://github.com/biobakery/biobakery/wiki/humann3
bioBakery Forum: https://forum.biobakery.org/
MetaCyc Database: https://metacyc.org/
Visualization: HUMAnN3 includes the humann_barplot utility for plotting stratified feature abundances