In this section, you’ll explore pre-computed HUMAnN3 results from a coffee bean fermentation time series. You’ll learn to:
The workshop material is available for download from Figshare.
HUMAnN3 produces three main output files per sample. Each file provides different but complementary information about the functional profile.
The genefamilies.tsv file contains abundances of gene families (UniRef90 IDs) detected in your sample.
Units: RPK (Reads Per Kilobase) - normalized by gene length to account for the fact that longer genes naturally get more reads.
Format:
# Gene Family RPK_Abundance
UniRef90_A0A123 45.2
UniRef90_A0A123|g__Lactobacillus.s__Lactobacillus_plantarum 40.1
UniRef90_A0A456 15.8
UNMAPPED 2500.0
Stratification: Lines without | are unstratified (community total). Lines with | show which species contribute to that gene family.
UNMAPPED: Reads that didn’t match any gene in the database. This is normal - metagenomic samples contain novel sequences not in reference databases.
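The split between unstratified and stratified rows can be sketched in a few lines of Python. This is a minimal illustration rather than part of HUMAnN3: the function name `split_stratified` is hypothetical, the rows are the example table above, and a tab-separated layout is assumed.

```python
def split_stratified(lines):
    """Split HUMAnN-style rows into community totals and per-species rows.

    Assumes tab-separated 'feature<TAB>abundance' rows; lines starting
    with '#' are treated as headers and skipped.
    """
    totals, by_species = {}, {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        feature, value = line.rsplit("\t", 1)
        if "|" in feature:
            # Stratified row: 'feature|taxon'
            name, taxon = feature.split("|", 1)
            by_species.setdefault(name, {})[taxon] = float(value)
        else:
            # Unstratified row: community total (or UNMAPPED)
            totals[feature] = float(value)
    return totals, by_species


example = [
    "# Gene Family\tRPK_Abundance",
    "UniRef90_A0A123\t45.2",
    "UniRef90_A0A123|g__Lactobacillus.s__Lactobacillus_plantarum\t40.1",
    "UniRef90_A0A456\t15.8",
    "UNMAPPED\t2500.0",
]
totals, by_species = split_stratified(example)
print(totals)
print(by_species)
```

The same function should work on pathabundance rows, since they follow the same `feature|taxon` convention.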
The pathabundance.tsv file contains abundances of metabolic pathways from the MetaCyc database.
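Units: HUMAnN3 reports raw pathway abundance in RPK-derived units (computed from gene family RPKs), not CPM. The CPM (Copies Per Million) values shown here are what you get after renormalizing with humann_renorm_table, which corrects for sequencing depth so that samples can be compared fairly.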
How it’s calculated: HUMAnN3 maps gene families to pathways using MetaCyc definitions. If a pathway requires genes A, B, and C, and all are detected, the pathway abundance is calculated from the gene abundances.
Format:
# Pathway CPM_Abundance
PWY-5484 125.5
PWY-5484|g__Lactobacillus 100.2
PWY-5484|g__Leuconostoc 25.3
UNINTEGRATED 450.0
UNINTEGRATED: Genes detected but not part of any known pathway. This is normal - not all genes belong to characterized pathways.
The pathcoverage.tsv file shows what proportion of each pathway is present.
Units: Coverage (0-1) representing the fraction of pathway steps detected.
Purpose: Distinguishes complete pathways from incomplete ones. High abundance doesn’t always mean a complete pathway!
Format:
# Pathway Coverage
PWY-5484 0.95
PWY-6731 0.60
PWY-7332 0.30
Interpretation:
Important insight: A pathway with high abundance but low coverage may reflect just a few highly abundant genes from an otherwise incomplete pathway. (These are metagenomic data, so the values measure gene abundance, not expression.) Always check coverage!
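One way to act on this is to cross-check the two tables and flag pathways that are abundant but poorly covered. The sketch below is illustrative only: `flag_incomplete` and its thresholds are hypothetical choices, the coverage values come from the example above, and the abundances for PWY-6731 and PWY-7332 are made up for the demonstration.

```python
def flag_incomplete(abundance, coverage, min_abundance=100.0, max_coverage=0.5):
    """Return pathways that are abundant but poorly covered, sorted by name.

    Both thresholds are arbitrary illustration values, not HUMAnN3 defaults.
    Pathways missing from the coverage table are treated as coverage 0.
    """
    return sorted(
        pwy
        for pwy, value in abundance.items()
        if value >= min_abundance and coverage.get(pwy, 0.0) <= max_coverage
    )


# PWY-5484 abundance is from the example table; the other two are invented.
abundance = {"PWY-5484": 125.5, "PWY-6731": 40.0, "PWY-7332": 300.0}
coverage = {"PWY-5484": 0.95, "PWY-6731": 0.60, "PWY-7332": 0.30}

flagged = flag_incomplete(abundance, coverage)
print(flagged)
```

Pathways returned by this check are candidates for closer inspection, not automatic exclusions.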
One of HUMAnN3’s most powerful features is stratification - revealing which species contribute to each function.
Unstratified lines have no | symbol and represent community totals:
PWY-5484 125.5
This shows the total abundance of this pathway across all organisms.
Stratified lines contain | separating pathway from species:
PWY-5484|g__Lactobacillus.s__Lactobacillus_plantarum 100.2
PWY-5484|g__Leuconostoc.s__Leuconostoc_mesenteroides 25.3
This shows which organisms carry this pathway and how much each contributes.
Stratification reveals important ecological patterns:
Functional redundancy - Multiple species carrying the same function provides resilience. If one species declines, others maintain the function.
Functional specialization - A function concentrated in one species creates vulnerability. Loss of that species means loss of function.
Temporal dynamics - During fermentation or succession, dominant contributors may change even if total pathway abundance stays constant.
Mechanistic understanding - Knowing which organisms perform which functions enables targeted manipulation or prediction of community behavior.
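Redundancy versus specialization can be quantified roughly as each species' share of a pathway. The following is a small sketch under stated assumptions: `contribution_shares` is a hypothetical helper, the numbers are the PWY-5484 example rows above, and shares are computed relative to the sum of the stratified rows rather than the community total.

```python
def contribution_shares(stratified):
    """Convert per-species abundances for one pathway into fractions of their sum."""
    total = sum(stratified.values())
    if total == 0:
        return {taxon: 0.0 for taxon in stratified}
    return {taxon: value / total for taxon, value in stratified.items()}


pwy_5484 = {
    "g__Lactobacillus.s__Lactobacillus_plantarum": 100.2,
    "g__Leuconostoc.s__Leuconostoc_mesenteroides": 25.3,
}
shares = contribution_shares(pwy_5484)
dominant = max(shares, key=shares.get)

for taxon, share in shares.items():
    print(f"{taxon}\t{share:.1%}")
print("dominant contributor:", dominant)
```

A pathway whose dominant share is close to 1 looks specialized; several species with comparable shares suggest redundancy.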
During coffee fermentation, lactic acid bacteria produce lactate, contributing to flavor. Stratification might reveal:
Early fermentation (T0-T16): Leuconostoc species dominate lactate production
Late fermentation (T24-T48): Lactobacillus species take over
The total lactate fermentation pathway abundance might stay constant, but the contributors change. This is only visible through stratification!
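The contributor turnover described here can be made concrete with a toy calculation. All numbers below are hypothetical and chosen only so that the total stays constant while the dominant genus changes; they are not results from the coffee dataset, and `dominant_by_timepoint` is an illustrative helper name.

```python
def dominant_by_timepoint(series):
    """Map each timepoint to the taxon with the highest stratified abundance."""
    return {tp: max(taxa, key=taxa.get) for tp, taxa in series.items()}


# Hypothetical stratified abundances for one lactate fermentation pathway
series = {
    "T0": {"g__Leuconostoc": 100.0, "g__Lactobacillus": 25.0},
    "T16": {"g__Leuconostoc": 80.0, "g__Lactobacillus": 45.0},
    "T24": {"g__Leuconostoc": 40.0, "g__Lactobacillus": 85.0},
    "T48": {"g__Leuconostoc": 20.0, "g__Lactobacillus": 105.0},
}

totals_by_tp = {tp: sum(taxa.values()) for tp, taxa in series.items()}
dominant_tp = dominant_by_timepoint(series)

for tp in series:
    print(tp, totals_by_tp[tp], dominant_tp[tp])
```

In this toy series the unstratified total never changes, so the shift would be invisible without the stratified rows.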
Make sure you have:
export WSUSER=/shared/team/users/{your_name}/
cd $WSUSER
# Link to precomputed HUMAnN3 results
ln -s /shared/team/2025_training/week5/tutorial/Session3_Functional/Single_sample .
ls -lh func_precomputed/humann3/
Coffee bean fermentation time series from Ecuador:
Biological question: How does microbial metabolism shift during fermentation? Which pathways increase or decrease? Do different species contribute at different stages?
Link to R Markdown: HUMAnN3 Pathway Exploration
Copy the script to your workspace:
cd $WSUSER
cp /shared/team/2025_training/week5/tutorial/Session3_Functional/Single_sample/Session3_humann3.Rmd .
Open RStudio and open the Session3_humann3.Rmd file.
Different analyses require different normalizations. Here’s a guide:
Gene families, raw output: RPK (reads per kilobase) - normalized by gene length only, not by sequencing depth
For comparison across samples: Convert to CPM
humann_renorm_table --input genefamilies.tsv --output genefamilies_cpm.tsv --units cpm
For compositional analysis: Convert to relative abundance
humann_renorm_table --input genefamilies.tsv --output genefamilies_relab.tsv --units relab
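The arithmetic behind both conversions is a division by the sample total, scaled to one million for CPM. The sketch below is a simplified illustration, not the humann_renorm_table implementation: it handles only unstratified rows, keeps UNMAPPED in the total (an assumption made here for simplicity), and ignores the tool's other options.

```python
def renormalize(totals, units="cpm"):
    """Rescale unstratified abundances so they sum to 1e6 (cpm) or 1 (relab)."""
    scale = {"cpm": 1_000_000.0, "relab": 1.0}[units]
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("cannot renormalize an all-zero sample")
    return {feature: value / grand_total * scale for feature, value in totals.items()}


# Unstratified rows from the genefamilies example above (RPK)
rpk = {"UniRef90_A0A123": 45.2, "UniRef90_A0A456": 15.8, "UNMAPPED": 2500.0}

cpm = renormalize(rpk, units="cpm")
relab = renormalize(rpk, units="relab")

for feature in rpk:
    print(f"{feature}\t{cpm[feature]:.1f} CPM\t{relab[feature]:.4f}")
```

Because the total includes UNMAPPED, a sample with many unmapped reads ends up with proportionally smaller values for every mapped feature.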
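Pathway abundances, raw output: RPK-derived units, like gene families, and not depth-normalized by default. For comparison across samples, convert to CPM with humann_renorm_table in the same way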
For compositional analysis: Convert to relative abundance if needed
For visualization: Often log-transform for better dynamic range (add pseudocount first)
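The pseudocount matters because the logarithm of zero is undefined, and abundance tables usually contain zeros. A minimal sketch follows, assuming a pseudocount of 1 (a common but arbitrary choice); the zero abundance for PWY-6731 is invented to show the effect.

```python
import math


def log10_transform(values, pseudocount=1.0):
    """Return log10(value + pseudocount) for each feature."""
    return {feature: math.log10(value + pseudocount) for feature, value in values.items()}


# PWY-5484 is from the example table; the zero for PWY-6731 is hypothetical.
cpm_values = {"PWY-5484": 125.5, "PWY-6731": 0.0}
logged = log10_transform(cpm_values)
print(logged)
```

With a pseudocount of 1, absent features map to exactly 0 on the log scale, which keeps plots easy to read.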
Comparing same feature across samples: CPM normalization
Comparing different features within a sample: Relative abundance (sums to 1)
Statistical testing: Depends on the tool - DESeq2 expects raw, unnormalized counts, whereas ALDEx2 works on CLR-transformed (centered log-ratio) values
Always document: State clearly which normalization you used in methods and figure legends!
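For readers unfamiliar with the CLR transform mentioned above, it subtracts the mean log abundance of a sample from each feature's log abundance. The sketch below shows only that plain transform with a pseudocount of 0.5 (an assumption); it does not reproduce ALDEx2's own procedure, and the input values are illustrative.

```python
import math


def clr(values, pseudocount=0.5):
    """Centered log-ratio: log(value) minus the mean of the logs in the sample."""
    logs = {feature: math.log(value + pseudocount) for feature, value in values.items()}
    mean_log = sum(logs.values()) / len(logs)
    return {feature: log_value - mean_log for feature, log_value in logs.items()}


# Illustrative abundances for three features in one sample
sample = {"PWY-5484": 125.5, "PWY-6731": 40.0, "PWY-7332": 0.0}
clr_values = clr(sample)
print(clr_values)
```

CLR values within a sample sum to zero, so each value reads as "above or below the sample's geometric mean".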
HUMAnN3 outputs use UniRef90 IDs by default, but you can regroup to other functional classification systems.
GO (Gene Ontology) - Biological process, molecular function, cellular component categories. Useful for broad functional categories.
EC (Enzyme Commission) - Enzyme classification numbers. Good for enzyme-centric analyses.
KEGG (Kyoto Encyclopedia of Genes and Genomes) - KEGG Orthology groups and KEGG pathways. Widely used, good for integration with other databases.
Pfam - Protein family classifications. Useful for domain-level analyses.
eggNOG - Evolutionary genealogy classifications. Good for phylogenetic context.
# Regroup to EC numbers (level-4 EC; requires the utility_mapping database)
humann_regroup_table --input genefamilies.tsv --output genefamilies_ec.tsv --groups uniref90_level4ec
# Regroup to Gene Ontology
humann_regroup_table --input genefamilies.tsv --output genefamilies_go.tsv --groups uniref90_go
# Regroup to KEGG
humann_regroup_table --input genefamilies.tsv --output genefamilies_ko.tsv --groups uniref90_ko
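Conceptually, regrouping sums the abundances of all gene families that map to the same target group. The sketch below illustrates that idea only and is not the humann_regroup_table code: the UniRef90-to-EC mapping is invented, UniRef90_A0A789 is a hypothetical extra gene family, and the handling of unmapped families (collected under UNGROUPED) is written here as an assumption about the general behavior.

```python
from collections import defaultdict


def regroup(totals, mapping):
    """Sum gene family abundances into target groups.

    Families with no entry in `mapping` are collected under 'UNGROUPED';
    'UNMAPPED' is passed through unchanged.
    """
    grouped = defaultdict(float)
    for feature, value in totals.items():
        if feature == "UNMAPPED":
            grouped["UNMAPPED"] += value
            continue
        for group in mapping.get(feature) or ["UNGROUPED"]:
            grouped[group] += value
    return dict(grouped)


rpk_totals = {
    "UniRef90_A0A123": 45.2,
    "UniRef90_A0A456": 15.8,
    "UniRef90_A0A789": 4.0,  # hypothetical family with no EC annotation
    "UNMAPPED": 2500.0,
}
# Hypothetical mapping: both families assigned to EC 1.1.1.27
uniref_to_ec = {"UniRef90_A0A123": ["1.1.1.27"], "UniRef90_A0A456": ["1.1.1.27"]}

ec_table = regroup(rpk_totals, uniref_to_ec)
print(ec_table)
```

Note that a family mapped to several groups contributes its full abundance to each of them in this sketch, so regrouped totals need not match the original total.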
Use GO for high-level functional categories and comparing broad functional classes.
Use EC for enzyme-focused questions and connecting to metabolomics data.
Use KEGG for pathway integration and connecting to other omics data.
Use multiple if you want different perspectives on the same data - they’re complementary!
– HUMAnN3 Documentation: https://huttenhower.sph.harvard.edu/humann
– HUMAnN3 Tutorial: https://github.com/biobakery/biobakery/wiki/humann3
– bioBakery Forum: https://forum.biobakery.org/
– MetaCyc Database: https://metacyc.org/
– Visualization: HUMAnN3 ships with humann_barplot for plotting stratified feature abundances