Kraken2 & Bracken: K-mer Based Taxonomic Classification

The Workshop materia;ls available to download Figshare

Theory Overview

What is Kraken2?

Kraken2 is an ultrafast taxonomic classification system that uses exact k-mer matching to assign taxonomic labels to metagenomic DNA sequences.

Key Features:

K-mer based classification (exact matching of short DNA sequences)
Extremely fast - can process millions of reads per minute
High sensitivity - detects more taxa than marker-based methods
Reports read counts at all taxonomic levels
Uses LCA (Lowest Common Ancestor) algorithm for ambiguous k-mers

How Does Kraken2 Work?

Database Construction
- K-mers (default: 35bp) extracted from reference genomes
- Each k-mer mapped to the lowest taxonomic level where it appears
- Database contains bacteria, archaea, viruses, plasmids, and optionally fungi, protozoa, plants
Classification Step
- Each read broken into k-mers
- K-mers queried against database
- LCA algorithm determines the most specific taxonomic assignment
- Confidence score calculated based on k-mer agreement
Output
- Per-read classifications (which taxon each read belongs to)
- Summary report with read counts at each taxonomic level

Kraken2 Modes

Standard mode: Examines all k-mers in a read
Quick mode (--quick): Stops after first k-mer hit - much faster, slightly less accurate

Advantages

✅ Very fast - Processes samples in minutes
✅ High sensitivity - Detects more taxa, including rare ones
✅ Low unclassified rate - Classifies more reads than MetaPhlAn4
✅ Comprehensive - Reports all taxonomic levels
✅ Flexible databases - Many pre-built options available

Limitations

⚠️ Higher false positive rate - Especially for low-abundance taxa
⚠️ Reports read counts - Not normalized for genome size
⚠️ High memory usage - Full databases require a good amount RAM (to run on the HPC) ⚠️ Many reads at higher levels - Reads classified to genus/family rather than species

What is Bracken?

The Problem Kraken2 Doesn’t Solve

Kraken2 is very good at classifying reads, but has a problem:

Many reads are classified at higher taxonomic levels (genus, family, order)
This happens when k-mers are shared across multiple species
We want species-level abundances for most analyses

Bracken’s Solution

Bracken (Bayesian Reestimation of Abundance with KrakEN) uses a probabilistic approach to:

Redistribute reads from higher levels down to species
Use expected k-mer distributions to estimate the true species of origin
Produce more accurate species-level abundance estimates

How Bracken Works

Pre-computation (already done for workshop databases)
- For each species in database, simulate reads of length X (e.g., 150bp)
- Classify simulated reads with Kraken2
- Calculate: “What % of this species’ reads get classified to genus vs species?”
Re-estimation (what you run)
- Takes Kraken2 report as input
- Uses Bayesian inference to redistribute higher-level reads
- Filters out low-confidence species (below threshold)
- Outputs species-level abundance estimates

Bracken Advantages

✅ Better species-level estimates than raw Kraken2
✅ Reduces false positives through thresholding
✅ Fast - runs in seconds
✅ Conservative - filters out poorly supported assignments

Important Note

⚠️ Bracken still reports READ COUNTS, not genome-size-normalized abundances
⚠️ For relative abundances, you need to convert: (reads / total_reads) × 100

When to Use Kraken2 + Bracken?

When you need comprehensive detection of all taxa
For novel or diverse environments where sensitivity matters
When you want to minimize unclassified reads
For rapid screening of many samples
When combined with Bracken, provides balanced specificity and sensitivity

Hands-On: Running Kraken2 and Bracken

Prerequisites

Make sure you have:

Access to the workshop notebook
Data symlinked/copied to your workspace
Kraken2 conda environment available

conda deactivate

conda activate /shared/team/conda/aliseponsero.mmb-dtp/kraken2

RStudio access on the notebook

Part 1: Understanding Kraken2 Databases

Database Organization

Kraken2 requires a pre-built database. For this workshop, we’re using the standard_8gb database located at:

/shared/team/2025_training/week5/databases/kraken2/

Key files in the database:

hash.k2d - K-mer hash table (the main database file)
opts.k2d - Options and parameters used to build the database
taxo.k2d - Taxonomy information
database*mers.kmer_distrib - Bracken k-mer distribution files (for different read lengths)

To explore:

ls -lh /shared/team/2025_training/week5/databases/kraken2/

Available Pre-built Databases

Kraken2 offers many pre-built database options (see https://benlangmead.github.io/aws-indexes/k2):

Standard databases:

k2_standard - Bacteria, archaea, viruses, plasmid, human (~137 GB)
k2_standard_8gb - Reduced standard (~8 GB) ← We’re using this
k2_standard_16gb - Reduced standard (~16 GB)

PlusPF series:

k2_pluspf - Standard + protozoa + fungi
k2_pluspf_8gb - Reduced PlusPF

💡 Database choice matters! Larger = more sensitivity, but slower and more memory.

Part 2: Running Kraken2

Step 1: Activate the Environment

conda deactivate  # If you were in metaphlan environment
conda activate /shared/team/conda/aliseponsero.mmb-dtp/kraken2

Step 2: Check Installation

kraken2 --version

Step 3: View Help

kraken2 --help

Key parameters to understand:

--db - Path to Kraken2 database
--threads - Number of threads to use
--quick - Quick mode (faster, slightly less sensitive)
--report - Output report file (summary - this is what we analyze!)
--output - Output classification file (per-read classifications - large!)
--paired - Input files are paired-end reads

Step 4: Construct the Command

Template:

kraken2 --db [DATABASE_PATH] \
    --threads [N] \
    --report [REPORT_FILE] \
    --output [OUTPUT_FILE] \
    --paired [FORWARD] [REVERSE]

💡 Tip: The --output file is very large (one line per read). For most analyses, you only need the --report file.

This would normally take 5-15 minutes, but we’ll look at precomputed results.

Part 3: Understanding Kraken2 Output

Kraken2 Report Format

Open a Kraken precomputed output (e.g. T0_ERR2231567_profiles.txt)

The report file is tab-delimited with 6 columns:

%_READS  READS_CLADE  READS_TAXON  RANK  TAXID  NAME
5.43     1087         512          S     1234   Leuconostoc mesenteroides

Columns:

%_READS - Percentage of reads in this clade (including children)
READS_CLADE - Number of reads in this clade (including descendants)
READS_TAXON - Number of reads assigned directly to this taxon
RANK - Taxonomic rank:
- U = Unclassified
- R = Root
- D = Domain
- P = Phylum
- C = Class
- O = Order
- F = Family
- G = Genus
- S = Species
TAXID - NCBI taxonomy ID
NAME - Scientific name (indented to show hierarchy)

First line = Unclassified reads

5.21   10420   10420   U   0   unclassified

Key Differences from MetaPhlAn4

Feature	MetaPhlAn4	Kraken2
Output	Relative abundances (%)	Read counts
Unclassified	Usually high (30-70%)	Usually low (5-15%)
Species detected	Conservative (~40-50)	Liberal (~100+)
False positives	Low	Higher (especially at low abundance)

Part 4: Running Bracken

Step 1: Check Bracken Installation

bracken -v

Step 2: Check Available Read Lengths

Bracken databases are built for specific read lengths. Check what’s available:

ls /shared/team/2025_training/week5/databases/kraken2/database*mers.kmer_distrib

You should see files like:

database100mers.kmer_distrib (for 100bp reads)
database150mers.kmer_distrib (for 150bp reads)
database200mers.kmer_distrib (for 200bp reads)

Step 3: Understand Bracken Parameters

bracken -h

Key parameters:

-d - Path to Kraken2 database directory (same as Kraken2 --db)
-i - Input Kraken2 report file
-o - Output Bracken abundance file
-w - Output Kraken-style report with redistributed reads
-r - Read length used to build database (e.g., 150)
-l - Taxonomic level (S=species, G=genus, F=family, etc.)
-t - Threshold: minimum reads required (default: 10)

Step 4: Run Bracken

Template:

bracken -d [DATABASE_PATH] \
    -i [KRAKEN_REPORT] \
    -o [OUTPUT_FILE] \
    -w [BRACKEN_REPORT] \
    -r [READ_LENGTH] \
    -l S \
    -t 10

Example command:

mkdir test_bracken

bracken -d /shared/team/2025_training/week5/databases/kraken2 \
    -i Session1_profiling/Kraken2/T0_ERR2231567_profiles.txt \
    -o test_bracken/subset_bracken_species.txt \
    -w test_bracken/subset_bracken_species.report \
    -r 150 \
    -l S \
    -t 10

Part 5: Understanding Bracken Output

Bracken Abundance File Format

The main output file is tab-delimited with 7 columns:

name                          taxonomy_id  taxonomy_lvl  kraken_assigned_reads  added_reads  new_est_reads  fraction_total_reads
Leuconostoc mesenteroides     1245         S            450                    125          575            0.0287

Columns:

name - Species name
taxonomy_id - NCBI taxonomy ID
taxonomy_lvl - Taxonomic level (usually S for species)
kraken_assigned_reads - Reads Kraken2 assigned directly to this species
added_reads - Reads redistributed from higher levels ← KEY!
new_est_reads - Total estimated reads (kraken + added)
fraction_total_reads - Proportion of all classified reads

The magic is in added_reads! This shows which species gained reads that were originally classified at genus/family level.

Part 6: Exploring Outputs in RStudio

Open RStudio on your notebook and follow along!

Load Required Libraries

library(tidyverse)

Read Bracken Output

# Read the Bracken output
MYDIR="/shared/team/users/{your_name}/"

bracken <- read_tsv(paste(MYDIR, "Session1_profiling/Bracken/T0_ERR2231567_profiles.txt", sep="/"),
                    show_col_types = FALSE)

Compare Bracken to MetaPhlAn4:

# Subset Bracken species
bracken_rel <- bracken %>% filter(fraction_total_reads>0) %>%
    mutate(relative_abundance=fraction_total_reads*100) %>%
    arrange(desc(relative_abundance))

# Get MetaPhlAn4 species
profile_no_uncl <- read_tsv(paste(MYDIR, "Session1_profiling/Metaphlan/T0_ERR2231567.profile.txt", sep="/"), 
                    skip = 4,
                    show_col_types = FALSE) %>%
        rename("clade_name"=`#clade_name`)
        
metaphlan_species <- profile_no_uncl %>%
  filter(str_detect(clade_name, "s__"),
         !str_detect(clade_name, "t__")) %>%
  mutate(species_name = str_extract(clade_name, "s__[^|]+$")) %>%
  mutate(species_name = str_remove(species_name, "s__")) %>%
  mutate(species_name = str_replace_all(species_name, "_", " ")) %>%
  select(species_name, relative_abundance) %>%
  arrange(desc(relative_abundance))

# Compare species counts
cat("\n=== Method Comparison ===\n")
cat("MetaPhlAn4 species detected:", nrow(metaphlan_species), "\n")
cat("Bracken species detected:", nrow(bracken), "\n")

# Show top 10 from each method side-by-side
cat("\nTop 10 species comparison:\n")
comparison <- tibble(
  Rank = 1:10,
  MetaPhlAn4 = metaphlan_species$species_name[1:10],
  `MetaPhlAn4_%` = round(metaphlan_species$relative_abundance[1:10]),
  Bracken = bracken_rel$name[1:10],
  `Bracken_%` = round(bracken_rel$relative_abundance[1:10])
)

print(comparison)

Visualize Bracken Results

bracken_grouped <- bracken_rel %>%
    mutate(species_name=ifelse(relative_abundance<5, "Other", name)) %>%
    group_by(species_name) %>%
    summarize(relative_abundance=sum(relative_abundance)) %>%
    select(species_name, relative_abundance)


bracken_grouped %>%
  mutate(sample="T0") %>%
  ggplot(aes(x = sample, y = relative_abundance, fill=species_name)) +
  geom_bar(stat="identity") +
  coord_flip() +
  labs(title = "Species composition at T0",
       x = "Species",
       y = "Relative Abundance (%)") +
  theme_minimal()

Additional Resources

Kraken2

📖 Kraken2 Manual: https://github.com/DerrickWood/kraken2/wiki/Manual
📖 Pre-built Databases: https://benlangmead.github.io/aws-indexes/k2
📊 Kraken2 Paper: Wood et al. (2019) Genome Biology

Bracken

📖 Bracken GitHub: https://github.com/jenniferlu717/Bracken
📊 Bracken Paper: Lu et al. (2017) PeerJ Computer Science

Database Building

📖 Custom Kraken2 Databases: https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases
📖 KrakenTools: https://github.com/jenniferlu717/KrakenTools (for manipulating Kraken2/Bracken outputs)

Previous submodule:

MetaPhlAn4

Next submodule:

Pavian