The Workshop materia;ls available to download Figshare
Kraken2 is an ultrafast taxonomic classification system that uses exact k-mer matching to assign taxonomic labels to metagenomic DNA sequences.
Key Features:
Standard mode: Examines all k-mers in a read
Quick mode (--quick): Stops after first k-mer hit - much faster, slightly less accurate
✅ Very fast - Processes samples in minutes
✅ High sensitivity - Detects more taxa, including rare ones
✅ Low unclassified rate - Classifies more reads than MetaPhlAn4
✅ Comprehensive - Reports all taxonomic levels
✅ Flexible databases - Many pre-built options available
⚠️ Higher false positive rate - Especially for low-abundance taxa
⚠️ Reports read counts - Not normalized for genome size
⚠️ High memory usage - Full databases require a good amount RAM (to run on the HPC)
⚠️ Many reads at higher levels - Reads classified to genus/family rather than species
Kraken2 is very good at classifying reads, but has a problem:
Bracken (Bayesian Reestimation of Abundance with KrakEN) uses a probabilistic approach to:
✅ Better species-level estimates than raw Kraken2
✅ Reduces false positives through thresholding
✅ Fast - runs in seconds
✅ Conservative - filters out poorly supported assignments
⚠️ Bracken still reports READ COUNTS, not genome-size-normalized abundances
⚠️ For relative abundances, you need to convert: (reads / total_reads) × 100
Make sure you have:
conda deactivate
conda activate /shared/team/conda/aliseponsero.mmb-dtp/kraken2
Kraken2 requires a pre-built database. For this workshop, we’re using the standard_8gb database located at:
/shared/team/2025_training/week5/databases/kraken2/
Key files in the database:
hash.k2d - K-mer hash table (the main database file)opts.k2d - Options and parameters used to build the databasetaxo.k2d - Taxonomy informationdatabase*mers.kmer_distrib - Bracken k-mer distribution files (for different read lengths)To explore:
ls -lh /shared/team/2025_training/week5/databases/kraken2/
Kraken2 offers many pre-built database options (see https://benlangmead.github.io/aws-indexes/k2):
Standard databases:
k2_standard - Bacteria, archaea, viruses, plasmid, human (~137 GB)k2_standard_8gb - Reduced standard (~8 GB) ← We’re using this
k2_standard_16gb - Reduced standard (~16 GB)PlusPF series:
k2_pluspf - Standard + protozoa + fungik2_pluspf_8gb - Reduced PlusPF💡 Database choice matters! Larger = more sensitivity, but slower and more memory.
conda deactivate # If you were in metaphlan environment
conda activate /shared/team/conda/aliseponsero.mmb-dtp/kraken2
kraken2 --version
kraken2 --help
Key parameters to understand:
--db - Path to Kraken2 database--threads - Number of threads to use--quick - Quick mode (faster, slightly less sensitive)--report - Output report file (summary - this is what we analyze!)--output - Output classification file (per-read classifications - large!)--paired - Input files are paired-end readsTemplate:
kraken2 --db [DATABASE_PATH] \
--threads [N] \
--report [REPORT_FILE] \
--output [OUTPUT_FILE] \
--paired [FORWARD] [REVERSE]
💡 Tip: The --output file is very large (one line per read). For most analyses, you only need the --report file.
This would normally take 5-15 minutes, but we’ll look at precomputed results.
Open a Kraken precomputed output (e.g. T0_ERR2231567_profiles.txt)
The report file is tab-delimited with 6 columns:
%_READS READS_CLADE READS_TAXON RANK TAXID NAME
5.43 1087 512 S 1234 Leuconostoc mesenteroides
Columns:
U = UnclassifiedR = RootD = DomainP = PhylumC = ClassO = OrderF = FamilyG = GenusS = SpeciesFirst line = Unclassified reads
5.21 10420 10420 U 0 unclassified
| Feature | MetaPhlAn4 | Kraken2 |
|---|---|---|
| Output | Relative abundances (%) | Read counts |
| Unclassified | Usually high (30-70%) | Usually low (5-15%) |
| Species detected | Conservative (~40-50) | Liberal (~100+) |
| False positives | Low | Higher (especially at low abundance) |
bracken -v
Bracken databases are built for specific read lengths. Check what’s available:
ls /shared/team/2025_training/week5/databases/kraken2/database*mers.kmer_distrib
You should see files like:
database100mers.kmer_distrib (for 100bp reads)database150mers.kmer_distrib (for 150bp reads)database200mers.kmer_distrib (for 200bp reads)bracken -h
Key parameters:
-d - Path to Kraken2 database directory (same as Kraken2 --db)-i - Input Kraken2 report file-o - Output Bracken abundance file-w - Output Kraken-style report with redistributed reads-r - Read length used to build database (e.g., 150)-l - Taxonomic level (S=species, G=genus, F=family, etc.)-t - Threshold: minimum reads required (default: 10)Template:
bracken -d [DATABASE_PATH] \
-i [KRAKEN_REPORT] \
-o [OUTPUT_FILE] \
-w [BRACKEN_REPORT] \
-r [READ_LENGTH] \
-l S \
-t 10
Example command:
mkdir test_bracken
bracken -d /shared/team/2025_training/week5/databases/kraken2 \
-i Session1_profiling/Kraken2/T0_ERR2231567_profiles.txt \
-o test_bracken/subset_bracken_species.txt \
-w test_bracken/subset_bracken_species.report \
-r 150 \
-l S \
-t 10
The main output file is tab-delimited with 7 columns:
name taxonomy_id taxonomy_lvl kraken_assigned_reads added_reads new_est_reads fraction_total_reads
Leuconostoc mesenteroides 1245 S 450 125 575 0.0287
Columns:
The magic is in added_reads! This shows which species gained reads that were originally classified at genus/family level.
Open RStudio on your notebook and follow along!
library(tidyverse)
# Read the Bracken output
MYDIR="/shared/team/users/{your_name}/"
bracken <- read_tsv(paste(MYDIR, "Session1_profiling/Bracken/T0_ERR2231567_profiles.txt", sep="/"),
show_col_types = FALSE)
# Subset Bracken species
bracken_rel <- bracken %>% filter(fraction_total_reads>0) %>%
mutate(relative_abundance=fraction_total_reads*100) %>%
arrange(desc(relative_abundance))
# Get MetaPhlAn4 species
profile_no_uncl <- read_tsv(paste(MYDIR, "Session1_profiling/Metaphlan/T0_ERR2231567.profile.txt", sep="/"),
skip = 4,
show_col_types = FALSE) %>%
rename("clade_name"=`#clade_name`)
metaphlan_species <- profile_no_uncl %>%
filter(str_detect(clade_name, "s__"),
!str_detect(clade_name, "t__")) %>%
mutate(species_name = str_extract(clade_name, "s__[^|]+$")) %>%
mutate(species_name = str_remove(species_name, "s__")) %>%
mutate(species_name = str_replace_all(species_name, "_", " ")) %>%
select(species_name, relative_abundance) %>%
arrange(desc(relative_abundance))
# Compare species counts
cat("\n=== Method Comparison ===\n")
cat("MetaPhlAn4 species detected:", nrow(metaphlan_species), "\n")
cat("Bracken species detected:", nrow(bracken), "\n")
# Show top 10 from each method side-by-side
cat("\nTop 10 species comparison:\n")
comparison <- tibble(
Rank = 1:10,
MetaPhlAn4 = metaphlan_species$species_name[1:10],
`MetaPhlAn4_%` = round(metaphlan_species$relative_abundance[1:10]),
Bracken = bracken_rel$name[1:10],
`Bracken_%` = round(bracken_rel$relative_abundance[1:10])
)
print(comparison)
bracken_grouped <- bracken_rel %>%
mutate(species_name=ifelse(relative_abundance<5, "Other", name)) %>%
group_by(species_name) %>%
summarize(relative_abundance=sum(relative_abundance)) %>%
select(species_name, relative_abundance)
bracken_grouped %>%
mutate(sample="T0") %>%
ggplot(aes(x = sample, y = relative_abundance, fill=species_name)) +
geom_bar(stat="identity") +
coord_flip() +
labs(title = "Species composition at T0",
x = "Species",
y = "Relative Abundance (%)") +
theme_minimal()
📖 Kraken2 Manual: https://github.com/DerrickWood/kraken2/wiki/Manual
📖 Pre-built Databases: https://benlangmead.github.io/aws-indexes/k2
📊 Kraken2 Paper: Wood et al. (2019) Genome Biology
📖 Bracken GitHub: https://github.com/jenniferlu717/Bracken
📊 Bracken Paper: Lu et al. (2017) PeerJ Computer Science
📖 Custom Kraken2 Databases: https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases
📖 KrakenTools: https://github.com/jenniferlu717/KrakenTools (for manipulating Kraken2/Bracken outputs)