Kraken2 small exercise

Try using Kraken2 (and maybe metaphlan or other tools as well) on very simple datasets.

This small exercise allows to tinker with Kraken2 in a simplified dataset.

We’ll download a reference genome, simulate sequencing reads, and classify them to verify the classification accuracy.

Overview

We’ll use Bilophila sp. as our test organism, simulating paired-end sequencing reads and classifying them with Kraken2.

Prerequisites

datasets (NCBI Datasets command-line tools)
seqfu (sequence manipulation toolkit)
kraken2 (taxonomic classifier)
Access to a Kraken2 database (e.g., PlusPF)

You can create a new environment with the required tools:

mamba create -n kraken2-demo -c conda-forge -c bioconda \
    ncbi-datasets-cli "seqfu>=1.20" kraken2>=2.0.8 krakentools

Step 1: Download Reference Genome

Download the Bilophila sp. 4_1_30 genome from NCBI using the datasets tool:

# Download genome assembly using accession number
datasets download genome accession GCF_000224655.1

# Extract the downloaded archive
unzip ncbi_dataset.zip

use ls and find to explore the extracted files. Identify the path to the FASTA file and use seqfu stats -nt to check the genome size and number of contigs.

Step 2: Chop your genome in fragments

Use seqfu shred to fragment the genome into simulated paired-end reads:

# Create output directory
mkdir kraken-test

# Generate synthetic paired-end reads
fu-shred -l 100 -s 3 \
  -o kraken-test/bwa \
  ncbi_dataset/data/GCF_000224655.1/GCF_000224655.1_Bilophila_sp_4_1_30_V1_genomic.fna

Parameters:

-l 100: Read length of 100 bp
-s 3: Step size of 3 bp (one new fragment every 3 bases)
-o kraken-test/bwa: Output prefix for generated files

Output: Creates bwa_R1.fq and bwa_R2.fq (paired-end FASTQ files)

use gzip to compress the FASTQ files to save space.

Step 3: Classify Reads with Kraken2

Run Kraken2 to taxonomically classify the simulated reads:

# Navigate to output directory
cd kraken-test

# Run Kraken2 classification
kraken2 --db /shared/public/db/kraken2/pluspf_16gb/ \
   --memory-mapping \
   --threads 8 \
   --report bwa.tsv \
   --paired \
   bwa_R1.fq bwa_R2.fq > bwa.raw

Parameters:

--db: Path to Kraken2 database (PlusPF 16GB version)
--memory-mapping: Load database into memory for faster processing
--threads 8: Use 8 CPU threads
--report bwa.tsv: Generate human-readable classification report
--paired: Process files as paired-end reads
> bwa.raw: Redirect per-read classifications to file

Outputs:

bwa.raw: Per-read classification results
bwa.tsv: Summary report with taxonomic distribution

Exercise

Explore tha “raw” output: this is the per-read classification made by Kraken2.

use less -S or vd to have a first look. Then you can use cut, sort, uniq -c on specific columns.

The report file (bwa.tsv) contains the summary of classifications at different taxonomic levels. It aggregates the results from the raw output.

Try using Metaphlan, can you check if there is a higher specificity (lower false positive rate) compared to Kraken2?

Output files

Previous submodule:

Pavian