Try using Kraken2 (and maybe metaphlan or other tools as well) on very simple datasets.
This small exercise allows to tinker with Kraken2 in a simplified dataset.
We’ll download a reference genome, simulate sequencing reads, and classify them to verify the classification accuracy.
We’ll use Bilophila sp. as our test organism, simulating paired-end sequencing reads and classifying them with Kraken2.
datasets (NCBI Datasets command-line tools)seqfu (sequence manipulation toolkit)kraken2 (taxonomic classifier)You can create a new environment with the required tools:
mamba create -n kraken2-demo -c conda-forge -c bioconda \
ncbi-datasets-cli "seqfu>=1.20" kraken2>=2.0.8 krakentools
Download the Bilophila sp. 4_1_30 genome from NCBI using the datasets tool:
# Download genome assembly using accession number
datasets download genome accession GCF_000224655.1
# Extract the downloaded archive
unzip ncbi_dataset.zip
use ls and find to explore the extracted files. Identify the path to the FASTA file and use seqfu stats -nt to check the genome size and number of contigs.
Use seqfu shred to fragment the genome into simulated paired-end reads:
# Create output directory
mkdir kraken-test
# Generate synthetic paired-end reads
fu-shred -l 100 -s 3 \
-o kraken-test/bwa \
ncbi_dataset/data/GCF_000224655.1/GCF_000224655.1_Bilophila_sp_4_1_30_V1_genomic.fna
Parameters:
-l 100: Read length of 100 bp-s 3: Step size of 3 bp (one new fragment every 3 bases)-o kraken-test/bwa: Output prefix for generated filesOutput: Creates bwa_R1.fq and bwa_R2.fq (paired-end FASTQ files)
use gzip to compress the FASTQ files to save space.
Run Kraken2 to taxonomically classify the simulated reads:
# Navigate to output directory
cd kraken-test
# Run Kraken2 classification
kraken2 --db /shared/public/db/kraken2/pluspf_16gb/ \
--memory-mapping \
--threads 8 \
--report bwa.tsv \
--paired \
bwa_R1.fq bwa_R2.fq > bwa.raw
Parameters:
--db: Path to Kraken2 database (PlusPF 16GB version)--memory-mapping: Load database into memory for faster processing--threads 8: Use 8 CPU threads--report bwa.tsv: Generate human-readable classification report--paired: Process files as paired-end reads> bwa.raw: Redirect per-read classifications to fileOutputs:
bwa.raw: Per-read classification resultsbwa.tsv: Summary report with taxonomic distributionExplore tha “raw” output: this is the per-read classification made by Kraken2.
use less -S or vd to have a first look. Then you can use cut, sort, uniq -c on specific columns.
The report file (bwa.tsv) contains the summary of classifications at different taxonomic levels. It aggregates the results from the raw output.
Try using Metaphlan, can you check if there is a higher specificity (lower false positive rate) compared to Kraken2?