“datasets” is a command-line tool to programmatically download genomes and annotations from NCBI.
NCBI “datasets” is a handy tool to quickly grab genome sequences and annotations from the NCBI database, straight to your laptop or HPC system.
It’s great if you want a fully packaged download: genome FASTA files, GFF annotation, protein FASTA files, and so on.
datasets command
First, make sure you have the command-line tool installed. If you’re using conda, you can install datasets like this:
conda create -n ncbi-datasets -c conda-forge ncbi-datasets-cli
conda activate ncbi-datasets
You can replace conda with mamba or micromamba, depending on your setup.
You can also use pre-built binaries or run datasets via Docker;
see the datasets install instructions for other options.
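Before going further, it can help to confirm that the tool is actually on your PATH. The snippet below is a minimal sketch: have_tool is a hypothetical helper name used only for illustration, not part of datasets.

```shell
# have_tool is a hypothetical helper: it succeeds if the named command is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool datasets; then
    # Print the installed version of the CLI.
    datasets --version
else
    echo "datasets not found: activate the ncbi-datasets environment first" >&2
fi
```

If the command is not found, the most likely cause is that the conda environment has not been activated in the current shell.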
We’ll use Lactococcus lactis as our example (a popular lactic acid bacterium).
If you don’t know the exact taxonomic name, you can search using NCBI’s web portal: https://www.ncbi.nlm.nih.gov/datasets/genomes
To download the latest reference genome and annotation for Lactococcus lactis, run:
datasets download genome taxon "Lactococcus lactis" \
--reference
genome taxon "Lactococcus lactis": Specify the organism (replace as you need)
--reference: Download the reference genome (not just any assembly)
This makes a ZIP file: ncbi_dataset.zip
Unzip the downloaded file to access all contents:
unzip ncbi_dataset.zip -d lactis/
Use find to check what was extracted, e.g. find lactis/ -type f.
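As a small sketch of that check, the helper below prints every regular file under a directory in sorted order. The name list_extracted is hypothetical, and the lactis directory is the one used in the unzip step above.

```shell
# list_extracted is a hypothetical helper: print every regular file under a
# directory, sorted, so the package layout is easy to scan.
list_extracted() {
    find "$1" -type f | sort
}

# Only list the directory if the unzip step has already created it.
if [ -d lactis ]; then
    list_extracted lactis
fi
```

Sorting keeps files from the same assembly next to each other, which is handy once several genomes are unpacked into one directory.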
Just change the taxon name. For a list of related genomes, you can specify higher-level taxa, e.g.:
datasets download genome taxon "Bifidobacterium"
Careful: this will download 11k genomes!
Or use NCBI Assembly Accession numbers like GCF_003176835.1:
datasets download genome accession GCF_003176835.1
To also download the annotation and other sequence files, list them with --include:
datasets download genome accession GCF_003176835.1 --include gff3,rna,cds,protein,genome,seq-report
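After unzipping a package requested with --include, you may want to see which of the requested file types actually arrived, since not every assembly has every annotation. The sketch below checks one accession directory. The helper name report_package_files is hypothetical, and the file names it looks for are assumptions about the usual package layout, so adjust them if your package differs.

```shell
# report_package_files is a hypothetical helper. The file names below are
# assumptions about the usual package layout; adjust them if yours differ.
report_package_files() {
    dir=$1
    for name in genomic.gff rna.fna cds_from_genomic.fna protein.faa sequence_report.jsonl; do
        if [ -f "$dir/$name" ]; then
            echo "found    $name"
        else
            echo "missing  $name"
        fi
    done
    # The genome FASTA is named after the assembly accession (GCA_/GCF_ prefix),
    # which also keeps cds_from_genomic.fna from matching here.
    genome=missing
    for f in "$dir"/GC[AF]_*_genomic.fna; do
        if [ -f "$f" ]; then
            genome=found
        fi
    done
    echo "$genome  genome FASTA"
}
```

For the example above, this could be called as report_package_files ncbi_dataset/data/GCF_003176835.1 from inside the directory where the ZIP was unpacked.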
For viral, fungal, or archaeal genomes, just swap in the taxon name.
All files are unpacked in your chosen directory.
You can check their formats using standard tools,
e.g. less genome.gff or head genome.fna.
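Beyond paging through the files, two quick counts give a useful sanity check: the number of sequences in a FASTA file and the number of features of each type in a GFF3 file. The helpers below are a minimal sketch using only standard tools; their names are hypothetical.

```shell
# Count sequences in a FASTA file: each record starts with a '>' header line.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Tally feature types (column 3) in a GFF3 file, skipping '#' comment lines.
# Output is sorted so that the most frequent feature type comes first.
count_gff_features() {
    grep -v '^#' "$1" | cut -f 3 | sort | uniq -c | sort -rn
}
```

For example, count_fasta_records genome.fna prints the number of contigs or chromosomes, and count_gff_features genome.gff shows how many gene, CDS, and other features the annotation contains.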
For the Lactococcus lactis experiment you can try:
seqfu stats -bnt lactis/ncbi_dataset/data/GCF_003176835.1/GCF_003176835.1_ASM317683v1_genomic.fna
For multiple taxa in a bash script:
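for taxon in "Lactococcus lactis" "Bifidobacterium animalis" "Streptococcus thermophilus"; do
  dir="${taxon// /_}"
  # --filename gives each package its own name instead of the default ncbi_dataset.zip
  datasets download genome taxon "$taxon" --reference --dehydrated --filename "$dir.zip"
  unzip "$dir.zip" -d "$dir/"
  # a dehydrated package contains only metadata; rehydrate fetches the sequence files
  datasets rehydrate --directory "$dir/"
done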