This page assumes you already know Slurm basics and just need the QIB/NBI-specific mapping: which partitions to use, what limits to expect, and the commands that explain “why is my job pending / slow”.

Mental model: where to do what

  • Login nodes (normally 4 behind a load balancer) are currently offline.
  • Use the software node (`ssh software` from a login node) only for:
    • editing job scripts
    • compiling software
    • downloading containers/data

Actual compute happens on compute nodes via Slurm partitions (queues).

QIB partitions: what to pick

Quick decision table

| Use case | Partition | Time limit |
| --- | --- | --- |
| quick tests / small jobs | qib-short | 02:00:00 |
| most “real” analyses | qib-medium | 2-00:00:00 |
| very long jobs | qib-long | unlimited MaxTime (default is long) |
| interactive debugging | qib-interactive | up to 90-00:00:00 |
| GPU workloads | qib-gpu | unlimited MaxTime (default 7-00:00:00) |

Hardware notes (from current node view)

  • QIB standard nodes (q512n*) are 84 CPUs, ~512 GB RAM.
  • Big-memory nodes exist:
    • q1024n1 ~1 TB RAM
    • q1536n* ~1.5 TB RAM (some are 192 CPUs)
    • q4096n2/q4096n3 ~4 TB RAM (128 CPUs)
  • GPU node: q2048n1 has 2× A100 (gres/gpu=2), ~2 TB RAM, 32 CPUs.
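Memory is the usual reason to land on one of the big-memory nodes: asking for more than a standard node's ~512 GB means only those machines can host the job. A minimal sketch of the relevant directives (the partition choice and memory figure are assumptions; check `scontrol show partition` for the real node lists and limits):

```shell
#!/usr/bin/env bash
#SBATCH --job-name=bigmem_example
#SBATCH --partition=qib-medium   # assumed; verify which partition contains the q1536n*/q4096n* nodes
#SBATCH --mem=900G               # above ~512 GB, only the big-memory nodes can satisfy this request
#SBATCH --cpus-per-task=16
#SBATCH --time=1-00:00:00

set -euo pipefail
mytool input.dat                 # hypothetical command, as elsewhere on this page
```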

NBI-wide partitions you might hear about

You may see these in documentation or in sinfo output:

  • nbi-short (2h), nbi-medium (2d), nbi-long (unlimited), nbi-interactive
  • nbi-download (max 14 days; default 7 days) — typically for controlled “download” from a limited list of whitelisted sites
  • nbi-compute — an “overlay” partition over multiple NBI nodes (policy/admin use varies)

QIB users generally use QIB partitions unless you have a specific reason/policy to use NBI-wide ones.
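If you do have a reason to use nbi-download (e.g. a large controlled transfer), it is an ordinary batch job on that partition; a hedged sketch (the URL is a placeholder, and the site-whitelist restriction above still applies):

```shell
#!/usr/bin/env bash
#SBATCH --job-name=fetch_data
#SBATCH --partition=nbi-download
#SBATCH --time=7-00:00:00        # partition default; maximum is 14 days

set -euo pipefail
wget --continue https://example.org/dataset.tar.gz   # placeholder URL; must be on the whitelist
```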


Some commands

To see if some partitions are unavailable or under heavy load:

sinfo -o "%20P %10a %10l %8D %8t %20F %N"

To list all nodes with their state, CPUs, memory, GPUs, features, and partitions:

sinfo -N -o "%20N %10t %10c %12m %20G %30f %20P"

To show a partition's policy (limits, node lists, allowed groups):

scontrol show partition qib-short
scontrol show partition qib-medium
scontrol show partition qib-long
scontrol show partition qib-gpu

Submitting jobs (QIB patterns)

  • Single-task, multi-threaded tool (typical bioinformatics)

Use --cpus-per-task and keep --ntasks=1 unless you are actually running MPI.

#!/usr/bin/env bash
#SBATCH --job-name=example_mt
#SBATCH --partition=qib-medium
#SBATCH --time=2-00:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err

set -euo pipefail

mytool --threads "$SLURM_CPUS_PER_TASK" input.fq.gz
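If the same script is ever run outside Slurm, `SLURM_CPUS_PER_TASK` is unset and the quoted expansion above becomes empty. A defensive default (a generic shell parameter-expansion pattern, not site policy; the value 4 is an arbitrary example) avoids passing an empty argument:

```shell
# Use Slurm's CPU count when present, otherwise fall back to 4 threads
THREADS="${SLURM_CPUS_PER_TASK:-4}"
echo "threads=$THREADS"
```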

Submit (assuming the script above is saved as example.sbatch):

sbatch example.sbatch

GPU jobs (A100)

💡 You need to request access to the GPU queue from Core Bioinformatics first

Request GPUs via GRES:

#!/usr/bin/env bash
#SBATCH --job-name=gpu_job
#SBATCH --partition=qib-gpu
#SBATCH --time=1-00:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:1
#SBATCH --output=slurm-%j.out

set -euo pipefail

nvidia-smi
python train.py

Interactive work (debugging / short sessions)

Allocate:

interactive

Then run commands as usual.
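If the `interactive` wrapper is unavailable in your environment, a plain Slurm equivalent looks like the following (the partition name comes from the table above; the CPU and memory figures are arbitrary examples, not site defaults):

```shell
srun --partition=qib-interactive --cpus-per-task=2 --mem=8G --pty bash -l
```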


Diagnosing jobs

1) List your jobs and the reason each is queued (and note the individual JobIDs)

squeue -u "$USER" -o "%.18i %.9P %.12j %.2t %.10M %.10l %.6D %R"

2) Why is this pending / where is it running / what did I request?

scontrol show job <JOBID>

3) After completion: CPU/memory usage and exit status

sacct -o JobID,JobName%30,Partition,State,Elapsed,Timelimit,AllocCPUS,ReqMem,MaxRSS,ExitCode -j <JOBID>
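Many Slurm sites also install the contributed seff script, which summarises the same accounting data as a one-shot efficiency report (its availability on this cluster is an assumption; fall back to sacct if it is missing):

```shell
seff <JOBID>   # reports CPU efficiency, memory efficiency, and state for a finished job
```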

Institute boundaries / partitions

NBI is composed of four institutes, and each institute has its own partitions (e.g. you may see ei-* jobs in squeue). As a QIB user, you usually submit to qib-* partitions. Being able to see other partitions just means the cluster is shared; access is governed by group membership in the partition configuration (AllowGroups=).
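To check this yourself, compare your Unix group memberships against a partition's AllowGroups field (the grep pattern is just one way to pull the field out of the scontrol output):

```shell
groups                                                 # your current group memberships
scontrol show partition qib-gpu | grep -o 'AllowGroups=[^ ]*'
```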
