inhoinno.github.io / serving notes

KV‑Cache × HBM Tier Visualizer

Shows how prompt length, concurrency, and session turns push the KV cache out of GPU memory and down the tiering hierarchy (HBM → CPU DRAM → remote DRAM → disk mount) for a given model at a chosen HBM utilization.
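The spill order described above can be sketched as a simple waterfall: fill the HBM KV pool first, then push whatever does not fit into each lower tier in turn. This is a minimal sketch, not the page's own script. The function name `spillAcrossTiers`, the tier object shape, and the capacities are all assumptions made for illustration.

```javascript
// Hypothetical helper (not from the page's script): place a KV demand into
// tiers in order. Whatever does not fit in one tier overflows into the next.
// All quantities are in GiB.
function spillAcrossTiers(demandGiB, tiers) {
  let remaining = demandGiB;
  const placed = tiers.map(({ name, capacityGiB }) => {
    const usedGiB = Math.min(remaining, capacityGiB);
    remaining -= usedGiB;
    return { name, usedGiB };
  });
  // Anything still left after the last tier is the "beyond / OOM" portion.
  return { placed, beyondGiB: remaining };
}

// Illustrative capacities only; these are not values taken from the page.
const exampleTiers = [
  { name: "HBM KV pool", capacityGiB: 40 },
  { name: "L2 CPU DRAM", capacityGiB: 64 },
  { name: "L3 remote DRAM", capacityGiB: 128 },
  { name: "L4 disk mount", capacityGiB: 512 },
];

const spillResult = spillAcrossTiers(150, exampleTiers);
console.log(spillResult.placed, "beyond:", spillResult.beyondGiB);
```

With these made-up capacities, a 150 GiB demand fills HBM and L2 completely and puts the remaining 46 GiB in L3. A demand larger than the sum of all tiers ends up with a non-zero `beyondGiB`.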

Model & Hardware

model

weights

KV dtype

HBM total GiB

HBM util 0.90

off‑GPU tiers GiB

Workload

KV scope — what fills the tiers

sessions holding KV (concurrency) 8

num‑prompts (distinct, retained) 100

session‑turns 100

input‑len / turn 4096

output / turn 1024

max‑model‑length 32768

prefix shared ratio 0.00
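One plausible way to turn the workload controls above into a peak token count is sketched below. It rests on assumptions the page does not confirm: each turn appends input plus output tokens to the session's context, the context is capped at max‑model‑length, and the shared prefix is stored once rather than once per session. It models neither the KV scope selector nor num‑prompts, and `peakKvTokens` is a hypothetical name.

```javascript
// Hypothetical estimate of peak KV tokens held across concurrent sessions.
// Assumptions (not confirmed by the page): context grows by
// inputLen + outputLen per turn, is capped at maxModelLen, and the shared
// prefix is counted once for all sessions.
function peakKvTokens({ sessions, turns, inputLen, outputLen, maxModelLen, prefixSharedRatio }) {
  const perSession = Math.min(turns * (inputLen + outputLen), maxModelLen);
  const sharedPrefix = Math.round(perSession * prefixSharedRatio);
  const total = sharedPrefix + sessions * (perSession - sharedPrefix);
  return { perSession, sharedPrefix, total };
}

// The default control values shown on the page.
const defaults = peakKvTokens({
  sessions: 8,
  turns: 100,
  inputLen: 4096,
  outputLen: 1024,
  maxModelLen: 32768,
  prefixSharedRatio: 0.0,
});
console.log(defaults);
```

Under these assumptions the defaults saturate the context cap: 100 turns of 5120 tokens far exceed 32768, so each of the 8 sessions holds 32768 tokens, 262144 in total. Multiplying that total by KV/token gives the byte demand to place across the tiers.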

KV / token

weights

HBM KV pool

peak KV

lands in L2

lands in L3

lands in L4

HBM mem used

Legend: HBM | L2 CPU DRAM | L3 remote DRAM | L4 disk mount | beyond / OOM | per‑session context (tokens) | shared prefix (tokens)

KV/token (GQA) = 2 · layers · kv_heads · head_dim · dtype_bytes (the factor 2 counts keys and values)
KV/token (MLA) = layers · (kv_lora_rank + rope_dim) · dtype_bytes (one compressed latent plus the RoPE key per layer)
HBM KV pool = HBM_total · util − weights − overhead
Edit model specs in the MODELS object near the top of the script. Capacities are binary GiB (1 GiB = 2³⁰ B).
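The three formulas above translate directly into code. The sketch below is illustrative and is not the page's script: the spec field names, the two example configurations, and the 2 GiB overhead are assumptions, and they may not match the shape of the MODELS object.

```javascript
const GIB = 2 ** 30; // binary GiB, as on the page

// KV bytes per token for grouped-query attention: keys and values (factor 2)
// for every layer and every KV head.
function kvBytesPerTokenGQA({ layers, kvHeads, headDim, dtypeBytes }) {
  return 2 * layers * kvHeads * headDim * dtypeBytes;
}

// KV bytes per token for multi-head latent attention: one compressed latent
// plus the RoPE key dimensions per layer.
function kvBytesPerTokenMLA({ layers, kvLoraRank, ropeDim, dtypeBytes }) {
  return layers * (kvLoraRank + ropeDim) * dtypeBytes;
}

// HBM left for KV cache after the utilization cap, weights, and overhead.
function hbmKvPoolGiB({ hbmTotalGiB, util, weightsGiB, overheadGiB }) {
  return hbmTotalGiB * util - weightsGiB - overheadGiB;
}

// Illustrative specs (hypothetical field names and values).
const gqaBytes = kvBytesPerTokenGQA({ layers: 32, kvHeads: 8, headDim: 128, dtypeBytes: 2 });
const mlaBytes = kvBytesPerTokenMLA({ layers: 61, kvLoraRank: 512, ropeDim: 64, dtypeBytes: 2 });
const poolGiB = hbmKvPoolGiB({ hbmTotalGiB: 80, util: 0.9, weightsGiB: 16, overheadGiB: 2 });

// How many tokens of GQA KV cache fit in the pool.
const tokensInPool = Math.floor((poolGiB * GIB) / gqaBytes);
console.log({ gqaBytes, mlaBytes, poolGiB, tokensInPool });
```

For the example GQA spec, KV/token comes to 131072 B (128 KiB). The MLA spec gives 70272 B despite having almost twice as many layers, which is why MLA models spill out of HBM later at the same context length.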