inhoinno.github.io  /  serving notes

KV‑Cache × HBM Tier Visualizer

Shows how prompt length, concurrency, and conversation turns push the KV cache out of GPU memory (HBM) and down the tiering hierarchy (HBM → L2 CPU DRAM → L3 remote DRAM → L4 disk mount) for a given model under a chosen HBM utilization.

Model & Hardware

Workload

KV / token
weights
HBM KV pool
peak KV
lands in L2
lands in L3
lands in L4
HBM mem used
Tier legend: HBM · L2 CPU DRAM · L3 remote DRAM · L4 disk mount · beyond / OOM
Chart labels: per‑session context (tokens), shared prefix (tokens)
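The readouts above can be approximated with a few lines of arithmetic. The following is a minimal sketch, not the page's actual implementation. It assumes a standard attention layout (one K and one V vector per layer per KV head), that the HBM KV pool is whatever remains of the utilized HBM after weights, that the shared prefix is stored once across sessions, and that tiers fill strictly in order. The names `Model`, `kv_bytes_per_token`, `hbm_kv_pool`, `peak_kv_bytes`, and `place` are hypothetical, as are all example numbers.

```python
from dataclasses import dataclass

GiB = 1024 ** 3


@dataclass
class Model:
    # Hypothetical model description; field names are illustrative.
    n_layers: int
    n_kv_heads: int
    head_dim: int
    dtype_bytes: int   # e.g. 2 for fp16/bf16
    weight_bytes: int


def kv_bytes_per_token(m: Model) -> int:
    # "KV / token": one K and one V vector per layer per KV head.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.dtype_bytes


def hbm_kv_pool(hbm_bytes: int, utilization: float, weight_bytes: int) -> int:
    # "HBM KV pool": assumed to be the utilized HBM minus the weights.
    return max(0, int(hbm_bytes * utilization) - weight_bytes)


def peak_kv_bytes(m: Model, sessions: int, context_tokens: int,
                  shared_prefix_tokens: int) -> int:
    # "peak KV": assumes the shared prefix is cached once and every
    # session adds only its unique suffix.
    shared = min(shared_prefix_tokens, context_tokens)
    unique = context_tokens - shared
    return (shared + sessions * unique) * kv_bytes_per_token(m)


def place(demand: int, tiers: list[tuple[str, int]]) -> dict[str, int]:
    # Fill tiers in order (HBM first); whatever is left is "beyond / OOM".
    placed = {}
    remaining = demand
    for name, capacity in tiers:
        take = min(remaining, capacity)
        placed[name] = take
        remaining -= take
    placed["beyond / OOM"] = remaining
    return placed


if __name__ == "__main__":
    # Illustrative values only: 32 layers, 8 KV heads, head_dim 128, fp16.
    m = Model(n_layers=32, n_kv_heads=8, head_dim=128,
              dtype_bytes=2, weight_bytes=16 * GiB)
    pool = hbm_kv_pool(80 * GiB, 0.9, m.weight_bytes)
    peak = peak_kv_bytes(m, sessions=64, context_tokens=32768,
                         shared_prefix_tokens=2048)
    tiers = [("HBM", pool), ("L2 CPU DRAM", 128 * GiB),
             ("L3 remote DRAM", 32 * GiB), ("L4 disk mount", 1024 * GiB)]
    for name, used in place(peak, tiers).items():
        print(f"{name:>16}: {used / GiB:8.2f} GiB")
```

With these illustrative inputs, KV / token is 128 KiB and peak KV is 240.25 GiB, so an HBM pool of about 56 GiB fills completely and the remainder cascades through L2 and L3 into L4. Growing the shared prefix shrinks the per-session suffix and pulls the peak back toward HBM, which is the effect the two token labels describe.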