How prompt length, concurrency, and number of turns push the KV cache out of GPU memory and down the
tiering hierarchy (HBM → CPU DRAM → remote DRAM → disk mount) for a given model at a chosen HBM utilization.
Model & Hardware
Workload
Computed metrics: KV / token, weights, HBM KV pool, peak KV, lands in L2, lands in L3, lands in L4, HBM mem used.
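The page does not show how these metrics are derived, so the following is only a minimal sketch of one common way to estimate them. It assumes a standard transformer KV layout (one K and one V vector per layer per KV head), that the HBM KV pool is whatever remains of the utilization budget after weights, and that a shared prefix is stored once rather than per session. `ModelSpec` and all function and parameter names are hypothetical, and the example numbers are illustrative, not taken from the page.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    # Hypothetical inputs; none of these values come from the page.
    num_layers: int
    num_kv_heads: int
    head_dim: int
    num_params: float            # total parameter count
    kv_dtype_bytes: int = 2      # assumes a 16-bit KV cache
    weight_dtype_bytes: int = 2  # assumes 16-bit weights


def kv_bytes_per_token(m: ModelSpec) -> int:
    # One K and one V vector per layer per KV head (assumed layout).
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.kv_dtype_bytes


def weight_bytes(m: ModelSpec) -> float:
    return m.num_params * m.weight_dtype_bytes


def hbm_kv_pool_bytes(m: ModelSpec, hbm_bytes: float, hbm_utilization: float) -> float:
    # Assumption: the usable budget is utilization * HBM, and the part
    # not taken by weights is available for KV cache.
    return max(0.0, hbm_bytes * hbm_utilization - weight_bytes(m))


def peak_kv_bytes(m: ModelSpec, sessions: int, context_tokens: int,
                  shared_prefix_tokens: int) -> int:
    # Assumption: the shared prefix is cached once; only the remainder
    # of each session's context is unique to that session.
    prefix = min(shared_prefix_tokens, context_tokens)
    unique = context_tokens - prefix
    return (prefix + sessions * unique) * kv_bytes_per_token(m)


if __name__ == "__main__":
    spec = ModelSpec(num_layers=32, num_kv_heads=8, head_dim=128, num_params=8e9)
    print("KV / token (bytes):", kv_bytes_per_token(spec))
    print("weights (bytes):", weight_bytes(spec))
    print("HBM KV pool (bytes):", hbm_kv_pool_bytes(spec, 80e9, 0.9))
    print("peak KV (bytes):", peak_kv_bytes(spec, 10, 4096, 1024))
```

Multi-turn growth is left out here; under the same assumptions it could be modeled by increasing `context_tokens` with each turn.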
Chart legend: HBM, L2 CPU DRAM, L3 remote DRAM, L4 disk mount, beyond / OOM. Chart labels: per-session context (tokens), shared prefix (tokens).
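The tier labels suggest a waterfall: KV cache fills HBM first, and whatever does not fit lands in the next tier down, with anything past the last tier counted as "beyond / OOM". A rough sketch of that allocation follows, assuming a strict fill-in-order policy; the function name and the capacities in the example are hypothetical, and the actual tool may place data differently.

```python
def spill_across_tiers(peak_kv, tier_capacities):
    """Fill tiers in order and report how much KV lands in each.

    peak_kv: total KV cache demand, in any consistent unit.
    tier_capacities: ordered list of (tier_name, capacity) pairs.
    """
    remaining = peak_kv
    placed = {}
    for name, capacity in tier_capacities:
        take = min(remaining, capacity)  # a tier takes only what it can hold
        placed[name] = take
        remaining -= take
    placed["beyond / OOM"] = remaining   # demand that no tier could absorb
    return placed


if __name__ == "__main__":
    # Illustrative capacities in GB; not values from the page.
    tiers = [
        ("HBM", 56),
        ("L2 CPU DRAM", 100),
        ("L3 remote DRAM", 200),
        ("L4 disk mount", 1000),
    ]
    for tier, amount in spill_across_tiers(200, tiers).items():
        print(f"{tier}: {amount} GB")
```

Keeping the tiers as an ordered list rather than hard-coding four levels makes it easy to drop a tier (for example, no remote DRAM) and see how the overflow shifts.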