Simulations

Interactive diagrams for concepts I work on: CXL memory tiering, NVMe FDP placement, virtual-memory translation, and vLLM PagedAttention. Click the controls under each diagram to step through.

CXL — coherent memory tiering

CXL.mem attaches memory to the host across PCIe with cache coherence, sitting between local DRAM and SSD on the latency curve. The diagram below issues a load to the chosen tier; the dot's travel time scales logarithmically with real-world latency.

[Interactive diagram: a CPU core with caches issues loads to one of three tiers — local DRAM (~80 ns, ~32 GB, DDR5), CXL.mem Type 3 (~250 ns, ~512 GB, coherent over PCIe), or NVMe SSD (~50 µs, ~4 TB, block-addressed) — with readouts for last latency, average latency, and request count.]

Animation duration is log-scaled so you can feel the gap; raw latency is shown numerically.
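The log scaling the note describes can be sketched in a few lines of Python. The constants (`base_ms`, `ms_per_decade`) are illustrative assumptions, not the values the diagram actually uses:

```python
import math

# Latency per tier in nanoseconds, matching the diagram's labels.
TIERS = {"dram": 80, "cxl": 250, "ssd": 50_000}

def anim_duration_ms(latency_ns, base_ms=300, ms_per_decade=400):
    # Each decade of latency adds a fixed amount of animation time,
    # so a ~600x latency gap becomes roughly a 2x duration gap.
    return base_ms + ms_per_decade * math.log10(latency_ns)

for tier, ns in TIERS.items():
    print(f"{tier}: {ns} ns -> {anim_duration_ms(ns):.0f} ms")
```

Linear scaling would make the SSD animation hundreds of times longer than the DRAM one; the log mapping keeps every tier watchable while preserving the ordering.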

FDP — Flexible Data Placement on NVMe SSDs

The host writes a stream of pages with mixed lifetimes — some long-lived (green), some short-lived and soon to be invalidated (orange). The two SSDs receive the same workload. Without FDP, lifetimes mix in every erase block. With FDP, the host hints a Reclaim Unit Handle (RUH) per write so pages with similar lifetimes share blocks.

[Interactive diagram: the same write stream fed to two SSDs, "Without FDP — single placement" and "With FDP — RUH per lifetime", with counters for host writes, NAND writes on each drive, and WAF on each drive.]

A block is reclaimed when full; valid pages are relocated, then the block is erased. Mixed-lifetime blocks force relocations and inflate WAF.
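The WAF gap can be reproduced with a toy greedy-GC model. This is a sketch under assumed parameters (block size, TTLs, pool size are all illustrative), not the simulation the page runs: pages carry a death time, garbage collection triggers when free blocks run out and reclaims the block with the fewest valid pages, and every relocation counts as an extra NAND write.

```python
import random

PAGES_PER_BLOCK = 8
TOTAL_BLOCKS = 64        # 512 pages of NAND (illustrative)
SHORT_TTL = 20           # short-lived pages are invalidated ~20 host writes later
LONG_TTL = 700           # long-lived pages survive much longer

def simulate(fdp, n_writes=20_000, seed=0):
    """Return the write-amplification factor (NAND writes / host writes)."""
    rng = random.Random(seed)
    streams = ("short", "long") if fdp else ("all",)
    open_blk = {s: [] for s in streams}   # one open erase block per RUH
    sealed = []                           # full blocks: lists of (stream, death_time)
    nand = 0
    clock = 0

    def write(stream, death):
        nonlocal nand
        nand += 1
        key = stream if fdp else "all"
        open_blk[key].append((stream, death))
        if len(open_blk[key]) == PAGES_PER_BLOCK:
            sealed.append(open_blk[key])
            open_blk[key] = []

    for _ in range(n_writes):
        clock += 1
        if rng.random() < 0.5:
            write("short", clock + SHORT_TTL)
        else:
            write("long", clock + LONG_TTL)
        # Out of free blocks: greedy GC reclaims the sealed block with the
        # fewest still-valid pages, relocating those before the erase.
        while len(sealed) > TOTAL_BLOCKS - len(streams) - 1:
            victim = min(sealed, key=lambda b: sum(d > clock for _, d in b))
            sealed.remove(victim)
            for stream, death in victim:
                if death > clock:         # valid page: relocation = extra NAND write
                    write(stream, death)

    return nand / n_writes
```

With lifetimes segregated, short-lived blocks are almost entirely dead by reclaim time, so greedy GC frees them nearly for free; mixed blocks always retain their long-lived pages, so `simulate(fdp=False)` comes out above `simulate(fdp=True)`.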

Virtual memory — page-table walk and TLB

The CPU emits a virtual address split into a virtual page number (VPN) and offset. The TLB caches recent VPN→PFN translations; on a miss, the MMU walks the page table to find the physical frame, then refills the TLB.

[Interactive diagram: a virtual address (VA = 0x00000) split into VPN and offset, flowing through a 4-entry TLB and an 8-entry page table to physical memory, with counters for accesses, TLB hits, TLB misses, and hit rate.]

Single-level page table for clarity (real x86-64 uses four levels, five with LA57). The TLB uses LRU eviction.
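The translation path reduces to a few lines: split the address, probe the TLB, and walk the (here single-level) page table on a miss. A minimal sketch with an assumed 12-bit offset (4 KiB pages), using `OrderedDict` for LRU order:

```python
from collections import OrderedDict

PAGE_BITS = 12                           # 4 KiB pages: offset = low 12 bits

class TLB:
    def __init__(self, entries=4):
        self.entries = entries
        self.map = OrderedDict()         # VPN -> PFN, oldest entry first
        self.hits = self.misses = 0

    def translate(self, vaddr, page_table):
        vpn = vaddr >> PAGE_BITS
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        if vpn in self.map:
            self.hits += 1
            self.map.move_to_end(vpn)    # refresh LRU position
            pfn = self.map[vpn]
        else:
            self.misses += 1
            pfn = page_table[vpn]        # the "walk" (single level here)
            self.map[vpn] = pfn          # refill the TLB
            if len(self.map) > self.entries:
                self.map.popitem(last=False)   # evict least recently used
        return (pfn << PAGE_BITS) | offset
```

Two accesses to the same page produce one miss (the walk and refill) followed by a hit, which is exactly the behavior the counters above track.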

vLLM — PagedAttention KV cache

Each LLM request grows its KV cache one token at a time. Reserving the maximum sequence length up front (left) wastes GPU memory because most requests finish early. PagedAttention (right) splits the KV cache into fixed-size blocks and allocates them on demand from a shared pool, with each request keeping a logical-to-physical block table — like a page table for the KV cache.

[Interactive diagram: "Contiguous reservation" vs "PagedAttention" side by side, with counters for step, memory utilization, and OOM events under each strategy.]

Block size = 4 tokens. Pool = 24 blocks. Each request finishes with a fixed probability per step, so sequence lengths are geometrically distributed.
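The bookkeeping behind the paged side is small: a free pool of physical blocks plus a per-request block table, with a new block allocated only when a token crosses a block boundary. A simplified sketch of that mechanism (not vLLM's actual API), using the parameters above:

```python
BLOCK_TOKENS = 4                     # tokens per KV-cache block

class BlockPool:
    def __init__(self, n_blocks):
        self.free = list(range(n_blocks))   # physical block ids

class Request:
    def __init__(self):
        self.tokens = 0
        self.block_table = []        # logical block index -> physical block id

    def append_token(self, pool):
        if self.tokens % BLOCK_TOKENS == 0:        # current block full (or none yet)
            if not pool.free:
                # in vLLM this would trigger preemption/swapping, not a crash
                raise MemoryError("block pool exhausted")
            self.block_table.append(pool.free.pop())
        self.tokens += 1

    def release(self, pool):
        # finished request returns its blocks to the shared pool
        pool.free.extend(self.block_table)
        self.block_table.clear()

pool = BlockPool(24)                 # matches the diagram's 24-block pool
req = Request()
for _ in range(9):                   # a 9-token sequence
    req.append_token(pool)
print(len(req.block_table), len(pool.free))   # 3 blocks used, 21 free
```

Internal fragmentation is bounded by one partially filled block per request (here at most 3 tokens), whereas contiguous reservation wastes the entire unused tail of every max-length allocation.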