Simulations
Interactive diagrams for concepts I work on: CXL memory tiering, NVMe FDP placement, virtual-memory translation, and vLLM PagedAttention. Click the controls under each diagram to step through.
CXL — coherent memory tiering
CXL.mem attaches memory to the host across PCIe with cache coherence, sitting between local DRAM and SSD on the latency curve. The diagram below issues a load to the chosen tier; the dot's travel time scales logarithmically with real-world latency.
Animation duration is log-scaled so you can feel the gap; raw latency is shown numerically.
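The log-scaling above can be sketched in a few lines. The tier latencies here are illustrative round numbers (real figures vary by platform), and `animation_ms` is a hypothetical helper, not the diagram's actual code:

```python
import math

# Illustrative latencies in nanoseconds; real numbers vary by platform.
TIERS = {
    "local DRAM": 80,
    "CXL.mem": 250,
    "NVMe SSD": 80_000,
}

def animation_ms(latency_ns, base_ms=200):
    """Map a raw latency onto an animation duration logarithmically,
    so the gap between tiers stays visible on screen."""
    return base_ms * math.log10(latency_ns + 1)

for tier, ns in TIERS.items():
    print(f"{tier:>10}: {ns:>7} ns -> {animation_ms(ns):6.0f} ms animation")
```

Linearly, the SSD dot would travel a thousand times longer than the DRAM dot; log-scaling compresses that to a ratio you can actually watch.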
FDP — Flexible Data Placement on NVMe SSDs
The host writes a stream of pages with mixed lifetimes — some long-lived (green), some short-lived and soon to be invalidated (orange). The two SSDs receive the same workload. Without FDP, lifetimes mix in every erase block. With FDP, the host hints a Reclaim Unit Handle (RUH) per write so pages with similar lifetimes share blocks.
A block is reclaimed when full; valid pages are relocated, then the block is erased. Mixed-lifetime blocks force relocations and inflate the write amplification factor (WAF).
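A toy greedy-reclaim model makes the WAF effect concrete. This is a sketch under simplifying assumptions (half the writes are short-lived, every short-lived page is invalidated before reclaim, blocks are only reclaimed if they contain invalid pages), not a model of any real SSD's garbage collector:

```python
import random

BLOCK = 16          # pages per erase block (assumed)
HOT_FRACTION = 0.5  # fraction of writes that are short-lived ("hot")

def simulate(fdp: bool, blocks: int = 200, seed: int = 0) -> float:
    """Fill blocks with hot/cold pages, invalidate all hot pages,
    then reclaim blocks containing invalid pages. Surviving (cold)
    pages in a reclaimed block must be rewritten -> amplification."""
    rng = random.Random(seed)
    host_writes = relocations = 0
    for _ in range(blocks):
        if fdp:
            # RUH hint: hot and cold streams land in separate blocks,
            # so each block is uniformly hot or uniformly cold.
            pages = [rng.random() < HOT_FRACTION] * BLOCK
        else:
            # No hint: lifetimes mix within every block.
            pages = [rng.random() < HOT_FRACTION for _ in range(BLOCK)]
        host_writes += BLOCK
        invalid = sum(pages)
        if invalid:  # only blocks holding invalid pages are reclaimed
            relocations += BLOCK - invalid
    return (host_writes + relocations) / host_writes

print(f"WAF without FDP: {simulate(False):.2f}")
print(f"WAF with FDP:    {simulate(True):.2f}")
```

With hints, hot blocks die entirely (nothing to relocate) and cold blocks never need reclaiming, so WAF collapses to 1.0; without hints, every reclaimed block drags its cold pages along.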
Virtual memory — page-table walk and TLB
The CPU emits a virtual address split into a virtual page number (VPN) and offset. The TLB caches recent VPN→PFN translations; on a miss, the MMU walks the page table to find the physical frame, then refills the TLB.
Single-level page table for clarity (real x86-64 uses four levels, five with the LA57 extension). TLB uses LRU eviction.
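The lookup path above can be written out directly. This is a minimal sketch of the diagram's model (single-level table, fully associative TLB, LRU via an ordered dict), not how any real MMU is implemented:

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # 4 KiB pages, single-level table for clarity

class TLB:
    """Tiny fully-associative TLB with LRU eviction."""
    def __init__(self, entries: int = 4):
        self.entries = entries
        self.map = OrderedDict()  # VPN -> PFN, most recently used last
        self.hits = self.misses = 0

    def translate(self, vaddr: int, page_table: dict) -> int:
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.map:
            self.hits += 1
            self.map.move_to_end(vpn)          # refresh LRU position
        else:
            self.misses += 1
            pfn = page_table[vpn]              # "walk" the page table
            if len(self.map) >= self.entries:  # evict least recently used
                self.map.popitem(last=False)
            self.map[vpn] = pfn
        return self.map[vpn] * PAGE_SIZE + offset

page_table = {vpn: vpn + 100 for vpn in range(16)}  # arbitrary VPN->PFN map
tlb = TLB(entries=4)
for vaddr in [0x0000, 0x1000, 0x0008, 0x5000, 0x2000, 0x3000, 0x0010]:
    tlb.translate(vaddr, page_table)
print(f"hits={tlb.hits} misses={tlb.misses}")  # -> hits=2 misses=5
```

Note the two accesses to page 0 after the first miss: the first hits while page 0 is still resident, and the last still hits because every touch refreshes its LRU position, so page 1 is evicted instead.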
vLLM — PagedAttention KV cache
Each LLM request grows its KV cache one token at a time. Reserving the maximum sequence length up front (left) wastes GPU memory because most requests finish early. PagedAttention (right) splits the KV cache into fixed-size blocks and allocates them on demand from a shared pool, with each request keeping a logical-to-physical block table — like a page table for the KV cache.
Block size = 4 tokens. Pool = 24 blocks. Each request finishes with a fixed probability at every step, so request lifetimes are geometrically distributed.
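The block table mechanism can be sketched as a free-list allocator. This is a toy model of the idea, not vLLM's actual block manager; the class and method names are invented for illustration:

```python
BLOCK_TOKENS = 4   # tokens per KV block (matches the diagram)
POOL_BLOCKS = 24   # shared physical pool size

class BlockPool:
    """Shared pool of physical KV-cache blocks."""
    def __init__(self, n: int):
        self.free = list(range(n))
    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("KV pool exhausted")
        return self.free.pop()
    def release(self, blocks: list) -> None:
        self.free.extend(blocks)

class Request:
    """Grows one token at a time; block_table maps logical -> physical."""
    def __init__(self, pool: BlockPool):
        self.pool = pool
        self.tokens = 0
        self.block_table = []  # logical block i -> physical block id
    def append_token(self) -> None:
        if self.tokens % BLOCK_TOKENS == 0:  # current block full (or none yet)
            self.block_table.append(self.pool.alloc())
        self.tokens += 1
    def finish(self) -> None:
        self.pool.release(self.block_table)  # blocks return to shared pool
        self.block_table = []

pool = BlockPool(POOL_BLOCKS)
a, b = Request(pool), Request(pool)
for _ in range(6):
    a.append_token()  # 6 tokens -> 2 blocks
for _ in range(3):
    b.append_token()  # 3 tokens -> 1 block
print(len(a.block_table), len(b.block_table), len(pool.free))  # 2 1 21
a.finish()            # a ends early: its 2 blocks go back to the pool
print(len(pool.free))  # 23
```

The contrast with the left-hand diagram: reserving the maximum sequence length would pin blocks for `a`'s whole potential lifetime, while here its two blocks are reusable by other requests the moment it finishes.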
