Serving large language models efficiently in production is a hard problem. Traditional inference engines waste up to 80% of the memory reserved for the KV cache through fragmentation and over-allocation, leading to poor throughput and high costs. This talk dives into PagedAttention, a technique that borrows virtual memory paging from operating systems to solve this problem, and vLLM, the open-source engine built on top of it.
We'll cover the theory, walk through using vLLM in practice, and review benchmark results showing up to 24× throughput improvements. We'll close with how vLLM has been rapidly adopted across the industry and why PagedAttention has become a foundational primitive in LLM serving.
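
To make the paging analogy concrete, here is a minimal Python sketch of the bookkeeping idea: each sequence keeps a block table that maps its logical KV positions onto fixed-size physical blocks, and a new block is allocated only when the previous one fills up. This is an illustration under stated assumptions, not vLLM's actual implementation; the names `BlockAllocator`, `Sequence`, and `BLOCK_SIZE`, and the block size of 16, are hypothetical and chosen for this example only.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (hypothetical value for illustration)


class BlockAllocator:
    """Hands out fixed-size physical blocks from a free list."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_ids: list[int]) -> None:
        # Returned blocks can be reused by any other sequence.
        self.free.extend(block_ids)


@dataclass
class Sequence:
    """Tracks one request's tokens and its logical-to-physical block table."""

    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def wasted_slots(self) -> int:
        # Unused slots exist only in the final, partially filled block.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens


allocator = BlockAllocator(num_blocks=8)
seq = Sequence()
for _ in range(20):
    seq.append_token(allocator)

# 20 tokens occupy 2 blocks of 16 slots, leaving 12 slots unused.
print(len(seq.block_table), seq.wasted_slots())
```

Because memory is claimed one block at a time, the unused space per sequence is bounded by a single block rather than by a worst-case reservation for the maximum output length, which is the intuition behind the memory savings described above.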