Deploying a large language model is not the same as training one. Training is a batch job — you throw compute at a dataset and wait. Inference serving is an engineering problem with real-time constraints, cost pressures, and wildly unpredictable demand. The difference matters more than most teams realize until their GPU bill arrives.
A single H100 GPU costs roughly $30,000. If your serving system keeps it 20% utilized while waiting for requests to finish, you're burning money at an alarming rate. The challenge is that language model requests vary dramatically — a short query might need 50 tokens, a document summary might need 4,000. Naive approaches leave expensive hardware idle.
The systems-level optimizations that solve this problem — continuous batching, paged memory management, and careful latency-throughput calibration — are what separate a research demo from an economically viable product. These aren't model improvements. They're infrastructure improvements. And right now, they're where much of the real engineering is happening.
Continuous Batching: Ending the Tyranny of the Slowest Request
Traditional batching is simple: collect a group of requests, process them together, return all results when the last one finishes. This works fine when all requests are roughly the same size. For language models, they almost never are. A request generating 20 tokens sits idle while the system waits for another request generating 2,000 tokens to complete. Static batching means your fastest requests are held hostage by your slowest.
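A back-of-envelope calculation makes the waste concrete (the request lengths here are hypothetical):

```python
# Static batching: every slot stays occupied until the longest request finishes.
# Hypothetical batch of 4 requests with very different output lengths.
output_lens = [20, 150, 400, 2000]      # tokens generated per request

batch_steps = max(output_lens)          # decode steps the batch holds the GPU
useful = sum(output_lens)               # slot-steps doing real work
total = batch_steps * len(output_lens)  # slot-steps reserved
waste = 1 - useful / total

print(f"Useful work: {useful}/{total} slot-steps "
      f"({waste:.0%} of batch capacity idle)")
# → Useful work: 2570/8000 slot-steps (68% of batch capacity idle)
```

Two-thirds of the batch's capacity is spent holding finished requests open, purely because of length variance.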
Continuous batching — sometimes called iteration-level batching — solves this by operating at the granularity of individual decode steps rather than complete requests. When a request finishes generating tokens, its slot is immediately freed and a new request from the queue is inserted. The batch is never static; it's a living, rotating set of in-progress requests sharing GPU cycles at every forward pass.
The impact on GPU utilization is dramatic. Research from systems like Orca showed that continuous batching can improve throughput by 2–8× over static batching under realistic workloads. The improvement scales with variance in request length — the more diverse your traffic, the bigger the win. For production systems handling a mix of chat completions, code generation, and summarization, this variance is the norm, not the exception.
Implementation isn't trivial, though. Continuous batching requires a scheduler that can manage request insertion and removal mid-batch without corrupting the attention state of active requests. It also requires careful handling of the prefill phase (processing the input prompt) versus the decode phase (generating tokens one at a time), since prefill is compute-bound and decode is memory-bound. Systems like vLLM and TensorRT-LLM handle this by separating or interleaving these phases, but the scheduling policy itself becomes a critical design decision.
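A minimal sketch of the iteration-level loop, with hypothetical request lengths — the `Request` class and FIFO admission policy here are illustrative simplifications, not vLLM's or Orca's actual scheduler (prefill, attention state, and memory limits are all elided):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    prompt_len: int        # prefill work, elided in this sketch
    max_new_tokens: int
    generated: int = 0

    def finished(self) -> bool:
        return self.generated >= self.max_new_tokens

def serve(waiting: deque, batch_size: int = 4) -> int:
    """Iteration-level batching: refill freed slots after every decode step."""
    active: list[Request] = []
    step = 0
    while active or waiting:
        # Admit new requests into any free slots before the next decode step.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        # One decode step advances every active request by one token.
        for req in active:
            req.generated += 1
        # Finished requests free their slots immediately, not at batch end.
        active = [r for r in active if not r.finished()]
        step += 1
    return step

lens = (20, 150, 400, 2000, 20, 150, 400, 2000)
reqs = deque(Request(prompt_len=32, max_new_tokens=n) for n in lens)
print(serve(reqs))  # → 2190 decode steps; two static batches of four take 4000
```

Because short requests release their slots the moment they finish, the queue drains in roughly half the steps a static scheduler would need on this workload.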
Takeaway: The unit of scheduling determines the unit of waste. Moving from request-level to iteration-level batching doesn't change the model — it changes how idle time accumulates, and that changes everything about cost.
PagedAttention: Treating GPU Memory Like an Operating System
Every transformer-based model maintains a key-value cache — a running record of attention states for each token generated so far. This KV cache grows linearly with sequence length and must be stored in GPU memory for every active request. For a 13-billion-parameter model serving a 2,048-token sequence, a single request's KV cache can consume more than 1.5 gigabytes. Multiply that by a batch of concurrent requests, and GPU memory becomes the binding constraint long before compute does.
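The arithmetic behind that figure, assuming typical dimensions for a 13B-parameter model (40 layers, hidden size 5120, fp16 — exact values vary by architecture):

```python
# Per-token KV cache = 2 (K and V) * layers * hidden_size * bytes_per_element
layers, hidden, dtype_bytes = 40, 5120, 2   # fp16
seq_len = 2048

per_token = 2 * layers * hidden * dtype_bytes    # bytes per token
per_request = per_token * seq_len / 1024**3      # GiB for one full sequence
print(f"{per_token / 1024:.0f} KiB/token, "
      f"{per_request:.2f} GiB per {seq_len}-token request")
# → 800 KiB/token, 1.56 GiB per 2048-token request
```

At roughly 0.8 MB per token, a dozen concurrent 2,048-token requests would demand nearly 20 GiB of cache alone — before counting the model weights.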
The naive approach allocates a contiguous block of memory for each request's maximum possible sequence length upfront. This is wasteful. A request that ends at 200 tokens still reserves memory for 2,048 tokens. Internal fragmentation — wasted space within allocated blocks — can consume 60–80% of available KV cache memory. External fragmentation compounds the problem when variable-length requests leave scattered gaps that can't be reused.
PagedAttention, introduced by the vLLM project, borrows a concept directly from operating system virtual memory management. Instead of allocating contiguous memory per request, it divides the KV cache into fixed-size pages (typically holding 16 tokens each). Pages are allocated on demand as tokens are generated and tracked via a page table. Non-contiguous physical pages can back a logically contiguous sequence, just as virtual memory maps scattered physical RAM into a linear address space for a process.
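A toy version of the page-table bookkeeping — the `PagedKVCache` class and its methods are illustrative, not vLLM's API; a real block manager also handles reference counting, copy-on-write, and the GPU-side attention kernels:

```python
BLOCK_SIZE = 16  # tokens per page, vLLM's typical default

class PagedKVCache:
    """Maps each sequence's logical token positions to physical pages on demand."""
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.page_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: int, token_index: int) -> tuple[int, int]:
        table = self.page_tables.setdefault(seq_id, [])
        if token_index % BLOCK_SIZE == 0:        # crossed a logical page boundary:
            table.append(self.free_blocks.pop())  # allocate one new physical page
        block = table[token_index // BLOCK_SIZE]
        return block, token_index % BLOCK_SIZE    # (physical page, offset in page)

    def free(self, seq_id: int) -> None:
        # All of a finished sequence's pages become reusable at once.
        self.free_blocks.extend(self.page_tables.pop(seq_id))

cache = PagedKVCache(num_physical_blocks=64)
for t in range(40):                  # a 40-token sequence needs only ceil(40/16) = 3 pages
    cache.append_token(seq_id=0, token_index=t)
print(cache.page_tables[0])          # the physical pages backing one logical sequence
```

Allocation happens one page at a time as tokens arrive, so a request that stops at 40 tokens holds 3 pages rather than a 2,048-token reservation — and prefix caching falls out naturally, since two page tables can point at the same physical blocks.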
The results are significant. PagedAttention reduces KV cache memory waste to under 4%, compared to 60–80% with contiguous allocation. This directly translates to serving more concurrent requests on the same hardware. It also enables advanced features like prefix caching — where shared system prompts across requests can reference the same physical pages — and efficient beam search, where candidate sequences share most of their KV cache and only diverge at recent tokens. Memory management, unglamorous as it sounds, became one of the highest-leverage optimizations in the inference stack.
Takeaway: When your scarce resource isn't compute but memory, the memory allocator becomes the most important piece of your system. The best ideas in AI infrastructure are often borrowed from decades-old operating systems research.
Latency vs. Throughput: Choosing Your Optimization Target
Every serving system faces a fundamental tension: do you optimize for how fast a single user gets a response, or how many users you can serve per second? These goals are not the same, and the architectural decisions that favor one often hurt the other. Choosing wrong doesn't just reduce performance — it misaligns your infrastructure with your product.
Latency-sensitive applications — real-time chatbots, coding assistants, interactive agents — need low time-to-first-token (TTFT) and consistent inter-token latency. This means smaller batch sizes, aggressive prefill scheduling, and sometimes dedicating GPUs to fewer concurrent requests. Throughput-oriented applications — batch document processing, offline summarization, evaluation pipelines — benefit from maximizing GPU saturation with large batches, even if individual request latency increases. The same model, on the same hardware, needs fundamentally different serving configurations for these two cases.
The trade-off gets quantitative fast. Increasing batch size from 1 to 32 might improve throughput by 10× while only increasing per-request latency by 2×. But pushing from 32 to 128 might add another 2× throughput at the cost of 5× latency. The curve is non-linear, and the optimal operating point depends entirely on your service-level objectives. Systems like Sarathi-Serve and DeepSpeed-FastGen introduce chunked prefill — breaking long prefill operations into smaller pieces interleaved with decode steps — specifically to prevent a single large prompt from spiking latency for all co-batched requests.
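Plugging the illustrative numbers above into code shows why the operating point matters — these are the hypothetical ratios from the text, not measurements:

```python
# (batch size, relative throughput, relative per-request latency),
# using the illustrative ratios: 1→32 gives 10x throughput at 2x latency,
# 32→128 doubles throughput again but multiplies latency by 5.
operating_points = [(1, 1.0, 1.0), (32, 10.0, 2.0), (128, 20.0, 10.0)]

for batch, tput, lat in operating_points:
    # Throughput per unit of user-visible latency: one crude way to score a
    # point against a latency-sensitive SLO.
    print(f"batch={batch:4d}  throughput={tput:5.1f}x  latency={lat:5.1f}x  "
          f"throughput/latency={tput / lat:.2f}")
```

Under this scoring, batch size 32 dominates both extremes — but a pure batch pipeline with no latency SLO would still prefer 128, which is exactly why the objective has to be chosen before the tuning starts.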
Model parallelism adds another dimension. Tensor parallelism across GPUs reduces per-request latency by splitting computation but introduces communication overhead that limits throughput scaling. Pipeline parallelism increases throughput by running different requests through different model stages simultaneously, but adds latency from pipeline bubbles. The right parallelism strategy — or combination of strategies — depends on whether your SLA is defined in milliseconds or requests per dollar. There is no universally optimal configuration; there is only the right configuration for your constraints.
Takeaway: Optimization without a clear objective function is just tinkering. Before tuning any serving parameter, define whether you're solving for the user who's waiting or the budget that's burning — because the answers diverge quickly.
The models get the headlines, but the serving systems determine whether anyone can actually afford to use them. Continuous batching, paged memory management, and principled latency-throughput calibration are the engineering layer that turns a research artifact into a product.
What's striking is how much of this work draws from classical systems engineering — schedulers, memory allocators, pipeline design. The AI inference stack is converging with decades of operating systems and distributed systems wisdom. The teams building the best serving infrastructure are often the ones with the deepest roots in traditional systems design.
As models continue to grow, these optimizations won't become less important — they'll become existential. The gap between a well-engineered serving system and a naive one isn't 10% efficiency. It's the difference between viable and bankrupt.