Large language models generate text one token at a time. Each token requires a full forward pass through billions of parameters. This sequential dependency creates a fundamental latency problem that parallelism alone cannot solve.
Speculative decoding offers an elegant solution: use a small, fast model to draft multiple tokens, then verify them all at once with the large model. When the draft is good, you've generated several tokens for the cost of one large-model pass. When it's wrong, you fall back gracefully.
The technique is mathematically lossless—the output distribution remains identical to standard autoregressive generation. You're not trading quality for speed. You're exploiting the fact that verification is cheaper than generation when you can batch multiple candidates together.
Autoregressive Latency Problem
Modern LLMs generate tokens sequentially because each new token depends on all previous ones. The model must complete a full forward pass—attention across the entire context, feedforward computations through every layer—before producing a single output token.
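To make the dependency concrete, here is a minimal sketch of that loop in Python; `model` and `sample` are hypothetical stand-ins for a forward pass returning next-token logits and a sampling rule, not any particular library's API.

```python
# Minimal sketch of autoregressive generation. `model` and `sample` are
# hypothetical stand-ins: model(tokens) returns next-token logits, and
# sample(logits) picks a single token id from them.
def generate(model, sample, prompt_tokens, num_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(num_new_tokens):
        logits = model(tokens)       # one full forward pass per token
        next_token = sample(logits)  # depends on every token before it
        tokens.append(next_token)    # the next step cannot start early
    return tokens
```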
This creates latency that scales with sequence length, not batch size. You can process more sequences in parallel by adding GPUs, but you cannot make a single sequence generate faster through parallelism alone. The dependency chain is the bottleneck.
Memory bandwidth compounds the problem. Large models spend most of their time moving weights from memory to compute units. A 70B-parameter model must stream roughly 140 GB of weights from memory on every forward pass at 16-bit precision. Even with high-bandwidth memory, this memory wall dominates generation time.
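A back-of-the-envelope calculation shows how tightly this caps single-sequence decoding. The figures below are illustrative assumptions (a 70B model at 16-bit precision, roughly 3 TB/s of memory bandwidth), not measurements of any specific system.

```python
# Back-of-the-envelope decode bound: every generated token streams all weights.
weights_bytes = 70e9 * 2           # 70B parameters at 2 bytes each (fp16/bf16)
memory_bandwidth = 3e12            # assumed ~3 TB/s of HBM bandwidth
seconds_per_token = weights_bytes / memory_bandwidth
print(f"{seconds_per_token * 1e3:.1f} ms/token, "
      f"~{1 / seconds_per_token:.0f} tokens/sec at best")
# ~46.7 ms/token, roughly 21 tokens/sec before any compute or overhead.
```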
The key insight is that most of this computational work produces relatively predictable outputs. Common words, grammatical structures, and obvious continuations don't require billions of parameters to predict correctly. Yet we pay the full computational cost for every token, whether it's surprising or mundane.
Takeaway: Sequential dependencies create latency floors that raw compute cannot break. The only path to faster generation is reducing the number of expensive operations per token generated.
Draft Model Strategy
Speculative decoding introduces a two-stage process. A small draft model—perhaps 1-2 billion parameters—generates K candidate tokens quickly. The large target model then processes all K tokens in a single parallel forward pass.
The verification step computes the target model's probability distribution at each drafted position. Each draft token is accepted with probability min(1, p_target / p_draft); under greedy decoding this reduces to checking whether the draft matches the target's top choice. On rejection, a replacement token is sampled from the target's residual distribution, and the remaining draft tokens are discarded.
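A sketch of that accept-or-resample rule for a single position, following the standard speculative sampling construction; the function name and array-based interface are illustrative.

```python
import numpy as np

def verify_token(p_target, q_draft, draft_token, rng):
    """Accept or replace a single drafted token.

    p_target and q_draft are the target and draft models' probability
    vectors (NumPy arrays over the vocabulary) at this position.
    Returns (token, accepted).
    """
    # Accept the draft with probability min(1, p_target / q_draft).
    accept_prob = min(1.0, p_target[draft_token] / max(q_draft[draft_token], 1e-12))
    if rng.random() < accept_prob:
        return draft_token, True
    # On rejection, sample from the residual: (p_target - q_draft) clipped
    # at zero and renormalized, which preserves the target distribution.
    residual = np.clip(p_target - q_draft, 0.0, None)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual), False
```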
This works because transformer attention allows parallel processing of known sequences. The target model can score positions 1 through K simultaneously—it just can't generate them in parallel because each generation requires knowing the previous token.
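Putting the stages together, here is a sketch of one speculation round. The interfaces are assumptions: `draft_model(seq)` returns the next-token distribution, `target_model(seq)` returns distributions for the last K + 1 positions in one pass, and `verify_token` is the acceptance rule sketched above.

```python
def speculative_step(target_model, draft_model, tokens, k, rng):
    """One speculation round: draft k tokens, verify with one target pass."""
    # Stage 1: the small model drafts k tokens sequentially (cheap passes).
    draft_tokens, draft_probs = [], []
    for _ in range(k):
        q = draft_model(tokens + draft_tokens)   # next-token distribution
        draft_tokens.append(int(rng.choice(len(q), p=q)))
        draft_probs.append(q)

    # Stage 2: a single target pass scores every drafted position at once,
    # returning k + 1 distributions (one per draft plus the position after).
    target_probs = target_model(tokens + draft_tokens)

    accepted = []
    for p, q, t in zip(target_probs, draft_probs, draft_tokens):
        token, ok = verify_token(p, q, t, rng)   # rule from the sketch above
        accepted.append(int(token))
        if not ok:                               # first rejection ends the round
            return tokens + accepted
    # Every draft accepted: take a bonus token from the target's distribution
    # at the position after the last draft.
    bonus = int(rng.choice(len(target_probs[k]), p=target_probs[k]))
    return tokens + accepted + [bonus]
```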
Draft models can be purpose-built or derived from the target through techniques like layer pruning or distillation. The ideal draft model maximizes acceptance rate while minimizing latency overhead. A model that's too large wastes time drafting; one that's too small produces rejected tokens.
Takeaway: Verification is inherently more parallelizable than generation. Speculative decoding exploits this asymmetry by shifting work from sequential generation to parallel verification.
Acceptance Rate Dynamics
Speedup depends critically on how often the target model accepts draft tokens. If the draft model matches the target's distribution perfectly, every draft is accepted and each target pass yields K + 1 tokens: the K accepted drafts plus one sampled from the target's own distribution at the position after them. If the distributions diverge, rejections come early and the gains evaporate.
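Writing α for the per-token acceptance rate and assuming, as the standard analysis does, that acceptances are independent, the expected number of tokens produced per target pass is (1 - α^(K+1)) / (1 - α). A small helper makes the trade-off concrete:

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens per target forward pass, assuming i.i.d. acceptance."""
    if alpha >= 1.0:
        return k + 1.0
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Illustrative numbers: at alpha = 0.8 with k = 4 drafted tokens,
# each expensive target pass yields about 3.36 tokens on average.
print(expected_tokens_per_pass(0.8, 4))
```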
The accept-or-resample rule is constructed so that the output distribution is exactly the target's. When a draft token is rejected, the replacement is drawn from the normalized positive difference between the target and draft distributions, which makes the final output statistically identical to pure target-model generation.
Temperature and sampling strategies affect acceptance rates dramatically. Greedy decoding (temperature near zero) produces higher acceptance because both models converge toward the same high-probability tokens. High-temperature sampling creates divergence, reducing effective speedup.
Optimal K depends on acceptance rates and relative model speeds. Longer speculation amortizes draft overhead but risks more wasted verification work. Adaptive schemes adjust K dynamically based on recent acceptance history, maximizing throughput across varying content difficulty.
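One simple version of that idea, sketched here as an illustrative heuristic rather than any particular system's scheme: track an exponential moving average of recent acceptance and nudge K up or down between rounds.

```python
class AdaptiveK:
    """Illustrative heuristic: grow K while drafts keep landing, shrink otherwise."""

    def __init__(self, k=4, k_min=1, k_max=8, smoothing=0.9):
        self.k, self.k_min, self.k_max = k, k_min, k_max
        self.smoothing = smoothing
        self.acceptance_ema = 0.5    # start from a neutral estimate

    def update(self, accepted, drafted):
        # Exponential moving average of the per-round acceptance rate.
        rate = accepted / max(drafted, 1)
        self.acceptance_ema = (self.smoothing * self.acceptance_ema
                               + (1 - self.smoothing) * rate)
        if self.acceptance_ema > 0.8 and self.k < self.k_max:
            self.k += 1              # drafts are cheap and mostly accepted
        elif self.acceptance_ema < 0.4 and self.k > self.k_min:
            self.k -= 1              # too much verification work is wasted
        return self.k
```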
Takeaway: Acceptance rate determines whether speculative decoding helps or hurts. The technique amplifies efficiency when predictions are easy and degrades gracefully when they're hard.
Speculative decoding represents a shift in how we think about LLM optimization. Instead of making individual operations faster, it reduces how many expensive operations we need.
The technique is particularly powerful for latency-sensitive applications—interactive chat, real-time translation, code completion—where users wait for each response. Batch throughput applications benefit less since they can already saturate compute.
As target models grow larger and the cost gap between them and capable draft models widens, speculative decoding becomes increasingly attractive. It's a rare optimization that costs nothing in output quality while delivering substantial real-world speedups.