Modern processors execute instructions at rates that would have seemed implausible two decades ago, yet much of this throughput depends on a quiet act of prophecy. Every conditional branch—every if, every loop termination, every virtual dispatch—forces the CPU to guess what comes next, fetching and executing instructions before it knows whether they should run at all.
When the prediction holds, the pipeline stays full and the speculative work commits to architectural state. When it fails, dozens of in-flight instructions are squashed, the pipeline drains, and the front end refills from the correct target. The cost is not measured in nanoseconds but in lost opportunity: tens of cycles during which the execution units sit idle.
This article examines branch prediction as a performance phenomenon. We will trace the evolution of predictor microarchitectures from two-bit saturating counters to perceptron and TAGE-based designs, quantify the penalty of mispredictions on contemporary out-of-order cores, and survey the techniques—conditional moves, predication, bitwise selection, SIMD masking—that allow practitioners to eliminate unpredictable branches entirely. The objective is not to demonize branching, which remains essential to control flow, but to develop a working model of when prediction succeeds, when it fails catastrophically, and how to write code that respects the constraints of the underlying hardware.
Predictor Architectures: From Counters to Neural Networks
The earliest dynamic predictors were two-bit saturating counters indexed by the lower bits of the program counter. The state machine biases prediction toward the recent direction and requires two consecutive mispredictions to flip its output, providing hysteresis that handles loop boundaries gracefully. For workloads with strongly biased branches, accuracy approached 90 percent—remarkable for so little silicon.
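The counter's behavior is simple enough to sketch in a few lines. This is a minimal illustration, not any particular CPU's implementation; the struct name and the weakly-taken initial state are choices of this sketch.

```cpp
#include <cstdint>

// Two-bit saturating counter: states 0-1 predict not-taken, 2-3 predict
// taken. A single misprediction moves the state one step; only two
// consecutive mispredictions flip the predicted direction (hysteresis).
struct TwoBitCounter {
    uint8_t state = 2;  // start weakly taken

    bool predict() const { return state >= 2; }

    void update(bool taken) {
        if (taken && state < 3) ++state;          // saturate at 3
        else if (!taken && state > 0) --state;    // saturate at 0
    }
};
```

The hysteresis is what handles loop exits gracefully: the single not-taken outcome at the end of a loop moves a strongly-taken counter to weakly taken, so the branch is still predicted taken on the next loop entry.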
Two-level adaptive predictors, introduced by Yeh and Patt, captured correlations between branches by maintaining a global or per-branch history register. The history indexes a table of counters, allowing the predictor to distinguish patterns such as taken-taken-not-taken from not-taken-taken-not-taken. The gshare variant XORs the history with the program counter to reduce aliasing in the prediction table, yielding accuracies above 95 percent on SPEC workloads.
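The gshare indexing scheme can be sketched directly. Table size, history length, and the class name below are arbitrary choices of this illustration, not a specific hardware design.

```cpp
#include <array>
#include <cstdint>
#include <cstddef>

// gshare sketch: a global history register XORed with program-counter
// bits indexes a table of two-bit saturating counters. The XOR spreads
// different (branch, history) contexts across the table to reduce aliasing.
class GsharePredictor {
    static constexpr size_t kBits = 12;
    std::array<uint8_t, 1u << kBits> table{};  // counters start at 0 (not-taken)
    uint32_t history = 0;                      // global branch history

    size_t index(uint32_t pc) const {
        return (pc ^ history) & ((1u << kBits) - 1);
    }

public:
    bool predict(uint32_t pc) const { return table[index(pc)] >= 2; }

    void update(uint32_t pc, bool taken) {
        uint8_t& c = table[index(pc)];
        if (taken && c < 3) ++c;
        else if (!taken && c > 0) --c;
        history = ((history << 1) | (taken ? 1u : 0u)) & ((1u << kBits) - 1);
    }
};
```

Because the history participates in the index, a strictly alternating branch trains two separate counters, one per history context, and is predicted perfectly once both saturate, which a plain per-PC counter cannot do.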
Modern predictors are hybrids. TAGE (TAgged GEometric history length) predictors maintain multiple tables indexed by histories of geometrically increasing length, selecting the longest matching history that produces a confident prediction. This design exploits the observation that some branches require deep history while others are well-predicted by short patterns, and it dominates contemporary academic competitions.
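The longest-match selection at TAGE's core can be illustrated with a stripped-down lookup. Everything concrete here—the table count and sizes, the 4/8/16-bit history lengths, the hash and tag functions, and the omission of the allocation and update policy—is invented for the sketch, not taken from any shipping or published design.

```cpp
#include <array>
#include <cstdint>

// Toy TAGE-style lookup: three tagged tables indexed by hashes of
// geometrically longer history prefixes, over a base bimodal table.
// The prediction comes from the longest-history component whose tag
// matches; if no tag matches, the bimodal fallback decides.
struct ToyTage {
    struct Entry { uint16_t tag = 0; uint8_t ctr = 0; bool valid = false; };
    static constexpr int kLens[3] = {4, 8, 16};   // geometric history lengths

    std::array<std::array<Entry, 256>, 3> tables{};
    std::array<uint8_t, 256> base{};              // two-bit counters
    uint32_t history = 0;                         // global history register

    uint32_t fold(int len) const { return history & ((1u << len) - 1); }
    uint32_t idx(uint32_t pc, int t) const { return (pc ^ fold(kLens[t])) & 255; }
    uint16_t tag_of(uint32_t pc, int t) const {
        return ((pc >> 4) ^ (fold(kLens[t]) * 3)) & 0x3FF;
    }

    bool predict(uint32_t pc) const {
        for (int t = 2; t >= 0; --t) {            // longest history first
            const Entry& e = tables[t][idx(pc, t)];
            if (e.valid && e.tag == tag_of(pc, t))
                return e.ctr >= 2;
        }
        return base[pc & 255] >= 2;               // bimodal fallback
    }
};
```

The tags are what make the scheme work: a long-history table entry only speaks for a branch when its tag matches, so branches that need only short history never pay for aliasing in the deep tables.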
Perceptron predictors apply a linear classifier to the global history, summing weights associated with each historical bit. Their key advantage is that storage scales linearly with history length rather than exponentially, enabling histories of hundreds of bits. AMD's Zen cores deploy perceptron-based designs; Intel's recent microarchitectures combine TAGE-like and neural components in undisclosed proportions.
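The linear classifier itself is a short piece of code. The history length, threshold constant, and struct layout below are illustrative choices; real designs index a table of perceptrons by PC and use tuned thresholds.

```cpp
#include <array>
#include <cstdint>

// Perceptron predictor sketch: one signed weight per global-history bit
// plus a bias. Prediction is the sign of the dot product of weights with
// the history, encoded as +1 (taken) / -1 (not-taken). Storage grows
// linearly with history length -- the property the text highlights.
struct Perceptron {
    static constexpr int kHist = 16;
    static constexpr int kTheta = 2 * kHist + 14;  // training threshold heuristic
    std::array<int8_t, kHist + 1> w{};             // w[0] is the bias
    std::array<int8_t, kHist> hist{};              // +1 / -1 outcome history

    int sum() const {
        int s = w[0];
        for (int i = 0; i < kHist; ++i) s += w[i + 1] * hist[i];
        return s;
    }

    bool predict() const { return sum() >= 0; }

    void update(bool taken) {
        int t = taken ? 1 : -1;
        int s = sum();
        // Train only on a wrong answer or a low-confidence (sub-threshold) sum.
        if ((s >= 0) != taken || (s < kTheta && s > -kTheta)) {
            auto clamp = [](int v) {
                return (int8_t)(v > 127 ? 127 : (v < -128 ? -128 : v));
            };
            w[0] = clamp(w[0] + t);
            for (int i = 0; i < kHist; ++i)
                w[i + 1] = clamp(w[i + 1] + t * hist[i]);
        }
        for (int i = kHist - 1; i > 0; --i) hist[i] = hist[i - 1];
        hist[0] = (int8_t)t;
    }
};
```

Each weight learns how strongly its history bit correlates with the branch outcome, which is why hundreds of bits of history are affordable: the cost is one small counter per bit, not a table entry per history pattern.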
The frontier is no longer raw accuracy—top predictors exceed 97 percent on typical code—but the long tail of hard-to-predict branches: data-dependent comparisons, indirect calls in dynamic dispatch, and irregular pointer chasing. These residual mispredictions dominate the performance impact, since the easy branches are already handled.
Takeaway: Branch predictors are sophisticated learning systems, but their failures are heavy-tailed: a handful of unpredictable branches contribute disproportionately to your performance loss.
The Cost of Being Wrong
On a deeply pipelined out-of-order core, the misprediction penalty is bounded below by the depth of the front end and the latency of resolving the mispredicted branch in the back end. On Intel Skylake-derived microarchitectures this is approximately 16 to 20 cycles; on Apple's Firestorm and recent AMD Zen cores it ranges from 13 to 19. These figures translate to four to six nanoseconds of pure pipeline bubble per misprediction.
The architectural penalty understates the true cost. A misprediction also discards the speculative work performed downstream: prefetches issued, cache lines warmed, dependent computations partially completed. When the correct path is fetched, those side effects must be repeated, and the reorder buffer must refill before instruction-level parallelism resumes its full throughput.
Speculative execution hides some of this cost when mispredictions are rare and isolated. The out-of-order engine continues retiring older instructions while the front end recovers, and a single misprediction in an otherwise predictable region may cost less than its nominal penalty. Conversely, dense regions of unpredictable branches—a sorting network on random data, for instance—suffer compounding stalls.
A useful first-order model: if a branch mispredicts with probability p and the penalty is P cycles, the expected cost per execution is p · P. For a branch mispredicting 10 percent of the time on a 17-cycle pipeline, this is 1.7 cycles of overhead per execution, which can easily double the latency of a tight loop body.
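The model is small enough to encode directly; the function names here are of course this sketch's own, and the model ignores the second-order effects (discarded prefetches, ROB refill) discussed above.

```cpp
// First-order misprediction cost model from the text: a branch that
// mispredicts with probability p against a penalty of P cycles adds
// p * P cycles per execution on average.
double expected_penalty_cycles(double p, double P) {
    return p * P;
}

// Relative slowdown of a loop whose body takes body_cycles and contains
// one such branch: (body + p*P) / body.
double loop_slowdown(double body_cycles, double p, double P) {
    return (body_cycles + p * P) / body_cycles;
}
```

Running the text's numbers: a 10 percent misprediction rate on a 17-cycle pipeline gives 1.7 cycles of expected overhead, which turns a 5-cycle loop body into a 6.7-cycle one, a 34 percent slowdown.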
Profiling tools expose this directly. Linux's perf reports branch-misses and, on Intel hardware, br_misp_retired.all_branches; Intel VTune visualizes mispredictions per source line. The first step in any optimization effort targeting branch behavior is measurement: assumptions about which branches are problematic are routinely wrong.
Takeaway: The cost of a misprediction is not the cycle penalty alone but the disruption it inflicts on the speculation window that hides everything else.
Branchless Code: Computing Without Conditionals
When a branch resists prediction—typically because it depends on data with no exploitable structure—eliminating it can yield substantial speedups. The simplest tool is the conditional move: cmov on x86, csel on ARM. These instructions select between two register values based on a flag without altering control flow, converting a potential 17-cycle penalty into a fixed two-cycle data dependency.
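In source code, the conditional move is usually reached through an ordinary ternary whose arms are side-effect-free register values. Whether a given compiler actually emits cmov/csel depends on optimization level and its profitability heuristics, so inspecting the disassembly is the only guarantee; the function names below are illustrative.

```cpp
#include <cstdint>

// A select the compiler can lower to cmov (x86) or csel (ARM): both arms
// are already-computed values, so no control flow is required.
int64_t select_val(bool cond, int64_t a, int64_t b) {
    return cond ? a : b;
}

// The same shape applied to a clamp: data-dependent, but branch-free
// when lowered to a conditional move.
int64_t clamp_nonnegative(int64_t x) {
    return x < 0 ? 0 : x;
}
```

The trade is explicit here: the cmov form always pays its small fixed latency, whereas the branchy form is free when predicted and expensive when not.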
Compilers generate cmov opportunistically, but their heuristics are conservative. The transformation is unprofitable when the branch is well-predicted, since cmov always evaluates both inputs and serializes them through a dependency. The optimization is most effective when the branch is genuinely random and the inputs are cheap to compute.
Arithmetic and bitwise tricks eliminate branches by encoding decisions in data. min(a, b) can be expressed as b ^ ((a ^ b) & -(a < b)), exploiting the fact that negating the comparison's zero-or-one result yields an all-zeros or all-ones mask. Sign extraction, absolute value, and saturating arithmetic admit similar formulations. These idioms are valuable in inner loops where even a predicted branch competes for front-end bandwidth.
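The min idiom from the text, plus the companion absolute-value trick, written out with the usual integer-width caveats:

```cpp
#include <cstdint>

// Branchless min as given in the text: (a < b) is 0 or 1, negating it
// produces an all-zeros or all-ones mask, and the XORs select a or b.
int32_t branchless_min(int32_t a, int32_t b) {
    return b ^ ((a ^ b) & -(int32_t)(a < b));
}

// Branchless absolute value via the sign mask. x >> 31 is 0 for
// non-negative x and -1 for negative x (arithmetic shift; guaranteed
// only since C++20, but universal in practice). Like std::abs, this
// overflows for INT32_MIN.
int32_t branchless_abs(int32_t x) {
    int32_t m = x >> 31;
    return (x + m) ^ m;
}
```

Walking the min through once: if a < b the mask is all ones, so the expression collapses to b ^ (a ^ b) = a; otherwise the mask is zero and the result is b.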
SIMD takes branchlessness further by applying masked operations across vector lanes. A comparison produces a mask; a blend instruction selects between two vectors per lane. AVX-512 introduces explicit predicate registers, allowing per-lane masking at the instruction level. Algorithms once dominated by control flow—filtering, partitioning, sparse gather—admit data-parallel formulations that retire one element per lane per cycle.
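The compare-then-blend pattern is shown below in its per-lane form. It is written as a plain loop so it stays portable; with optimization enabled, compilers typically auto-vectorize this shape into a packed compare plus blend (on AVX2 the intrinsic spelling would use _mm256_cmpgt_epi32 and _mm256_blendv_epi8) or into predicated operations on AVX-512 and SVE. The function name and parameters are this sketch's own.

```cpp
#include <cstddef>

// Per-lane mask-and-blend, the core of SIMD branchlessness: the
// comparison plays the role of the mask register, the ternary the role
// of the blend. Every lane does the same work; the data decides.
void clamp_below(const int* in, int* out, size_t n, int floor_v) {
    for (size_t i = 0; i < n; ++i) {
        // mask = (in[i] > floor_v); out[i] = blend(floor_v, in[i], mask)
        out[i] = in[i] > floor_v ? in[i] : floor_v;
    }
}
```

The crucial property is the absence of any control-flow dependence on the data: no lane's outcome can cause a misprediction, because no lane's outcome steers the instruction stream.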
Branchless code is not universally superior. It increases instruction count, often computes both sides of a decision, and can serialize what would otherwise be predicted away for free. The discipline is to identify the branches whose elimination pays for itself—typically the unpredictable ones in hot loops—and leave the rest to the predictor.
Takeaway: Branchlessness is not a virtue but a tool; apply it where the predictor fails, and trust the hardware where it succeeds.
Branch prediction is the substrate on which modern instruction throughput rests. Without it, the pipelines that give contemporary CPUs their performance would stall on every conditional, reducing effective IPC by an order of magnitude. With it, well-structured code runs as if branches were free.
The performance engineer's task is to recognize the boundary between these regimes. Branches whose outcomes correlate with history, exhibit strong bias, or follow regular patterns will be predicted with near-perfect accuracy and need no intervention. Branches driven by random or adversarial data will mispredict consistently, and there the techniques of predication, bitwise selection, and SIMD masking pay genuine dividends.
The deeper principle is that high-performance code is written in dialogue with the microarchitecture. Understanding what the hardware predicts well, what it predicts poorly, and what it cannot predict at all is not optional knowledge for the systems engineer—it is the vocabulary in which performance is expressed.