Concurrent programming presents a deceptively simple question: given a collection of asynchronous processes communicating through shared memory, which synchronization primitives are powerful enough to coordinate them? The answer, formalized by Maurice Herlihy in 1991, reshaped how we reason about concurrent computation.
Herlihy's consensus number hierarchy establishes a strict ranking among shared-memory primitives based on their ability to solve the consensus problem in a wait-free manner. The consensus problem requires n processes, each starting with a private input value, to agree on a single output value, with the guarantee that the chosen value was proposed by some participant. Wait-freedom demands that every process completes its operation in a finite number of steps, regardless of the speeds or failures of others.
What emerges from this framework is striking: synchronization primitives are not interchangeable conveniences. They form a discrete hierarchy of computational power, where lower-level primitives cannot simulate higher-level ones, no matter how cleverly composed. This impossibility result has profound implications for hardware design, language runtime construction, and the very architecture of lock-free data structures. Understanding the hierarchy is not merely an academic exercise; it determines what concurrent objects can be built correctly on a given machine.
The Consensus Hierarchy
The consensus number of an object X, denoted CN(X), is the maximum number of processes for which X, together with atomic read/write registers, can solve wait-free consensus. If no such maximum exists, CN(X) = ∞. This single invariant induces a complete classification of synchronization primitives.
At the base of the hierarchy lie atomic read/write registers, with consensus number 1. Fischer, Lynch, and Paterson's impossibility result, adapted to the shared-memory model, establishes that registers alone cannot solve consensus even for two processes. The proof proceeds by a bivalence argument: any purported algorithm admits an execution where a single process's step transitions the system between configurations consistent with both possible decisions, and an adversarial scheduler can exploit this to prevent termination.
The next level contains primitives with consensus number 2, including test-and-set, fetch-and-add, swap, and stacks and queues. These objects can coordinate two processes to agree but fail for three. The proof exhibits a critical state in which three processes are each poised to perform an operation, and case analysis on which process moves first reveals that two processes cannot distinguish the resulting configurations, contradicting the consensus requirement.
Higher levels are populated by more powerful primitives. The k-element atomic assignment, which writes simultaneously to k distinct registers, has consensus number 2k − 2. Atomic n-process buffer objects achieve consensus number n. The hierarchy is strict: each level contains objects that cannot be implemented from any combination of lower-level objects.
This stratification reveals a fundamental asymmetry. Implementation flows downward—any object can be built from sufficiently powerful primitives—but not upward. A queue, despite its apparent richness, cannot solve three-process consensus, and therefore cannot implement any object with consensus number 3 or higher.
TakeawayComputational power in concurrent systems is quantized, not continuous. The primitives your hardware provides impose an absolute ceiling on which wait-free objects you can construct, regardless of algorithmic ingenuity.
Compare-and-Swap and Universal Construction
At the apex of the hierarchy resides compare-and-swap (CAS), along with its close relative load-linked/store-conditional (LL/SC). These primitives have consensus number ∞: they solve wait-free consensus for any number of processes. The proof is constructive and surprisingly direct.
Consider n processes, each with a proposed value. A single CAS register, initialized to a sentinel ⊥, suffices. Each process executes CAS(⊥, my_value); the first to succeed installs its value, and all subsequent attempts fail. Every process then reads the register and adopts the installed value as the consensus decision. Termination is wait-free because each process performs a bounded number of steps; agreement follows because all processes read the same register; validity holds because the installed value was proposed by some participant.
Herlihy's deeper result is that any object with consensus number ∞ is universal: it can implement any sequentially specified concurrent object in a wait-free manner. The universal construction maintains a linked list of operations representing the object's history. Each process prepares its operation, then competes via CAS to append it to the shared list. Once appended, every process can apply the sequence to a local replica to compute its response.
This universality theorem is the theoretical justification for the centrality of CAS in modern systems. Lock-free queues, hash tables, skip lists, and software transactional memory all ultimately reduce to CAS-based coordination. The primitive is not merely convenient; it is provably sufficient for the entire universe of wait-free concurrent objects.
The construction comes with caveats. Naive universal constructions are inefficient, requiring each process to potentially replay long operation histories. Practical lock-free algorithms exploit object-specific structure to avoid this cost. Yet the existence proof remains foundational: it establishes that the question of what can be implemented is settled, leaving only the question of how efficiently.
TakeawayCompare-and-swap is not just useful—it is computationally complete for wait-free concurrency. Any synchronization problem you can specify, CAS can solve in principle.
From Theory to Silicon
The consensus hierarchy is not a purely theoretical construct. It directly shapes the design of processor instruction sets and the algorithms that can run on them. Hardware architects, whether explicitly aware of the theory or not, must provide primitives of sufficient consensus power to support the concurrent abstractions their platforms aim to enable.
Early multiprocessor architectures often shipped with test-and-set as their sole atomic primitive. Such systems can implement mutual exclusion but cannot construct wait-free objects beyond two processes. This limitation manifested practically: lock-free data structures of meaningful generality were impossible without elaborate workarounds, and any contention required falling back to blocking synchronization.
Modern instruction set architectures—x86, ARM, RISC-V, Power—all provide primitives with infinite consensus number. The x86 architecture exposes CMPXCHG and its variants. ARM and Power use LL/SC pairs (load-exclusive/store-exclusive on ARM, lwarx/stwcx on Power). RISC-V offers both LR/SC and atomic memory operations including CAS. This convergence is not coincidental: it reflects the theoretical necessity of universality for general-purpose concurrent computing.
Subtle distinctions remain. LL/SC offers theoretical advantages over CAS, sidestepping the ABA problem—the pathology where a memory location returns to a previously observed value, deceiving CAS into believing nothing has changed. However, hardware LL/SC implementations typically impose restrictions on the operations permitted between LL and SC, weakening this theoretical advantage in practice. Engineers must reason carefully about which specific guarantees their target architecture actually provides.
The hierarchy also illuminates why certain emerging primitives, such as hardware transactional memory, are theoretically attractive. They provide consensus number ∞ with composable semantics, potentially simplifying the construction of complex concurrent objects that CAS-based algorithms accomplish only through intricate manual reasoning.
TakeawayHardware design and concurrent algorithm design are bound by the same theoretical constraints. The atomic instructions in your processor manual encode decades of consensus theory in silicon.
The consensus number hierarchy reveals a remarkable structural truth: synchronization is not a matter of taste or convenience but of mathematical capability. Each primitive occupies a definite position in a strict ordering, and that position determines what concurrent objects it can construct.
This framework converts vague intuitions about concurrent programming into precise theorems. We no longer ask whether a primitive feels powerful enough; we compute its consensus number and consult the hierarchy. The questions of possible and impossible become decidable, leaving engineering creativity to operate within well-defined boundaries.
For system architects, the lesson is foundational. Designing a concurrent system without understanding the consensus power of its primitives is designing without knowing what one is permitted to build. The hierarchy is not an obstacle to circumvent but a map of the territory itself.