Raft Was Not Designed for Understandability: A Formal Analysis

red round fruit on white plastic container

8 min read

Raft's claim of being designed for understandability does not hold up under formal analysis of its TLA+ specification compared to Multi-Paxos.

The formal specifications reveal that Raft's state space and proof obligations are comparable to or larger than those of Multi-Paxos, with complexity relocated rather than reduced.

Raft's strong leader requirement eliminates concurrent reasoning but introduces formal constraints on availability and subtle correctness dependencies in leader election.

Production-critical mechanisms like membership changes and snapshotting carry significant hidden complexity that falls outside Raft's formally verified core specification.

The total formal complexity of a distributed protocol includes both its core specification and all extensions required for correct deployment—a measure by which Raft is not fundamentally simpler than Paxos.

Raft's original paper opens with a bold claim: the protocol was designed primarily for understandability. Diego Ongaro and John Ousterhout positioned it explicitly as a reaction to Paxos, arguing that the field needed a consensus algorithm people could actually reason about. The user studies in the paper appeared to validate this. Students found Raft easier to learn. The narrative stuck. Raft became the understandable consensus protocol.

But understandability is not a formal property. It cannot be verified, model-checked, or proved. When we examine Raft through the lens of formal specification—comparing its TLA+ models against Multi-Paxos, analyzing its state space, and tracing its correctness invariants—a different picture emerges. Raft does not reduce the fundamental complexity of consensus. It relocates it. The protocol makes certain aspects more intuitive by imposing structural constraints that introduce their own subtle, often overlooked complexity.

This is not a critique of Raft's engineering utility. Raft has proven enormously valuable in practice. But from a theoretical standpoint, the claim of simplification deserves rigorous scrutiny. What Raft actually achieves is a different set of complexity trade-offs—ones that privilege sequential reasoning over concurrent reasoning, strong leadership over flexible quorum structures, and implementation clarity over specification minimality. Understanding where complexity moves, rather than disappears, is essential for architects building mission-critical systems on Raft-based foundations.

Formal Comparison: Complexity Relocated, Not Reduced

The most direct way to compare Raft and Multi-Paxos is through their formal specifications. Lamport's TLA+ specification of Paxos is remarkably compact. The core safety property—agreement on a single value per slot—emerges from a small set of invariants over proposer ballot numbers and acceptor promises. Multi-Paxos extends this by running multiple Paxos instances, one per log slot, with an optimization that allows a stable leader to skip Phase 1 for subsequent slots.

Raft's TLA+ specification, written by Ongaro himself, tells a revealing story. The specification is larger than the corresponding Multi-Paxos specification. It carries more state variables, more action predicates, and more auxiliary invariants required to establish correctness. The log matching property—Raft's central structural invariant—requires a global induction argument across all servers and all log positions. This is not simpler in any formal sense. It is differently structured.

Where Paxos achieves safety through local constraints on ballot numbers and quorum intersection, Raft achieves safety through a global structural property: if two logs agree at a given index and term, they agree on all preceding entries. This is an elegant invariant, but maintaining it formally requires proving that leader append-only semantics, commitment rules, and election restrictions collectively preserve it. The proof obligation is substantial.

The real distinction is not complexity reduction but complexity partitioning. Paxos concentrates complexity in understanding how concurrent proposers interact through quorum overlap. Raft eliminates this concurrent reasoning by funneling all decisions through a single leader, but then must handle the complexity of leader transitions, log reconciliation during leadership changes, and the restriction that only leaders with up-to-date logs may be elected. One requires reasoning about concurrency; the other requires reasoning about sequential invariant preservation across state transitions.

Model checking confirms this equivalence in difficulty. When Raft and Multi-Paxos specifications are subjected to exhaustive state space exploration with tools like TLC, the state spaces are comparable in size for equivalent system configurations. Raft's state space is often larger due to the additional structure it maintains—term numbers, leader state, match indices—all of which expand the reachable state graph. The protocol is not simpler to verify. It is simpler to narrate.

Takeaway
Formal simplicity and narrative simplicity are different properties. A protocol that is easier to explain in a lecture may carry equal or greater proof obligations in its specification. Always distinguish between pedagogical clarity and specification minimality.

Leader Election: Strong Leadership as a Formal Constraint

Raft's strong leader requirement is its most consequential design decision. All client requests flow through the leader. All log replication originates from the leader. The leader's log is, by construction, authoritative. This eliminates the need to reason about concurrent proposals—a genuine cognitive simplification. But it introduces a formal constraint with deep implications: the system's availability is bounded by leader election latency and the properties of the election mechanism itself.

Formally, Raft's leader election must satisfy two properties simultaneously. First, election safety: at most one leader may be elected per term. Second, the leader completeness property: any elected leader must contain all committed entries from previous terms. The second property is enforced by Raft's voting restriction—a server denies its vote to any candidate whose log is less up-to-date. This restriction is the linchpin of Raft's correctness, and it is where significant formal subtlety hides.

The up-to-date comparison is lexicographic: last log term first, then log length. This seems straightforward, but its interaction with uncommitted entries creates edge cases that are formally delicate. A server may have entries from a deposed leader that were never committed. These entries must be overwritten during log reconciliation, but whether they are overwritten depends on which candidate wins the next election, which depends on the voting restriction, which depends on log state including these very entries. The formal argument that this circular dependency resolves correctly requires careful induction on term numbers.

The availability implications are equally significant. FLP impossibility tells us no deterministic consensus protocol can guarantee progress in an asynchronous system with even one crash failure. Both Raft and Paxos are subject to this bound. But Raft's strong leader requirement means that any period without an agreed-upon leader is a period of zero throughput. Multi-Paxos with flexible quorums can, in principle, make progress with concurrent proposers during leader instability. Raft cannot. The protocol trades availability headroom for structural simplicity.

Furthermore, Raft's election mechanism relies on randomized timeouts to break symmetry and prevent split votes. This is effective in practice but formally unsatisfying—the protocol's liveness argument reduces to a probabilistic claim about timeout distributions. Lamport's Paxos, by contrast, separates the liveness mechanism entirely from the safety proof, making the formal structure cleaner even if the implementation is less prescriptive. Raft bundles safety and liveness concerns together, which aids implementation but complicates formal reasoning about each property in isolation.

Takeaway
Strong leadership converts the problem of reasoning about concurrency into the problem of reasoning about transitions of authority. This is not simplification—it is a transformation that trades one class of formal difficulty for another, with measurable consequences for availability.

Hidden Complexity: Membership Changes and Snapshotting

The original Raft paper introduced joint consensus for membership changes—a mechanism where the cluster transitions through an intermediate configuration containing both old and new members. This was later simplified to single-server changes in practice, but the formal story is revealing. The joint consensus approach requires proving that safety holds across three overlapping quorum systems: old configuration, joint configuration, and new configuration. The proof that committed entries cannot be lost during this transition is non-trivial and involves subtle arguments about quorum intersection across configuration boundaries.

Single-server membership changes appear simpler but introduce their own formal pitfall. The mechanism is safe only if configuration changes are applied one at a time, with each change committed before the next is proposed. If this serialization is violated—something that is not enforced by the core Raft protocol but must be ensured by the implementation—safety can be violated. A formal analysis by Ongaro identified a specific bug where adding and removing servers concurrently could lead to disjoint majorities. This is precisely the kind of hidden complexity that a claim of understandability obscures.

Snapshotting presents analogous challenges. Raft's log cannot grow without bound, so implementations must periodically snapshot application state and truncate the log. But snapshotting interacts with nearly every other component of the protocol. A follower that falls far behind must receive a snapshot from the leader via an InstallSnapshot RPC. This mechanism is specified in the paper but its formal interaction with log matching, commitment, and leader completeness is not fully specified. Implementations must independently verify that snapshot installation preserves the log matching invariant.

The formal specification of Raft in TLA+ does not include snapshotting. This is a significant omission. Any production Raft implementation must include it, meaning that a substantial portion of the system's correctness argument falls outside the verified specification. Practitioners are left to reason about snapshot correctness informally—exactly the situation Raft was supposed to prevent. The gap between the formally specified core and the practically necessary extensions is wider in Raft than many engineers realize.

These hidden complexities illustrate a general principle in distributed systems design: a protocol's true complexity is determined by the union of its core specification and all mechanisms required for production deployment. Raft's core is cleanly specified, but the production surface area—membership changes, snapshotting, read-only queries with linearizability, pre-vote extensions to prevent disruption from partitioned nodes—adds substantial formal obligations that are not covered by the original understandability claim.

Takeaway
The formal complexity of a distributed protocol is not determined by its core specification alone but by everything required to deploy it correctly. Raft's hidden complexity in membership changes and snapshotting means that production implementations carry proof obligations far beyond what the original paper addresses.

Raft is an excellent protocol. Its engineering contributions are real, and its influence on the field has been overwhelmingly positive. But the claim that it was designed for understandability in any formal sense does not withstand rigorous scrutiny. What Raft achieves is a reorganization of consensus complexity into a form that aligns with sequential, leader-centric reasoning—a form that many engineers find more natural.

The formal analysis reveals that complexity is conserved. It moves from concurrent quorum reasoning to sequential invariant maintenance, from specification minimality to structural constraint management, from core protocol to production extensions. Recognizing this is not pedantic—it is essential for architects who must reason about the correctness boundaries of their systems.

The lesson extends beyond Raft: when evaluating any system design for simplicity, ask not whether it feels simpler, but whether its total formal obligations—core and peripheral—are genuinely reduced. Understandability and simplicity are not synonyms.