Distributed Snapshots: Capturing Global State Without Stopping the World

red round fruit on white plastic container

6 min read

The Chandy-Lamport algorithm solves the problem of capturing global state in distributed systems without halting execution.

Consistent cuts—partitions of execution history that respect causal ordering—provide the formal foundation for correct snapshots.

The algorithm works by propagating markers through FIFO channels, ensuring no causally later message is recorded before the snapshot point.

Captured states are not necessarily actual states but states that could have occurred, which is sufficient for detecting stable properties.

Modern systems like Apache Flink use Chandy-Lamport variants for checkpointing, validating the algorithm's practical power forty years after its creation.

In a distributed system, there is no global clock. No omniscient observer. No single moment where you can pause everything and ask: what is the state of this system right now? Yet this question matters enormously—for debugging, for checkpointing, for detecting whether certain conditions hold across the entire system.

The naive solution is to halt all processes simultaneously, record their states, and resume. But true simultaneity is impossible in distributed systems, and stopping the world destroys the very availability these systems are designed to provide. We need a more subtle approach: capturing a consistent global state while the system continues to run.

The Chandy-Lamport algorithm, published in 1985, solved this problem with mathematical elegance. It showed that you don't need to stop the world—you need to capture a state that could have occurred during normal execution. This distinction between what actually happened and what could have happened is the theoretical foundation that makes distributed snapshots possible. Understanding this algorithm means understanding one of the deepest insights in distributed systems theory.

Algorithm Derivation: From Consistent Cuts to Marker Passing

The Chandy-Lamport algorithm emerges from a precise requirement: we must capture a consistent cut of system execution. A cut partitions the execution history of each process into past and future. The global state we record consists of each process's state at its cut point, plus the state of all communication channels.

But not all cuts are consistent. Consider process P sending a message to process Q. If our cut places Q's state after receiving the message, but P's state before sending it, we've captured an impossible configuration—a message received but never sent. A consistent cut requires that if a receive event is in the past, its corresponding send must also be in the past.

The algorithm achieves consistency through a simple mechanism: markers. When a process decides to record its state (or receives its first marker), it immediately records its local state and sends a marker on all outgoing channels. Upon receiving a marker on a channel for the first time, a process records that channel's state as all messages received since it recorded its own state.

The correctness proof is elegant. Markers propagate through the system, and their FIFO ordering on channels guarantees that no message sent after the snapshot state can be recorded before the marker arrives. The cut thus respects the causal ordering of send and receive events.

Formally, if we define the happened-before relation → where a → b means event a causally precedes event b, then the algorithm captures a cut C where: for all events e₁, e₂, if e₁ → e₂ and e₂ is in the past of C, then e₁ is in the past of C. This is precisely the consistency condition we require.

Takeaway
The Chandy-Lamport algorithm transforms an impossibility—simultaneous observation—into a possibility by relaxing the requirement from 'what is the state right now' to 'what is a state that could have occurred.'

Consistency Conditions: The Mathematics of Could-Have-Happened

The theoretical power of Chandy-Lamport lies in what the captured state represents. It is not necessarily a state the system was ever in during actual execution. Rather, it is a state the system could have been in—consistent with the observable behavior and causal dependencies.

This requires formal precision. Define a global state as a tuple (s₁, s₂, ..., sₙ, c₁₂, c₁₃, ...) where sᵢ is the state of process i and cᵢⱼ is the state of the channel from i to j. A global state is reachable if there exists some execution from the initial state that produces it.

A consistent cut captures a reachable global state. The proof proceeds by construction: take any actual execution and permute the events to create an equivalent execution that passes through the recorded state. Two executions are equivalent if they have the same events and respect the same causal ordering. Since the cut is consistent, this permutation always exists.

This has profound implications for stable property detection. A property is stable if, once true, it remains true forever. Examples include deadlock, termination, and loss of a token. If a stable property holds in the snapshot, it must hold in the current (unknown) state—because the snapshot represents a reachable past state, and stable properties cannot become false.

The lattice structure of consistent cuts provides additional insight. All consistent cuts form a distributive lattice under the subset ordering. The snapshot algorithm captures one point in this lattice, but the mathematical structure reveals that many consistent observations are possible—each representing a valid perspective on the execution.

Takeaway
A distributed snapshot doesn't capture 'the' state of the system—it captures 'a' state the system could have been in, which is sufficient for reasoning about stable properties and system invariants.

Practical Applications: From Theory to Production Systems

The theoretical elegance of Chandy-Lamport translates directly into practical value. Modern distributed systems use snapshot algorithms for three primary purposes: checkpointing for fault tolerance, debugging distributed executions, and detecting stable properties like termination or deadlock.

Apache Flink's checkpointing mechanism is perhaps the most prominent industrial application. Flink implements a variant called asynchronous barrier snapshotting. Barriers (markers) flow through the data stream, and operators snapshot their state when barriers arrive from all input channels. The FIFO property of streams guarantees consistency—exactly the theoretical requirement Chandy and Lamport identified.

However, the original algorithm assumes FIFO channels, which not all systems provide. Extensions address this: the Lai-Yang algorithm handles non-FIFO channels by piggybacking color bits on messages. The Mattern algorithm uses vector clocks to track causality without FIFO assumptions. Each extension trades off complexity for weaker assumptions while preserving the core invariant of consistent cuts.

Snapshot algorithms also underpin global predicate detection. Given a predicate over global state—such as 'the sum of all bank balances equals the initial total'—we can check it against snapshots. For unstable predicates, more sophisticated algorithms explore the lattice of consistent cuts, but stable predicates require only a single snapshot.

The performance cost is primarily the marker messages themselves—O(|E|) messages where E is the set of channels. In practice, this overhead is acceptable because snapshots are infrequent relative to normal processing. The algorithm is also non-blocking: processes continue execution immediately after recording their state, never waiting for the snapshot to complete globally.

Takeaway
The gap between theoretical algorithm and production system is smaller than it appears—Flink's checkpointing is Chandy-Lamport with engineering polish, proving that formally correct algorithms can anchor real-world reliability.

The Chandy-Lamport algorithm exemplifies the power of formal thinking in distributed systems. By precisely defining what 'consistent state' means and proving that markers achieve it, we obtain an algorithm that is both theoretically sound and practically deployable.

The key insight transcends the specific algorithm: in distributed systems, we often cannot observe what is true, but we can observe what could have been true. This shift from definite to possible states is not a weakness—it's the foundation that makes global reasoning tractable.

Forty years after its publication, Chandy-Lamport remains the theoretical anchor for distributed checkpointing. Its durability reminds us that foundational insights, rigorously established, outlast any particular technology. The mathematics of consistent cuts will remain relevant as long as we build systems where time cannot be synchronized and state cannot be observed atomically.