In distributed systems, the fundamental tension between availability and consistency has driven decades of engineering trade-offs. Traditional approaches rely on coordination—consensus protocols, distributed locks, serializable transactions—to maintain a single coherent view of shared state. But coordination has a cost measured in latency, availability loss during partitions, and operational complexity that scales poorly. Conflict-free replicated data types, or CRDTs, take a radically different path: they make coordination unnecessary by construction.
The core insight behind CRDTs is not merely practical but algebraic. By constraining the structure of data and the operations upon it to satisfy specific mathematical properties, we guarantee that replicas converge to identical states regardless of message ordering, duplication, or temporary network partitions. There is no conflict resolution logic because conflicts are structurally impossible. The mathematics does the work that protocols traditionally do.
This article examines the internal machinery that makes CRDTs function. We begin with the semilattice algebra that underpins convergence guarantees, then analyze the space and merge complexity trade-offs in counter implementations, and finally confront the most persistent engineering challenge in set CRDTs: tombstone accumulation and the techniques required to reclaim space without violating correctness. These are not introductory concepts—they are the details that determine whether a CRDT deployment succeeds or collapses under its own metadata weight at scale.
Semilattice Mathematics
A CRDT's convergence guarantee rests on a single algebraic structure: the join semilattice. Formally, a join semilattice is a set S equipped with a binary join operation ⊔ that is commutative (a ⊔ b = b ⊔ a), associative ((a ⊔ b) ⊔ c = a ⊔ (b ⊔ c)), and idempotent (a ⊔ a = a). These three properties are not arbitrary—they correspond precisely to the failure modes of asynchronous networks. Commutativity absorbs message reordering. Associativity absorbs arbitrary grouping of concurrent updates. Idempotence absorbs message duplication.
The join operation induces a partial order on states: we say a ≤ b if and only if a ⊔ b = b. This partial order captures the notion of information monotonicity—states only move forward in the lattice, never backward. When a replica receives a remote state and computes the join with its local state, the result is the least upper bound of both. This is the mathematical core of strong eventual consistency: any two replicas that have received the same set of updates—in any order, with any duplication—will compute the same least upper bound and therefore hold identical states.
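These algebraic laws are concrete enough to check mechanically. The sketch below uses set union as the join of a toy semilattice and verifies the three laws, the induced partial order, and order-independent convergence on specific states; the choice of set union is illustrative, not prescriptive.

```python
# A minimal sketch: set union as a join semilattice, with the three laws
# and the induced partial order checked on concrete states.

def join(a: frozenset, b: frozenset) -> frozenset:
    """Lattice join: least upper bound of two states (here, set union)."""
    return a | b

def leq(a: frozenset, b: frozenset) -> bool:
    """Induced partial order: a <= b iff join(a, b) == b."""
    return join(a, b) == b

a, b, c = frozenset({1}), frozenset({2}), frozenset({1, 3})

# Commutativity: absorbs message reordering.
assert join(a, b) == join(b, a)
# Associativity: absorbs arbitrary grouping of concurrent updates.
assert join(join(a, b), c) == join(a, join(b, c))
# Idempotence: absorbs message duplication.
assert join(a, a) == a

# Monotonicity: every state is below any join that includes it.
assert leq(a, join(a, b))

# Convergence: same updates, any order, with duplication -> same upper bound.
r1 = join(join(a, b), join(b, c))   # replica 1's delivery order
r2 = join(c, join(a, join(b, b)))   # replica 2's order, with a duplicate
assert r1 == r2
```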
State-based CRDTs (CvRDTs) exploit this directly. Each replica maintains a local state drawn from the semilattice, applies mutations locally that monotonically advance the state, and periodically ships its full state to other replicas. The merge function is simply the lattice join. Convergence follows immediately from the algebraic properties—no vector clocks, no causal delivery, no ordering assumptions beyond eventual delivery of at least one copy of each state.
Operation-based CRDTs (CmRDTs) take a dual approach. Instead of shipping states, they ship operations. Here the requirement shifts: the operations themselves must commute when applied concurrently. The underlying delivery layer must guarantee causal delivery (operations from the same source arrive in order) and exactly-once delivery, or the operations must be designed to be idempotent. This trades the bandwidth cost of full state transfer for stricter delivery requirements—a trade-off that matters significantly at scale.
The choice between CvRDT and CmRDT is not merely stylistic. For data types with large states and small updates—think a document with thousands of elements where a single character changes—operation-based designs are dramatically more bandwidth-efficient. For data types with compact states or unreliable delivery infrastructure, state-based designs are simpler and more robust. The semilattice abstraction unifies both: whether you merge states or replay operations, the algebraic guarantee of convergence is the same. The lattice is the invariant; the dissemination strategy is the variable.
Takeaway: Convergence in CRDTs is not an emergent property of clever protocols—it is a direct consequence of algebraic structure. If your merge function forms a join semilattice, convergence is a theorem, not a hope.
Counter Implementations
The simplest nontrivial CRDT is the G-Counter (grow-only counter). Each of n replicas maintains a vector of n non-negative integers. Replica i increments only its own entry. The counter's value is the sum of all entries. The merge function takes the component-wise maximum of two vectors. This satisfies the semilattice properties trivially: max is commutative, associative, and idempotent. The space complexity is O(n) per counter, and merge is O(n)—both linear in the number of replicas.
To support decrements, the PN-Counter pairs two G-Counters: one for increments (P) and one for decrements (N). The counter's value is sum(P) − sum(N). Merge applies component-wise maximum independently to each vector. This doubles the space to 2n entries per counter and doubles the merge cost, but it cleanly separates the two monotonically increasing quantities. One consequence is that the counter's value can go negative—there is no built-in mechanism to enforce a non-negative invariant without coordination, which is a fundamental limitation of purely coordination-free designs.
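The pairing is mechanical, as this sketch under the same fixed-replica-set assumption shows. Note how the final assertion exercises the negative-value consequence described above.

```python
# A minimal PN-Counter sketch: two grow-only vectors, merged independently.

class PNCounter:
    def __init__(self, n: int, replica_id: int):
        self.rid = replica_id
        self.p = [0] * n    # increment vector (a G-Counter)
        self.n = [0] * n    # decrement vector (another G-Counter)

    def increment(self): self.p[self.rid] += 1
    def decrement(self): self.n[self.rid] += 1

    def value(self) -> int:
        return sum(self.p) - sum(self.n)

    def merge(self, other: "PNCounter"):
        # Component-wise max on each vector separately.
        self.p = [max(a, b) for a, b in zip(self.p, other.p)]
        self.n = [max(a, b) for a, b in zip(self.n, other.n)]

r0, r1 = PNCounter(2, 0), PNCounter(2, 1)
r0.increment()
r1.decrement(); r1.decrement()
r0.merge(r1); r1.merge(r0)
assert r0.value() == r1.value() == -1   # nothing prevents negative values
```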
The O(n) space cost of these vector-based counters becomes problematic in systems with large or dynamic replica sets. Consider a mobile application with millions of devices each acting as a replica—storing a million-entry vector per counter is untenable. Dotted version vectors and server-mediated architectures address this by reducing the effective replica count. In practice, client devices communicate through a bounded set of server nodes, each maintaining its own vector entry. The counter's logical accuracy is preserved while the physical vector size is bounded by the server count, not the client count.
State-based counters pay a bandwidth cost proportional to O(n) per synchronization, shipping the full vector. Delta-state CRDTs optimize this by transmitting only the portion of state that changed since the last synchronization—the delta. For a G-Counter where only one entry changed, the delta is O(1). Delta-state dissemination requires tracking which deltas each peer has received, introducing a modest bookkeeping cost, but it reduces steady-state bandwidth by orders of magnitude in systems with frequent small updates.
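The delta idea can be sketched as follows. This version uses a map instead of a fixed vector so it also illustrates dynamic replica IDs, and it simplifies delta bookkeeping to a single buffer flushed per sync; a real system would track acknowledgements per peer.

```python
# A delta-state sketch for a G-Counter: ship only the entries that changed
# since the last sync, not the full vector. Per-peer acknowledgement
# tracking is simplified to one flushable buffer.

class DeltaGCounter:
    def __init__(self, replica_id: str):
        self.rid = replica_id
        self.entries: dict[str, int] = {}   # map form handles dynamic replicas
        self.delta: dict[str, int] = {}     # changes since the last flush

    def increment(self, amount: int = 1):
        self.entries[self.rid] = self.entries.get(self.rid, 0) + amount
        self.delta[self.rid] = self.entries[self.rid]

    def flush_delta(self) -> dict[str, int]:
        d, self.delta = self.delta, {}
        return d            # proportional to changed entries, not to n

    def merge(self, state: dict[str, int]):
        # The join is the same for full states and deltas: pointwise max.
        for rid, v in state.items():
            self.entries[rid] = max(self.entries.get(rid, 0), v)

    def value(self) -> int:
        return sum(self.entries.values())

a, b = DeltaGCounter("a"), DeltaGCounter("b")
a.increment(); a.increment()
b.merge(a.flush_delta())    # only {"a": 2} crosses the wire
b.increment()
a.merge(b.flush_delta())
assert a.value() == b.value() == 3
```

The key design point is that a delta is itself a lattice element, so merging a delta uses exactly the same join as merging a full state.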
Operation-based counters invert the trade-off entirely. An increment operation is a single message—O(1) bandwidth per update—but requires the delivery layer to guarantee causal ordering and exactly-once semantics. In practice, achieving exactly-once delivery in a distributed system is itself a hard problem, often requiring persistent message logs and deduplication infrastructure. The choice between state-based, delta-state, and operation-based counter designs is therefore not about which is "better" in the abstract, but about which failure and cost model matches your deployment. The algebra is constant; the engineering is contextual.
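To make the delivery requirement concrete, the sketch below attaches a unique ID to each operation and keeps a seen-set at every replica, standing in for the deduplication infrastructure the text describes. Counter increments commute with each other, so causal ordering is trivially satisfied here; richer data types would not get that for free.

```python
# An operation-based counter sketch. Each increment is an O(1) message;
# a per-operation ID plus a seen-set makes delivery idempotent, standing
# in for real exactly-once infrastructure.

import itertools

class OpCounter:
    _ids = itertools.count()            # globally unique op IDs (for the demo)

    def __init__(self):
        self.value = 0
        self.seen: set[int] = set()     # dedup log in place of exactly-once

    @classmethod
    def make_increment(cls, amount: int = 1):
        return (next(cls._ids), amount)   # (unique op id, payload)

    def apply(self, op):
        op_id, amount = op
        if op_id in self.seen:            # duplicate delivery: ignore
            return
        self.seen.add(op_id)
        self.value += amount

r0, r1 = OpCounter(), OpCounter()
op = OpCounter.make_increment()
for replica in (r0, r1):
    replica.apply(op)
    replica.apply(op)                     # the duplicated message is absorbed
assert r0.value == r1.value == 1
```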
Takeaway: Every CRDT counter design trades space against bandwidth against delivery guarantees. The right implementation depends on your replica topology and failure model, not on the data type alone.
Tombstone Management
Set CRDTs expose a problem that counters never face: how do you represent removal in a monotonically growing structure? In a grow-only set (G-Set), elements are only added, and merge is set union. But practical systems need removal. The two-phase set (2P-Set) solves this with two G-Sets—one for additions, one for removals. An element is in the set if it appears in the add-set but not the remove-set. The cost is severe: once removed, an element can never be re-added. The remove-set entries are tombstones—persistent markers of deletion that must be retained indefinitely to prevent removed elements from reappearing when stale add-messages arrive.
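A 2P-Set is small enough to show whole. The final two lines of the sketch exercise the limitation just described: a re-add after removal is permanently suppressed by the tombstone.

```python
# A minimal 2P-Set sketch: two grow-only sets, union merge, remove wins.

class TwoPhaseSet:
    def __init__(self):
        self.added: set = set()
        self.removed: set = set()       # tombstones: retained indefinitely

    def add(self, e):
        self.added.add(e)

    def remove(self, e):
        if e in self.added:             # can only remove observed elements
            self.removed.add(e)

    def contains(self, e) -> bool:
        return e in self.added and e not in self.removed

    def merge(self, other: "TwoPhaseSet"):
        self.added |= other.added
        self.removed |= other.removed

a, b = TwoPhaseSet(), TwoPhaseSet()
a.add("x")
b.merge(a)
b.remove("x")
a.merge(b)
assert not a.contains("x")
a.add("x")                  # re-add is silently suppressed by the tombstone
assert not a.contains("x")
```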
The Observed-Remove Set (OR-Set) addresses the re-addition limitation by tagging each addition with a unique identifier—typically a combination of replica ID and a local sequence number. Removal deletes specific tagged instances, not the element itself. Re-adding the same element generates a new tag, which is unaffected by prior removals. This is semantically superior but exacerbates the tombstone problem: every removal produces a tombstone that maps a unique tag to a deletion marker, and these tombstones accumulate without bound as elements are repeatedly added and removed.
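The tagging mechanism can be sketched as follows, using (replica ID, sequence number) pairs as unique tags. Note that the tombstone set only ever grows, which is exactly the accumulation problem discussed next.

```python
# A minimal OR-Set sketch: additions carry (replica_id, seq) tags; removal
# tombstones specific observed tags, so a re-add with a fresh tag survives.

class ORSet:
    def __init__(self, replica_id: str):
        self.rid = replica_id
        self.seq = 0
        self.adds: set = set()        # (element, tag) pairs
        self.tombstones: set = set()  # removed tags, kept indefinitely

    def add(self, e):
        self.seq += 1
        self.adds.add((e, (self.rid, self.seq)))   # fresh unique tag

    def remove(self, e):
        # Tombstone only the tagged instances this replica has observed.
        for elem, tag in list(self.adds):
            if elem == e:
                self.tombstones.add(tag)

    def contains(self, e) -> bool:
        return any(elem == e and tag not in self.tombstones
                   for elem, tag in self.adds)

    def merge(self, other: "ORSet"):
        self.adds |= other.adds
        self.tombstones |= other.tombstones

a, b = ORSet("a"), ORSet("b")
a.add("x")
b.merge(a)
b.remove("x")
a.merge(b)
assert not a.contains("x")
a.add("x")                  # new tag: unaffected by the old tombstone
b.merge(a)
assert a.contains("x") and b.contains("x")
```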
The tombstone accumulation problem is not merely a storage concern—it directly impacts merge performance. Merging two OR-Set states requires comparing their tombstone sets to determine which tagged additions have been removed. As tombstone sets grow, merge cost grows with them, eventually dominating the cost of the actual live data. In long-running systems with high churn—chat applications, collaborative editors, shopping carts—tombstone volume can exceed live data volume by orders of magnitude.
Causal stability provides the theoretical foundation for tombstone garbage collection. A tombstone is causally stable at a replica when that replica knows that every other replica has observed the deletion. At that point, the tombstone can be safely discarded because no future message can carry an add-operation that the tombstone was needed to suppress. Determining causal stability requires tracking the causal frontier of each replica—typically via vector clocks or similar mechanisms. This reintroduces a form of coordination, but it is background coordination that does not block updates or sacrifice availability.
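The stability check itself reduces to a vector comparison. In this sketch, each tombstone carries the (origin, seq) timestamp of its delete, and each replica gossips an acknowledgement vector recording the highest sequence number it has seen from each origin; the gossip transport is elided, and the names are illustrative.

```python
# A sketch of causal-stability-based tombstone GC. A tombstone is stable,
# and safe to discard, once every known replica's acknowledgement vector
# covers its timestamp. The gossip of ack vectors is assumed, not shown.

def is_stable(tombstone_ts, acks):
    """tombstone_ts: (origin, seq). acks: replica -> {origin: max seq seen}."""
    origin, seq = tombstone_ts
    return all(vec.get(origin, 0) >= seq for vec in acks.values())

def gc(tombstones, acks):
    """Keep only the tombstones some replica might still need."""
    return {ts for ts in tombstones if not is_stable(ts, acks)}

tombstones = {("a", 1), ("a", 2), ("b", 1)}
acks = {
    "a": {"a": 2, "b": 1},
    "b": {"a": 2, "b": 1},
    "c": {"a": 1, "b": 1},   # replica c has not yet seen a's second delete
}
remaining = gc(tombstones, acks)
assert remaining == {("a", 2)}   # only the unstable tombstone survives
```

This also makes the offline-replica failure mode visible: as long as replica c's entry lags, the tombstone ("a", 2) can never be collected, which is precisely why eviction and epoch-based compaction exist.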
In practice, tombstone garbage collection is one of the hardest operational challenges in CRDT deployments. Replicas that go offline for extended periods prevent tombstones from becoming stable, as the system cannot confirm they have observed the deletion. Solutions range from replica eviction policies—declaring long-absent replicas dead and removing them from the causal stability calculation—to epoch-based compaction, where the system periodically checkpoints a full state and discards all tombstones prior to the checkpoint, requiring reconnecting replicas to re-synchronize from the checkpoint rather than merging incrementally. Each approach trades correctness guarantees against operational pragmatism, and the right choice depends on the tolerance for stale replicas and the cost of full re-synchronization.
Takeaway: Tombstones are the debt CRDTs pay for coordination-free deletion. Managing that debt—through causal stability tracking, replica eviction, or epoch compaction—is where theoretical elegance meets operational reality.
CRDTs are not magic—they are a disciplined application of algebraic constraints to distributed state. The semilattice guarantees convergence as a mathematical property, counter implementations reveal the space-bandwidth-delivery trade-off space, and tombstone management exposes the ongoing cost of coordination-free deletion.
The recurring theme is that CRDTs shift complexity, not eliminate it. The coordination cost you avoid at write time resurfaces as metadata overhead, merge complexity, and garbage collection challenges. Understanding where that complexity lands—and whether your system can absorb it—is the real engineering decision.
For systems architects evaluating CRDTs, the question is never whether they "work" in theory. The semilattice math is settled. The question is whether your replica topology, churn rate, and operational constraints align with the specific costs that each CRDT variant imposes. The algebra gives you correctness for free. Everything else, you pay for.