Every distributed systems engineer eventually confronts the same brutal reality: the transaction model that worked flawlessly at modest scale collapses catastrophically under load. The database that handled thousands of transactions per second suddenly struggles with tens of thousands, not because of hardware limitations, but because of fundamental coordination overhead that grows superlinearly with system size.
The culprit is almost always two-phase commit (2PC) or one of its variants. 2PC has dominated distributed transaction thinking for decades, offering the seductive promise of ACID guarantees across multiple nodes. But this promise comes with a hidden cost that only manifests at scale—a blocking behavior that transforms coordinator failures from minor inconveniences into system-wide unavailability events. Understanding why 2PC fails requires formal analysis of its failure modes and the theoretical limits they impose.
The good news is that alternatives exist, though each demands careful trade-off analysis. Saga patterns sacrifice atomicity for availability, requiring sophisticated compensation logic. Deterministic databases like Calvin take a more radical approach, eliminating coordination overhead entirely through predetermined transaction ordering. Neither is a drop-in replacement—both require architectural rethinking. This analysis examines each approach's theoretical foundations and practical implications for systems operating at massive scale.
The Blocking Pathology of Two-Phase Commit
Two-phase commit's elegance conceals a fundamental flaw: it creates an unavoidable blocking window where participant nodes cannot make progress without coordinator input. During the interval between a participant voting to commit and receiving the coordinator's final decision, that participant holds locks on resources and cannot unilaterally abort or commit. If the coordinator fails during this window, participants must wait indefinitely.
Formally, consider a 2PC execution with coordinator C and participants P₁...Pₙ. After all participants vote YES, they enter the prepared state. In this state, participant Pᵢ has promised to commit if C decides COMMIT, but it cannot unilaterally commit and release its locks, because C might still decide ABORT (for example, if Pᵢ was the last to vote and C failed before logging its decision, a recovering coordinator may abort). Nor can Pᵢ safely abort, because other participants may have already committed based on a decision Pᵢ never received.
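To make the blocking window concrete, here is a minimal sketch of the participant side of the protocol (the Participant class and its method names are illustrative, not taken from any particular system): once a participant votes YES, the only legal exit from the prepared state is a message from the coordinator.

```python
from enum import Enum, auto

class State(Enum):
    INIT = auto()
    PREPARED = auto()    # voted YES: locks held, awaiting the coordinator's decision
    COMMITTED = auto()
    ABORTED = auto()

class Participant:
    def __init__(self):
        self.state = State.INIT
        self.locks_held = set()

    def on_prepare(self, lock_set) -> str:
        """Coordinator asks us to prepare: acquire locks, force a log record, vote YES."""
        self.locks_held |= set(lock_set)
        self.state = State.PREPARED
        return "YES"                      # from here on, we are bound by the coordinator

    def on_decision(self, decision: str) -> None:
        """Only the coordinator's decision moves us out of PREPARED."""
        assert self.state is State.PREPARED
        self.state = State.COMMITTED if decision == "COMMIT" else State.ABORTED
        self.locks_held.clear()           # locks are released only here

    def on_coordinator_timeout(self) -> None:
        """The blocking window: in PREPARED we may neither commit nor abort unilaterally."""
        if self.state is State.PREPARED:
            pass                          # keep waiting; progress requires hearing from C
```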
The probability of hitting this blocking window scales with transaction duration and participant count. If each participant's prepared-state duration is τ and the coordinator failure probability per unit time is λ, the blocking probability for n participants is approximately 1 - e^(-nλτ). At modest scale, nλτ is negligible. At massive scale with thousands of participants and millisecond-sensitive latencies, this probability becomes operationally significant.
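Plugging illustrative numbers into that expression shows how the exposure grows with participant count (the failure rate and window length below are assumptions chosen for illustration, not measurements):

```python
import math

def blocking_probability(n: int, failure_rate_per_sec: float, prepared_secs: float) -> float:
    """P(coordinator fails during the prepared window) ≈ 1 - e^(-nλτ)."""
    return 1.0 - math.exp(-n * failure_rate_per_sec * prepared_secs)

lam = 1.0 / 86_400    # assume one coordinator failure per day, on average
tau = 0.005           # assume a 5 ms prepared-state window per participant

print(blocking_probability(3, lam, tau))       # ~1.7e-7 per transaction
print(blocking_probability(5_000, lam, tau))   # ~2.9e-4 per transaction
```

At a sustained 10,000 transactions per second, the second figure corresponds to roughly three transactions per second entering the blocking window, each holding its locks until the coordinator recovers.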
Three-phase commit attempts to address blocking by adding a pre-commit phase, but it only eliminates blocking under specific failure assumptions—namely, that network partitions don't occur simultaneously with node failures. In practice, correlated failures (power outages, network equipment failures) violate these assumptions precisely when they matter most. 3PC also adds latency, making it unsuitable for high-throughput systems.
The FLP impossibility result provides theoretical grounding for this limitation: no deterministic consensus protocol can guarantee both safety and liveness in an asynchronous system with even one faulty process. 2PC chooses safety over liveness, accepting blocking as the price. Understanding this trade-off is essential—you cannot engineer around a theoretical impossibility, only choose which guarantees to sacrifice.
Takeaway: Coordinator failures during 2PC's prepared state create unavoidable blocking. This isn't a bug to fix but a fundamental trade-off between safety and liveness that scales poorly with system size.
Saga Patterns and the Complexity of Semantic Compensation
Sagas decompose a distributed transaction into a sequence of local transactions T₁, T₂, ..., Tₙ, each with a corresponding compensating transaction C₁, C₂, ..., Cₙ. If Tₖ fails after T₁ through Tₖ₋₁ have committed, the system executes Cₖ₋₁, Cₖ₋₂, ..., C₁ in reverse order to undo the effects. This eliminates the blocking problem—each local transaction commits immediately, releasing locks without waiting for global coordination.
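A minimal orchestration sketch (the Step structure and run_saga function are illustrative names, not a specific framework's API) captures the control flow: execute local transactions in order, and on the first failure walk back through the compensations of the steps that already committed.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Step:
    name: str
    action: Callable[[dict], Any]        # local transaction T_i
    compensate: Callable[[dict], Any]    # compensating transaction C_i

def run_saga(steps: list[Step], ctx: dict) -> bool:
    """Run each local transaction in order; on failure, compensate in reverse."""
    completed: list[Step] = []
    for step in steps:
        try:
            step.action(ctx)             # commits locally and releases its own locks
            completed.append(step)
        except Exception:
            for done in reversed(completed):   # C_{k-1}, ..., C_1
                done.compensate(ctx)
            return False
    return True
```

Each step's local commit is immediately visible to the rest of the system, which is the semantic-atomicity caveat discussed below.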
The critical insight is that sagas provide semantic atomicity rather than true atomicity. The intermediate states where T₁ through Tₖ₋₁ have committed but Tₖ has not are visible to other transactions. This is acceptable when business logic can tolerate such visibility, but requires careful analysis of invariants that might be violated during partial execution.
Compensation is where saga complexity explodes. For simple operations like crediting an account, compensation (debiting) is straightforward. But many operations have non-invertible effects. Sending an email cannot be unsent. Shipping a physical product cannot be unshipped after delivery. Notifying a third-party system may trigger actions beyond your control. Saga design must account for these semantic realities.
The compensation logic itself can fail, requiring nested compensation handling. Consider a saga where T₃ fails and C₂ must execute, but C₂ also fails. Now you need compensation for the failed compensation—a pattern that can recurse indefinitely without careful bounded-retry policies and ultimate fallback to manual intervention. Saga orchestrators must implement compensation idempotency to handle retries safely.
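One way to bound that recursion, sketched below under the assumption that each compensation is idempotent, is a fixed retry budget with an explicit escalation path to manual intervention (the function and parameter names are illustrative):

```python
import time

def compensate_with_retries(compensation, ctx: dict, *, attempts: int = 5,
                            backoff_secs: float = 0.5) -> bool:
    """Retry an idempotent compensation a bounded number of times.

    The compensation must tolerate being applied more than once (e.g. by recording
    a compensation ID it has already processed); otherwise retrying is unsafe.
    """
    for attempt in range(1, attempts + 1):
        try:
            compensation(ctx)
            return True
        except Exception:
            time.sleep(backoff_secs * attempt)   # simple linear backoff
    # Ultimate fallback: stop recursing and escalate for manual intervention.
    ctx.setdefault("manual_review", []).append(getattr(compensation, "__name__", str(compensation)))
    return False
```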
Choreography-based sagas distribute compensation responsibility across services, reducing single points of failure but complicating debugging and reasoning about system state. Orchestration-based sagas centralize control, improving observability but reintroducing coordinator-like failure modes. Neither approach eliminates saga complexity—they merely relocate it. The choice depends on organizational factors as much as technical ones.
Takeaway: Sagas trade atomic isolation for availability, but compensation logic complexity grows with operation irreversibility. Design sagas around operations with natural inverses; isolate non-invertible operations at saga boundaries.
Deterministic Databases and Pre-Ordered Execution
Deterministic database systems like Calvin take a fundamentally different approach: instead of coordinating during transaction execution, they coordinate on transaction ordering before execution begins. All nodes agree on a total order of transactions, then execute them deterministically. Since execution is deterministic and order is predetermined, all nodes reach identical states without runtime coordination.
This architecture eliminates 2PC entirely for read-write transactions spanning multiple partitions. The sequencer layer batches incoming transactions and distributes ordered batches to all nodes. Each node executes the batch in order, applying transactions to its local partitions. Cross-partition transactions simply execute at each relevant partition in the predetermined order—no voting, no prepared states, no blocking.
The performance implications are profound. Traditional distributed databases spend significant CPU cycles on lock management, deadlock detection, and coordination messaging. Calvin-style systems replace this with deterministic lock acquisition following the predetermined order, eliminating deadlocks by construction. Throughput scales with batch size, as coordination cost is amortized across all transactions in a batch.
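The sketch below (illustrative types, single-threaded for brevity) shows the property being described: every replica receives the same ordered batch and applies the same deterministic logic, so replicas converge without exchanging messages during execution, and lock acquisition order is fixed by the batch order.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Txn:
    txn_id: int
    read_set: frozenset              # declared before execution begins
    write_set: frozenset             # declared before execution begins
    logic: Callable[[dict], None]    # deterministic function of the declared reads

def execute_batch(ordered_batch: list[Txn], store: dict) -> None:
    """Apply a sequencer-ordered batch on one replica.

    Because both the order and each transaction's logic are deterministic, every
    replica that applies the same batch reaches an identical state with no
    cross-node coordination during execution.
    """
    for txn in ordered_batch:
        # In a real engine, all locks for txn i are requested (in a canonical key
        # order) before any lock for txn i+1, so acquisition order is fixed by the
        # batch order and deadlock is impossible by construction. The lock manager
        # is elided in this single-threaded sketch.
        txn.logic(store)
```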
However, deterministic execution imposes architectural constraints that many applications cannot satisfy. All transaction logic must be known before execution begins—you cannot read a value and then conditionally decide what to write based on that value, because the sequencer must know all read and write sets upfront. This eliminates a large class of application patterns and requires restructuring transaction logic.
Reconnaissance queries address some limitations by pre-reading data before formal transaction submission, but they introduce their own consistency challenges. OLLP (Optimistic Lock Location Prediction) allows speculative execution with abort-and-retry when predictions fail. These techniques expand the applicability of deterministic systems but add complexity. The fundamental trade-off remains: predetermined ordering enables coordination-free execution but constrains transaction expressiveness.
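A rough sketch of that reconnaissance-and-retry loop, with illustrative function names rather than Calvin's actual API: predict the access sets with cheap pre-reads, submit the transaction with the predicted sets declared, and resubmit with a fresh prediction if the data moved in the meantime.

```python
def submit_dependent_txn(predict_sets, build_txn, submit, max_retries: int = 3):
    """Reconnaissance / OLLP-style loop: predict the access sets, then abort and
    retry with a fresh prediction if the data moved before execution."""
    for _ in range(max_retries):
        read_set, write_set = predict_sets()    # cheap reconnaissance reads
        txn = build_txn(read_set, write_set)    # full read/write sets declared up front
        result = submit(txn)                    # sequencer orders it; replicas execute it
        if result != "ABORT_STALE_PREDICTION":
            return result
    raise RuntimeError("predictions kept going stale; fall back to another strategy")
```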
Takeaway: Deterministic databases eliminate coordination overhead by agreeing on transaction order before execution. The constraint is that all read-write sets must be known upfront—trading transaction flexibility for massive throughput gains.
Distributed transaction failures at scale are not implementation bugs but manifestations of fundamental theoretical limits. Two-phase commit's blocking behavior, saga compensation complexity, and deterministic ordering constraints each represent different trade-off surfaces in the same impossibility landscape.
Choosing the right approach requires honest assessment of your consistency requirements. If you truly need serializable cross-partition transactions, accept 2PC's availability limitations and engineer for coordinator reliability. If eventual consistency suffices, sagas provide availability at the cost of compensation complexity. If you can restructure transactions to declare read-write sets upfront, deterministic systems offer the highest throughput.
The mistake is expecting a solution that sacrifices nothing. FLP guarantees you cannot have safety, liveness, and fault tolerance simultaneously in an asynchronous system. Your architecture must choose which guarantee to weaken—and that choice should be explicit, informed, and aligned with business requirements.