The microservices movement gave us a powerful heuristic: avoid distributed transactions. Design for eventual consistency. Use sagas. Let services own their data. For most systems, this guidance is sound. But somewhere along the way, a reasonable default hardened into dogma.

The reality is that some operations demand strong consistency across service boundaries, and pretending otherwise doesn't eliminate the problem — it just pushes complexity into compensation logic, manual reconciliation, and late-night incident calls. Financial transfers, inventory reservations during flash sales, regulatory compliance workflows — these aren't edge cases. They're core business operations where "eventually consistent" can mean "temporarily wrong in ways that cost real money."

This article isn't a call to wrap every microservice interaction in a two-phase commit. It's a framework for recognizing when distributed transaction complexity is justified and how modern patterns have changed the cost-benefit calculus. The goal is architectural precision — choosing the right consistency model for each operation rather than applying a single philosophy everywhere.

Two-Phase Commit Mechanics

Two-phase commit (2PC) is the canonical distributed transaction protocol, and it has earned its reputation for being problematic. In the prepare phase, a coordinator asks each participant whether it can commit; each participant performs its work, acquires locks, and votes yes or no. In the commit phase, the coordinator acts on the collected votes: if every participant voted yes, it broadcasts a commit message; if anyone voted no, it broadcasts an abort. Simple in theory. Treacherous in practice.
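The control flow above can be sketched in a few lines of Python. This is a toy model, not a real resource manager: the participant names, the in-memory `state` field, and the single-threaded coordinator loop are all illustrative, and a real implementation would persist every step durably.

```python
from enum import Enum

class Vote(Enum):
    YES = "yes"
    NO = "no"

class Participant:
    """Toy resource manager: prepares (work + locks), then awaits the decision."""
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        if self.can_commit:
            self.state = "prepared"  # locks are held from here until the decision
            return Vote.YES
        self.state = "aborted"       # a no-voter can abort unilaterally
        return Vote.NO

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    """Coordinator: collect votes (phase 1), then broadcast the decision (phase 2)."""
    votes = [p.prepare() for p in participants]
    decision = all(v is Vote.YES for v in votes)
    for p in participants:
        if p.state == "prepared":
            p.commit() if decision else p.abort()
    return "committed" if decision else "aborted"
```

A single no vote aborts every prepared participant; the dangerous window is between `prepare()` returning yes and the decision arriving.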

The fundamental issue is the blocking problem. Once a participant votes yes, it holds locks and waits for the coordinator's decision. If the coordinator crashes between phases, participants are stuck — they can't commit, they can't abort, and they're holding resources hostage. This is the availability tradeoff at the heart of 2PC. You're choosing consistency at the cost of reduced system availability during coordinator failures. In a microservices landscape where independent deployability and fault isolation are primary goals, this tradeoff feels like heresy.

But understanding the failure modes precisely matters more than dismissing the protocol entirely. The critical vulnerability is a coordinator crash after participants have voted yes but before they receive the decision. A crash before any yes vote is recoverable: participants simply time out and abort. Modern implementations address this with persistent transaction logs, coordinator redundancy through consensus protocols like Raft, and heuristic decision-making after timeouts. These mitigations don't eliminate the fundamental tradeoff, but they shrink the vulnerability window significantly.
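The write-ahead discipline behind those mitigations can be shown in miniature. In this sketch an append-only list stands in for an fsync'd coordinator log; the point is only the ordering, persist the decision before broadcasting it, so that a restarted coordinator can finish delivery rather than leave participants in doubt.

```python
def decide(log, votes):
    """Persist the decision before telling any participant (write-ahead)."""
    decision = "commit" if all(votes) else "abort"
    log.append(decision)  # stand-in for an fsync'd coordinator transaction log
    return decision

def recover(log):
    """A restarted coordinator replays its log: a recorded decision is
    re-delivered; no recorded decision means no participant could have
    been told to commit, so abort is safe."""
    return log[-1] if log else "abort"
```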

The availability cost of 2PC is real, but it's bounded and well-understood. Compare that to the unbounded complexity of discovering data inconsistencies weeks after they occurred. When architects dismiss 2PC reflexively, they're often comparing its worst case to the best case of eventual consistency — a comparison that flatters the wrong option for high-stakes operations.

Takeaway

Two-phase commit's availability cost is bounded and predictable. The cost of data inconsistency in the wrong domain is often unbounded and discovered too late. Know exactly what you're trading before you decide.

Modern Distributed Transaction Patterns

The saga pattern emerged as the microservices-friendly alternative to distributed transactions. Instead of a single atomic operation across services, a saga breaks work into a sequence of local transactions, each with a corresponding compensating action that undoes its effect if a later step fails. It's a clever inversion: rather than preventing inconsistency, you detect it and repair it. This works beautifully — until it doesn't.
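A minimal orchestration of that idea, with hypothetical step names: each local transaction is paired with a compensating action, and a failure triggers the compensations for completed steps in reverse order.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order. If a step fails, run the
    compensations for every completed step in reverse order."""
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for comp in reversed(completed):
                comp()  # in real systems, each of these can itself fail
            return "compensated"
        completed.append(compensation)
    return "completed"
```

Note that the saga reports "compensated", not "failed": the system repaired itself, but it spent time in an inconsistent state along the way.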

Compensation is inherently more complex than rollback. Rolling back a database transaction restores the previous state perfectly. Compensating a business operation is a different beast entirely. You sent a confirmation email — you can't unsend it. You reserved inventory and another customer bought the last unit in the gap — compensation now involves customer communication, not just a database update. You transferred funds and the receiving account has already been closed. Each compensation step is its own failure-prone operation that needs its own error handling, retry logic, and edge case coverage.
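Because each compensating action is itself failure-prone, it needs its own retry and escalation path. A sketch, assuming a transient-failure model; the escalation marker is invented, and real systems also need idempotency keys and dead-letter handling for compensations that never succeed.

```python
import time

def compensate_with_retry(compensation, attempts=3, backoff_seconds=0.0):
    """Run one compensating action with bounded retries. On exhaustion,
    return a hypothetical escalation marker; real systems route this to
    manual reconciliation or a dead-letter queue."""
    for attempt in range(attempts):
        try:
            return compensation()
        except Exception:
            if attempt == attempts - 1:
                return "escalate_to_manual_reconciliation"
            time.sleep(backoff_seconds)
```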

Orchestration-based sagas centralize this complexity in a coordinator service that manages the workflow. Choreography-based sagas distribute it across events. Neither approach eliminates the fundamental challenge: your system spends time in an inconsistent state, and every participant must handle that gracefully. For many operations — shopping carts, social media posts, content publishing — this intermediate inconsistency is perfectly acceptable. Users don't notice or don't care.
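The choreography variant can be sketched with a minimal in-process event bus: each service reacts to the previous service's event, and no central coordinator exists. The service and event names here are hypothetical, and the bus is a stand-in for a real message broker.

```python
class EventBus:
    """Minimal in-process pub/sub bus; a stand-in for a real message broker."""
    def __init__(self):
        self.handlers = {}

    def subscribe(self, event, handler):
        self.handlers.setdefault(event, []).append(handler)

    def publish(self, event, payload):
        for handler in self.handlers.get(event, []):
            handler(payload)

bus = EventBus()
audit = []

# Each service subscribes to the previous service's event; the workflow
# exists only implicitly, in the chain of subscriptions.
def on_order_placed(order):
    audit.append("inventory_reserved")
    bus.publish("inventory_reserved", order)

def on_inventory_reserved(order):
    audit.append("payment_charged")

bus.subscribe("order_placed", on_order_placed)
bus.subscribe("inventory_reserved", on_inventory_reserved)
bus.publish("order_placed", {"order_id": 1})
```

The implicit workflow is exactly what makes choreography hard to reason about at scale: there is no single place to read the saga's shape or its compensation paths.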

But for operations where intermediate inconsistency has regulatory, financial, or safety implications, sagas introduce a category of bugs that are extraordinarily difficult to test comprehensively. The combinatorial explosion of failure points across a multi-step saga can make the operational complexity of 2PC look modest by comparison. The honest architectural assessment isn't that sagas replace distributed transactions — it's that they offer a different set of tradeoffs suited to a different set of problems.

Takeaway

Sagas don't eliminate distributed transaction complexity — they redistribute it into compensation logic. Before choosing sagas for a critical workflow, ask whether the compensation failure modes are simpler or harder to reason about than the consistency guarantee you gave up.

Consistency Requirement Analysis

The most valuable architectural skill here isn't knowing how to implement 2PC or sagas. It's knowing when each is appropriate. This requires a consistency requirement analysis — a structured evaluation of each cross-service operation against specific criteria. Start with the cost of inconsistency. If two services briefly disagree about state, what's the business impact? If the answer is "a user sees stale data for a few seconds," eventual consistency is fine. If the answer is "we double-charged a customer" or "we violated a regulatory reporting requirement," you need stronger guarantees.

Next, evaluate compensation feasibility. Can the operation be meaningfully reversed? Some operations are naturally idempotent and reversible — holding a reservation, flagging a record. Others are irrevocable — executing a wire transfer, submitting a regulatory filing, triggering a physical process. When compensation is difficult, unreliable, or impossible, the saga pattern's fundamental premise breaks down. You're building elaborate recovery machinery for situations where recovery may not work.

Consider the consistency window tolerance. How long can the system remain in an inconsistent state before real damage occurs? If the tolerance is measured in days, eventual consistency with reconciliation jobs is pragmatic. If it's measured in milliseconds — as in financial trading systems or real-time inventory during peak load — the window effectively demands synchronous consistency. The tolerance isn't just technical; it's a business decision that architects need to surface explicitly with stakeholders.

Finally, assess the operational maturity of your team. Saga-based systems require sophisticated observability — distributed tracing, compensation monitoring, inconsistency detection, and manual intervention tooling. If your team can't invest in that operational infrastructure, the theoretical elegance of sagas becomes a practical liability. A well-implemented 2PC with coordinator redundancy may actually be simpler to operate than a poorly observed saga. The best architecture is the one your team can run reliably at three in the morning.
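The four criteria can be folded into a decision sketch. The labels and thresholds below are illustrative, not prescriptive (a sub-second window stands in for "measured in milliseconds"); the real analysis is the stakeholder conversation described above.

```python
def choose_consistency_model(inconsistency_cost, compensable,
                             window_seconds, team_can_run_sagas):
    """Map the four criteria to a consistency model.
    inconsistency_cost: "low" or "high" business impact of brief disagreement
    compensable:        can the operation be meaningfully reversed?
    window_seconds:     tolerable duration of inconsistency
    team_can_run_sagas: can the team operate saga observability and tooling?
    """
    if inconsistency_cost == "low":
        return "eventual consistency"
    if not compensable or window_seconds < 1:
        return "distributed transaction"
    return "saga" if team_can_run_sagas else "distributed transaction"
```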

Takeaway

The decision between distributed transactions and eventual consistency isn't a technical preference — it's a risk assessment. Map each cross-service operation to its inconsistency cost, compensation feasibility, consistency window, and your team's operational capability.

The microservices community was right to push back against distributed transactions as a default. Most cross-service interactions genuinely work better with eventual consistency, and 2PC introduces real availability costs that shouldn't be accepted lightly.

But "avoid by default" is not the same as "avoid always." The architect's job is to distinguish between the 80% of operations where eventual consistency is elegant and the 20% where it creates hidden liabilities. That distinction requires honest analysis, not ideology.

Design each consistency boundary deliberately. Use eventual consistency where the business tolerates it. Use distributed transactions where the business demands them. The sophistication isn't in choosing one pattern — it's in knowing exactly where each belongs.