Lock-free programming represents one of the most demanding disciplines in systems engineering. The fundamental promise is compelling: concurrent data structures that guarantee system-wide progress without the serialization bottlenecks of traditional mutex-based synchronization. Yet practitioners quickly discover that compare-and-swap (CAS), the cornerstone primitive of lock-free algorithms, introduces its own constellation of subtle correctness and performance challenges.
The gap between understanding CAS semantics and building production-grade lock-free structures spans years of hard-won experience. Basic tutorials demonstrate simple counters and stacks, but real-world concurrent queues, maps, and memory allocators require techniques that most engineers never encounter. The ABA problem lurks in seemingly correct algorithms. Memory ordering violations manifest as impossible states under high contention. And the distinction between lock-free and wait-free guarantees determines whether your system degrades gracefully or catastrophically under load.
This examination targets engineers who have moved beyond introductory lock-free concepts and now face the engineering realities of deploying these structures at scale. We will dissect the ABA problem with mathematical precision, explore the C++ memory model's acquire-release semantics as they apply to practical implementations, and formalize the progress guarantees that distinguish robust concurrent algorithms from fragile ones. The goal is not theoretical elegance but actionable understanding—the kind that prevents production outages and enables genuine performance breakthroughs.
The ABA Problem Solved
The ABA problem represents a fundamental failure mode in CAS-based algorithms where a memory location transitions from value A to value B and back to A between a thread's read and its subsequent CAS operation. The CAS succeeds because it observes the expected value, but the underlying data structure state may have changed dramatically. In a lock-free stack, for example, a thread might pop a node, have that node recycled and pushed back, then complete its stale operation—corrupting the structure entirely.
Tagged pointers address ABA by pairing the pointer with a monotonically increasing counter in a single atomic word. Each modification increments the counter, ensuring that even if the pointer value recurs, the combined tagged value never repeats within practical bounds. Two layouts are common on 64-bit systems: packing a small counter (typically 16 bits) into the unused upper bits of a 48-bit virtual address, which keeps the whole word within a single 64-bit CAS, or storing a full-width counter alongside the pointer, which requires double-width CAS (CMPXCHG16B on x86-64) and carries performance penalties on some microarchitectures. The trade-off is clear either way: reduced effective address space or reliance on 128-bit atomics.
Hazard pointers, introduced by Maged Michael, take a fundamentally different approach. Each thread publishes pointers it is currently accessing to a per-thread hazard pointer array. Before reclaiming memory, a thread scans all hazard pointers to ensure no other thread holds a reference. This provides precise memory reclamation without tagged pointer overhead but requires careful protocol adherence and introduces scanning costs proportional to thread count. The technique excels in read-heavy workloads where the reclamation cost is amortized.
Epoch-based reclamation (EBR) batches memory reclamation by tracking global epochs. Threads announce when they enter and exit critical sections, and memory is reclaimed only when all threads have passed through at least one complete epoch transition. The RCU (Read-Copy-Update) mechanism in the Linux kernel represents a highly optimized variant. EBR offers superior throughput under high contention but struggles with threads that stall within critical sections, which can allow unreclaimed memory to grow without bound.
Choosing among these techniques requires analyzing your workload's characteristics. Tagged pointers suit structures with frequent modifications and bounded concurrency. Hazard pointers excel when read operations dominate and memory footprint matters. EBR delivers maximum throughput when thread stalls are rare and memory overhead is acceptable. Production systems increasingly combine approaches: hazard pointers for hot paths with EBR for bulk reclamation during quiescent periods.
Takeaway: Select your ABA solution based on workload characteristics: tagged pointers for write-heavy bounded concurrency, hazard pointers for read-heavy memory-constrained scenarios, and epoch-based reclamation for maximum throughput when stalls are rare.
Memory Ordering Semantics
The C++ memory model defines six memory orderings, but lock-free programming predominantly concerns itself with four: memory_order_relaxed, memory_order_acquire, memory_order_release, and their combination memory_order_acq_rel. Sequential consistency (memory_order_seq_cst) provides the strongest guarantees but incurs full memory-barrier costs on most architectures. Understanding when weaker orderings suffice separates performant implementations from correct-but-slow ones.
Acquire-release semantics establish synchronization relationships between threads. A release store synchronizes with an acquire load of the same atomic variable. Critically, all writes performed before the release become visible to operations after the acquire. This creates a happens-before edge in the execution graph without requiring global synchronization. For a lock-free queue, the producer's release store to the tail pointer ensures the consumer's acquire load sees the fully constructed node.
The subtlety emerges in dependent operations. Consider a lock-free stack where a thread loads a node pointer, dereferences it to read the next pointer, then performs CAS. The initial load must be acquire to ensure the node's contents are visible. But what ordering does the CAS require? If it succeeds, release semantics ensure our modifications become visible—hence memory_order_acq_rel for success. If it fails, no writes are published, so memory_order_relaxed often suffices—but only when the reloaded expected value is not dereferenced before retrying. In the stack pop just described, a failed CAS hands back a fresh head pointer that the next iteration dereferences, so the failure ordering must itself be acquire.
Consume ordering (memory_order_consume) theoretically provides weaker guarantees than acquire for dependent loads, exploiting data dependencies rather than memory barriers. In practice, all major compilers promote consume to acquire because correctly implementing consume semantics proved intractable. The standardization of consume ordering remains a cautionary tale about the gap between theoretical memory models and practical compilation.
Debugging memory ordering violations presents unique challenges because bugs manifest probabilistically under specific timing conditions and compiler optimizations. Thread sanitizer (TSan) detects data races but not all memory ordering violations. Formal verification tools like CDSChecker and model checkers can exhaustively explore interleavings for small programs. For production code, the pragmatic approach combines conservative ordering choices with extensive stress testing across architectures—ARM and POWER exhibit weaker native ordering than x86-64, exposing bugs that never manifest on Intel hardware.
Takeaway: Default to acquire-release semantics at synchronization boundaries and prove weaker orderings correct before deploying them; ARM and POWER architectures expose ordering bugs that x86 masks.
Progress Guarantees
Lock-freedom guarantees that at least one thread makes progress in a finite number of steps, even if other threads are arbitrarily delayed or fail. This is weaker than it sounds: individual threads may starve indefinitely while the system as a whole advances. The guarantee prevents deadlock and livelock but permits unbounded waiting for any particular operation. Most practical "lock-free" data structures provide this level of guarantee.
Wait-freedom is strictly stronger: every thread completes its operation in a bounded number of steps regardless of other threads' behavior. This eliminates starvation entirely but comes at significant complexity and performance cost. The universal construction—which can transform any sequential data structure into a wait-free concurrent one—demonstrates theoretical achievability but is impractical for production use. Practical wait-free structures exist for specific problems (wait-free queues, counters) but remain rare in systems code.
Obstruction-freedom provides the weakest useful guarantee: a thread makes progress if it eventually executes in isolation. Under contention, obstruction-free algorithms may livelock indefinitely. This guarantee suffices for systems with contention management mechanisms (exponential backoff, priority scheduling) that ensure eventual isolation. Some argue obstruction-freedom better matches real-world execution patterns where true isolation is common.
Selecting appropriate guarantees requires understanding failure modes. Lock-freedom prevents system-wide hangs from thread failures—critical for fault-tolerant systems where threads may crash or be preempted indefinitely. Wait-freedom suits hard real-time systems with strict latency bounds where worst-case behavior matters. Obstruction-freedom accepts weaker guarantees in exchange for simpler implementations when contention is rare.
The progress guarantee taxonomy extends to helping mechanisms. Lock-free algorithms often achieve their guarantee by having threads help complete other threads' pending operations. This cooperative completion ensures system progress but can cause a single slow thread to trigger cascading work across all threads. Understanding the helping strategy—which operations help others, how much work helping requires, and how helping interacts with the memory reclamation scheme—proves essential for predicting performance under contention.
Takeaway: Lock-freedom prevents system hangs from thread failures, wait-freedom eliminates starvation at significant cost, and obstruction-freedom accepts weaker guarantees for simpler implementations—choose based on your failure mode tolerance.
Lock-free data structures occupy a narrow but critical niche in systems engineering: contexts where mutex contention demonstrably bottlenecks throughput and correctness requirements permit the substantial investment in verification and testing these techniques demand. The techniques examined here—ABA solutions, memory ordering semantics, and progress guarantees—form the conceptual foundation upon which production implementations rest.
The practical wisdom distills to measured application. Benchmark before committing to lock-free implementations; the constant factors often favor well-tuned locks under moderate contention. When lock-free structures prove necessary, prefer battle-tested libraries (Folly's AtomicHashMap, crossbeam in Rust, libcds) over custom implementations. Reserve novel lock-free development for genuine necessity backed by formal reasoning.
The field continues advancing. Hardware transactional memory, persistent memory programming models, and increasingly sophisticated compiler analyses will reshape best practices. But the fundamentals—understanding atomicity boundaries, reasoning about memory visibility, and proving progress properties—remain stable ground for the engineer navigating concurrent systems at scale.