Memory Reclamation in Lock-Free Systems

macro photography of black circuit board

7 min read

Lock-free data structures decouple logical removal from physical deallocation, creating a window where freed memory may still be accessed by concurrent threads.

Hazard pointers solve this through per-thread pointer reservations with bounded memory overhead but require memory fences on every read.

Epoch-based reclamation uses a global epoch counter to batch reclamations cheaply, at the cost of unbounded memory when threads stall.

Each scheme embodies different assumptions about thread behavior, memory pressure, and the cost of synchronization on the read path.

Production systems typically hybridize these techniques, matching reclamation strategy to access pattern at each layer of the architecture.

Lock-free data structures promise progress guarantees that traditional mutex-based designs cannot offer: at least one thread always makes forward progress, regardless of arbitrary delays in others. Yet this elegance conceals a profound engineering problem. When a thread removes a node from a shared structure, when exactly is it safe to reclaim that memory?

In a single-threaded world, the answer is trivial. In a concurrent setting without locks, the question becomes deeply subtle. Other threads may hold references to the removed node, having loaded pointers before the removal became visible. Freeing the memory prematurely produces use-after-free hazards; deferring it indefinitely produces unbounded memory growth.

This tension—between safety and reclamation latency—defines one of the central challenges in concurrent programming. The literature offers several principled solutions: hazard pointers, epoch-based reclamation, and various forms of reference counting. Each embodies distinct assumptions about thread behavior, memory pressure, and synchronization cost. Each occupies a different point in the design space, with formally analyzable trade-offs in space overhead, time complexity, and progress guarantees. Understanding these techniques is not optional for systems architects building scalable infrastructure—it is foundational. This article examines the mathematical structure of the reclamation problem and dissects the three dominant approaches, illuminating when each becomes the rational choice.

The Reclamation Problem

Consider a lock-free stack implemented via a singly-linked list with a Compare-And-Swap (CAS) on the head pointer. Thread A executes pop: it reads the head pointer, dereferences it to load the next pointer, then attempts a CAS to swing head from the current node to its successor. Between the initial read and the dereference, thread B may also pop the same node, succeed in its CAS, and proceed to free the memory. Thread A's subsequent dereference is now a use-after-free—undefined behavior, potentially a segmentation fault or a corrupted read.

This is the essence of the safe memory reclamation (SMR) problem. The removal of a node from the logical structure and the deallocation of its physical memory are no longer the same act. Between these events lies a window in which other threads may still legitimately access the node, having acquired references before its removal became globally visible.

Formally, a node n is safe to reclaim only when no thread holds a pointer to n obtained prior to n's logical removal, and no thread will subsequently obtain such a pointer. Establishing this property requires coordination—precisely the coordination that lock-free algorithms strive to minimize.

The naive solution, garbage collection, suffices in managed languages but introduces stop-the-world pauses incompatible with the latency profile of high-performance systems. In unmanaged environments such as C++ or Rust, explicit reclamation is unavoidable, and the algorithm must furnish its own correctness argument.

The problem is further complicated by the ABA hazard: a pointer value may be recycled, with a freed-then-reallocated node masquerading as the original. CAS operations cannot distinguish the impostor from the original by value alone. Memory reclamation schemes must therefore address not only liveness of references but also the integrity of pointer identity over time.

Takeaway
Lock-free does not mean coordination-free. The removal of a reference and the reclamation of memory are distinct events, and bridging them safely is where the real engineering lives.

Hazard Pointer Mechanics

Hazard pointers, introduced by Maged Michael in 2004, provide a per-thread reservation protocol. Each thread maintains a small fixed number of hazard pointer slots—typically one or two per data structure. Before dereferencing a shared pointer, the thread publishes the pointer value into one of its hazard slots and then re-validates that the pointer is still reachable from the structure. This publish-and-verify pattern ensures that any thread attempting to reclaim the node will observe the hazard before proceeding.

Reclamation proceeds in three phases. First, retirement: a thread that removes a node places it into a thread-local retired list rather than freeing it directly. Second, when the retired list exceeds a threshold (commonly proportional to the total number of hazard pointers, say 2HN where H is hazard pointers per thread and N is thread count), the thread initiates a scan. It collects all currently published hazard pointers across all threads into a hash set. Third, for each node in its retired list, the thread checks membership in this set: nodes absent from the hazard set are provably safe to free; nodes present are deferred to a future scan.

The space complexity is O(HN + R) where R bounds the per-thread retired list size, yielding bounded memory overhead—a critical property absent in many alternatives. The amortized time cost per reclamation is O(HN) for the scan, amortized over the threshold of retirements between scans, giving O(1) amortized cost per node when the threshold is chosen proportionally.

Critically, hazard pointers require memory fences on the read path. After publishing a hazard pointer via a store, the thread must execute a full fence (or sequentially consistent store) before re-reading the source pointer. On x86, this fence is the dominant cost on the hot path; on weaker architectures such as ARM, the cost is even more pronounced.

The principal weakness is read-path overhead. Every traversal step in a linked structure incurs a fence, making hazard pointers expensive for read-heavy workloads with deep traversals. Their strength lies in bounded memory and per-object granularity: a single long-running reader cannot delay reclamation of unrelated objects.

Takeaway
Hazard pointers trade fence cost on every read for tight memory bounds. They are the right tool when memory pressure matters more than read throughput.

Epoch-Based Reclamation

Epoch-based reclamation (EBR), formalized by Keir Fraser and refined in subsequent work, takes a coarser-grained approach. The system maintains a global epoch counter, advanced monotonically. Each thread, upon entering a critical section that may access shared structures, records the current global epoch into a per-thread local variable. Upon exiting, it marks itself as quiescent.

Reclamation exploits a simple invariant: a node retired during epoch e is safe to reclaim once all threads have observed an epoch strictly greater than e. The protocol typically uses three epochs (current, current minus one, current minus two), advancing the global counter when all active threads are observed in the two most recent epochs. Retired nodes are placed in per-epoch limbo lists; when an epoch becomes provably unreachable, its entire limbo list is freed in bulk.

The read-path cost is dramatically lower than hazard pointers. Entering a critical section requires only a relaxed load of the global epoch and a store to a per-thread variable—no fence on each traversal step. For read-heavy workloads with deep pointer chasing, EBR can outperform hazard pointers by an order of magnitude.

However, EBR's progress is contingent on quiescence. A single thread that fails to exit its critical section—due to preemption, blocking syscall, or simply long-running iteration—prevents the global epoch from advancing, halting all reclamation. Memory usage is therefore unbounded in the worst case. This is unacceptable in systems where threads may be descheduled arbitrarily, such as those exposed to user code or operating under aggressive virtualization.

Variants address these weaknesses. Quiescent-state-based reclamation (QSBR), used in RCU, treats certain program points (context switches, syscall boundaries) as implicit quiescent states, eliminating critical-section bookkeeping entirely at the cost of restricting where pointers may be held. Interval-based reclamation and hyaline blend epoch counters with per-object reservations, recovering bounded memory while preserving most of EBR's read efficiency.

Takeaway
Coarse synchronization is cheaper than fine synchronization, but only when no participant stalls. EBR is a bet on bounded quiescence intervals.

The choice among reclamation schemes is not a matter of universal superiority but of workload alignment. Hazard pointers excel when memory bounds are non-negotiable and reader pauses are unpredictable. Epoch-based reclamation dominates when traversals are deep, threads are cooperative, and bursty memory consumption is tolerable. Reference counting—when implemented atomically with split counters or deferred decrements—offers compositional simplicity at the cost of per-pointer cache traffic.

Mature systems often hybridize. The Linux kernel pairs RCU for read-mostly structures with explicit reference counting at object boundaries. Modern databases combine epoch reclamation for indexes with hazard pointers for long-lived auxiliary structures. The right answer is rarely a single mechanism but a layered composition matched to access patterns.

The deeper principle: lock-freedom shifts complexity rather than eliminating it. The progress guarantee at the algorithmic layer is purchased with intricate machinery at the memory layer. Mastering this trade-off—reasoning formally about epochs, hazards, and quiescence—is what separates correct concurrent code from code that merely appears to work.