When you write x = 1; ready = true; in a concurrent program, you might reasonably expect these operations to become visible to other threads in that order. This assumption is catastrophically wrong on modern hardware. Between your source code and actual memory operations lies a complex machinery of compiler optimizations, store buffers, and cache coherence protocols—each with its own agenda for reordering your carefully sequenced instructions.

The performance of modern processors depends critically on their ability to execute operations out of order. A CPU that waited for every store to propagate to main memory before proceeding would run perhaps fifty times slower than contemporary chips. Store buffers, write combining, and speculative execution all exist because memory is slow and parallelism is fast. But this performance comes at a cost: the illusion of sequential execution that makes reasoning about programs tractable simply does not hold in concurrent contexts.

Understanding memory barriers requires grasping two distinct reordering sources that operate at different levels of your system stack. Compilers reorder and eliminate memory operations based on the as-if rule—they may transform your code arbitrarily provided single-threaded semantics are preserved. Hardware then applies its own reorderings invisible to the compiler. Correct concurrent code must constrain both, and the mechanisms for doing so are neither intuitive nor portable. What follows is a rigorous examination of these hidden transformations and the barriers that tame them.

Store Buffer Effects

Modern CPUs do not write directly to cache or memory. Instead, stores enter a store buffer—a small queue that decouples execution from the memory hierarchy. This buffer allows the processor to continue executing instructions while stores drain to cache in the background. From the perspective of the core that issued the store, the write appears complete immediately. From every other core's perspective, the write does not exist until it drains from the buffer and becomes globally visible.

This asymmetry creates observable reordering even on processors with relatively strong memory models. Consider the classic store buffer litmus test: Thread 1 executes x = 1; r1 = y; while Thread 2 executes y = 1; r2 = x;. With both x and y initially zero, sequential intuition suggests at least one thread must see the other's store. Yet on real hardware, r1 = r2 = 0 is a legal outcome. Each thread's store sits in its local buffer while its load reads the stale value from cache.
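A minimal C++11 rendering of this litmus test, using relaxed atomics so the program is free of undefined behavior while still permitting the counterintuitive outcome (names are illustrative):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.store(1, std::memory_order_relaxed);  // may sit in the store buffer...
    r1 = y.load(std::memory_order_relaxed); // ...while this load completes first
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    std::printf("r1=%d r2=%d\n", r1, r2); // r1=0 r2=0 is a legal outcome
}
```

A single run will rarely exhibit the reordering; litmus-testing tools repeat it millions of times under varied timing to expose the window.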

This behavior breaks fundamental synchronization patterns. A naive flag-based handoff where Thread 1 sets data = 42; flag = 1; and Thread 2 spins on while(!flag); use(data); can fail spectacularly. Thread 2 may observe flag = 1 before observing data = 42 if the stores drain in different orders or if the loads are satisfied from different cache states. The resulting bug (using uninitialized or partially initialized data) is non-deterministic, architecture-dependent, and nearly impossible to reproduce under debugging conditions.
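A sketch of the broken pattern exactly as the failing code would be written; the plain shared variables are deliberately incorrect and constitute a data race under the C and C++ memory models:

```cpp
#include <cstdio>

int data = 0; // plain int: no atomicity, no ordering guarantees
int flag = 0;

void producer() {
    data = 42; // nothing prevents the compiler or CPU from reordering
    flag = 1;  // these two stores relative to each other
}

void consumer() {
    while (!flag) {}           // may spin forever (see compiler section) or
    std::printf("%d\n", data); // may print 0 even after observing flag == 1
}
```

The corrected version using release/acquire atomics appears in the compiler fence discussion below.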

Store buffers also interact with store-to-load forwarding. When a core loads from an address present in its own store buffer, the hardware typically forwards the value directly, bypassing cache. This forwarding is local to the core. Other cores loading the same address see the old cache value. The result is that different cores observe memory states that never existed globally—a phenomenon that violates even basic intuitions about shared memory.
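A compact litmus sketch of forwarding (relaxed atomics again, names illustrative): each thread reads back its own store, which the hardware satisfies from the local store buffer before the store is visible anywhere else.

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

void thread1() {
    x.store(1, std::memory_order_relaxed);
    r1 = x.load(std::memory_order_relaxed); // forwarded from own buffer: 1
    r2 = y.load(std::memory_order_relaxed); // may still read 0
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    r3 = y.load(std::memory_order_relaxed); // forwarded from own buffer: 1
    r4 = x.load(std::memory_order_relaxed); // may still read 0
}

// The outcome r1 == r3 == 1 with r2 == r4 == 0 is permitted even under
// x86 TSO: each core observes its own store before any other core does,
// a combined state that never existed in any global memory order.
```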

Hardware memory barriers (fences) address store buffer effects by forcing drainage before subsequent operations. An mfence on x86 ensures all prior stores are globally visible before any subsequent loads execute. ARM's dmb variants and POWER's sync and lwsync provide more granular control, constraining specific classes of operations. The cost of these barriers ranges from tens to hundreds of cycles: acceptable at synchronization points but prohibitive for fine-grained use. Designing lock-free algorithms means minimizing barrier frequency while preserving correctness.
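Inserting full fences repairs the store buffer litmus test. A sketch using the portable C++11 fence, which compilers typically lower to mfence on x86, dmb ish on ARM, and sync on POWER (exact lowering varies by toolchain):

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst); // drain before loading
    r1 = y.load(std::memory_order_relaxed);
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_relaxed);
}

// With both fences in place, r1 == r2 == 0 is no longer a permitted outcome.
```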

Takeaway

Store buffers make writes invisible to other cores until explicitly fenced; any synchronization between threads that relies on store ordering without barriers is broken by design, regardless of how intuitive the code appears.

Compiler Fence Semantics

Before your code reaches the CPU, the compiler transforms it according to the as-if rule: any optimization is legal if single-threaded observable behavior is preserved. Compilers routinely reorder loads and stores, eliminate redundant accesses, hoist loop-invariant memory operations, and sink stores past conditionals. None of these transformations consider other threads because the C and C++ memory models historically provided no concurrent semantics. Your carefully ordered stores may reach the processor in arbitrary sequence—or not at all.

Consider a flag-based synchronization where Thread 1 sets data then sets ready, while Thread 2 polls ready then reads data. Without volatile or explicit barriers, an optimizing compiler may determine that ready is loop-invariant in Thread 2 and hoist its read, spinning forever on a cached value. It may reorder the stores in Thread 1 if it determines no single-threaded path observes them in sequence. The generated assembly bears no necessary relationship to source code order.
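A sketch of the hoisting hazard with a plain flag; real compilers commonly perform exactly this transformation at ordinary optimization levels:

```cpp
int ready = 0; // plain int: the compiler assumes no other thread writes it

void spin_wait() {
    // Legal single-threaded transformation: read `ready` once, then spin
    // forever on the cached register value:
    //
    //     if (!ready) { for (;;) {} }
    //
    while (!ready) {}
}
```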

Compiler barriers prevent reordering across specific points without generating hardware instructions. In GCC, asm volatile("" ::: "memory") tells the compiler that memory may be arbitrarily modified by the inline assembly, forcing it to reload values and complete pending stores around that point. MSVC provides _ReadWriteBarrier(), now deprecated in favor of C++11 atomics. These barriers cost nothing at runtime but constrain optimization. They are necessary but not sufficient for correctness: they cannot prevent hardware reordering.
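A sketch of the GCC-style compiler barrier applied to the producer from earlier. Note that this remains incomplete: it constrains only the compiler, and on weakly ordered hardware a hardware barrier is still required.

```cpp
int data = 0;
int flag = 0;

void producer() {
    data = 42;
    // Zero cost at runtime: emits no instruction, but the compiler may not
    // move memory operations across this point or cache values over it.
    asm volatile("" ::: "memory");
    flag = 1;
}
```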

The distinction between compiler and hardware barriers creates a layered correctness requirement. A compiler barrier ensures assembly order matches source order. A hardware barrier ensures execution order matches assembly order across cores. Correct synchronization typically requires both. Modern C++ atomics provide this through memory order specifications: memory_order_acquire and memory_order_release generate both compiler barriers and appropriate hardware instructions for the target architecture.
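The complete, portable fix for the flag pattern uses C++11 release/acquire atomics. This is the standard idiom; the data variable itself can remain non-atomic because the atomic flag orders access to it:

```cpp
#include <atomic>
#include <cstdio>

int data = 0;               // ordinary data, guarded by the flag
std::atomic<int> ready{0};

void producer() {
    data = 42;
    // Release store: compiler barrier plus whatever hardware barrier the
    // target needs. All writes before it are visible to an acquiring reader.
    ready.store(1, std::memory_order_release);
}

void consumer() {
    // Acquire load: once it observes 1, all writes before the matching
    // release store are guaranteed visible.
    while (!ready.load(std::memory_order_acquire)) {}
    std::printf("%d\n", data); // prints 42
}
```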

Understanding this layering explains why volatile is insufficient for synchronization. Volatile prevents the compiler from optimizing away or reordering accesses to a specific variable, but provides no hardware barriers and no ordering guarantees relative to non-volatile accesses. A volatile flag variable may be reordered relative to the data it guards at the hardware level. C11/C++11 atomics exist precisely because volatile semantics are inadequate for concurrent programming, despite decades of incorrect folklore suggesting otherwise.
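A sketch contrasting the two: the volatile version compiles without complaint and fails exactly as described.

```cpp
int data = 0;
volatile int flag_v = 0; // volatile: accesses are not elided, and are ordered
                         // only relative to other *volatile* accesses

void producer_volatile() {
    data = 42;  // non-volatile: may be reordered past the flag store by the
    flag_v = 1; // compiler, and by the hardware on weak architectures
}
```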

Takeaway

Compiler barriers and hardware barriers solve different problems—compiler barriers prevent optimization-induced reordering in generated code, while hardware barriers prevent execution-time reordering across cores; correct synchronization requires addressing both layers explicitly.

Architecture Variations

x86 implements Total Store Order (TSO), one of the strongest memory models in commodity hardware. Stores are never reordered with other stores. Loads are never reordered with other loads. Stores are never reordered with prior loads. The only permitted reordering is that loads may be reordered before prior stores to different addresses, which is the store buffer effect described earlier. This means much under-synchronized x86 code works accidentally, masking barrier omissions that would fail catastrophically on other architectures.

ARM and POWER implement much weaker models where almost any operation may be reordered with any other absent explicit barriers. Independent stores may become visible in different orders to different observers. Loads may be speculated and satisfied out of order. ARM and POWER do at least preserve address dependencies between loads; DEC Alpha famously did not, so even dependent loads, where the address of the second load comes from the value of the first, required barriers there. Code that relies on x86's implicit ordering guarantees will exhibit ordering failures on these architectures.

The practical consequence is that portable concurrent code cannot rely on architecture-specific behavior. A lock-free queue developed and tested on x86 may pass millions of test iterations, then corrupt data immediately on ARM. The ARM memory model permits reorderings that never occur in practice on x86 but are architecturally legal and do occur on real hardware under specific cache pressure and timing conditions.

Memory ordering annotations in C++11 provide architecture abstraction. memory_order_seq_cst provides sequential consistency, implemented with the strongest barriers each architecture requires. memory_order_acquire loads and memory_order_release stores compile to plain loads and stores on x86, whose TSO model already provides those guarantees, but require dmb barriers on ARMv7 (ARMv8 supplies dedicated ldar and stlr instructions). memory_order_relaxed provides atomicity without ordering, enabling maximum performance when ordering is unnecessary or provided by other means.
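A sketch of typical code generation for each ordering; lowerings vary by compiler and version, and inspecting your own toolchain's output is the reliable check:

```cpp
#include <atomic>

std::atomic<int> v{0};

int  load_relaxed()       { return v.load(std::memory_order_relaxed); }
// x86: mov           ARMv8: ldr        -- atomicity only, no ordering

int  load_acquire()       { return v.load(std::memory_order_acquire); }
// x86: mov (free)    ARMv8: ldar       ARMv7: ldr + dmb

void store_release(int x) { v.store(x, std::memory_order_release); }
// x86: mov (free)    ARMv8: stlr       ARMv7: dmb + str

void store_seq_cst(int x) { v.store(x, std::memory_order_seq_cst); }
// x86: xchg (full barrier)   ARMv8: stlr   POWER: sync-based sequence
```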

Barrier costs vary dramatically across architectures. On x86, mfence costs roughly 100 cycles due to store buffer drainage. On ARM, barrier costs depend on the specific implementation and memory system state, ranging from tens to hundreds of cycles. POWER's sync instruction is notoriously expensive. These costs make barrier placement critical for lock-free performance. The goal is always to use the weakest barrier that provides correctness—typically acquire/release pairs rather than full sequential consistency.
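One common case where the weakest ordering suffices is a statistics counter: every increment must be atomic, but nothing is synchronized through it, so relaxed ordering avoids all barrier cost. A minimal sketch:

```cpp
#include <atomic>
#include <cstdint>

// Incremented from many threads, read occasionally for reporting.
std::atomic<std::uint64_t> events{0};

void record_event() {
    events.fetch_add(1, std::memory_order_relaxed);
}
```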

Takeaway

x86's strong memory model masks barrier omissions that cause failures on ARM and POWER; portable concurrent code must use explicit memory ordering annotations that compile to architecture-appropriate barriers rather than relying on platform-specific behavior.

Memory barriers exist because modern hardware and compilers optimize aggressively under assumptions that break in concurrent contexts. Store buffers create windows where writes are invisible to other cores. Compilers reorder operations freely when single-threaded semantics are preserved. These optimizations are essential for performance—and lethal for unsynchronized concurrent access.

The layered nature of memory ordering requires understanding both compilation and execution. Compiler barriers constrain code generation. Hardware barriers constrain execution. C++11 atomics combine both, providing portable abstractions that emit architecture-appropriate instructions. Relying on volatile, implicit ordering, or x86-specific guarantees produces code that is broken by specification even when it appears to work.

Mastering memory barriers means accepting that intuition fails at this level. The only reliable reasoning is formal: identify which orderings your algorithm requires, express them through explicit memory order annotations, and verify that these constraints are sufficient across all architectures you target. The alternative is subtle corruption that manifests only under load, only on certain hardware, only when you least expect it.