NUMA Architecture: Programming for Non-Uniform Memory

NUMA architectures introduce non-uniform memory access costs that break the uniform-memory assumption underlying most performance analysis.

Measuring actual latency and bandwidth topology under load is essential because firmware-reported distances and idle benchmarks misrepresent production behavior.

Thread placement via CPU affinity and first-touch initialization patterns determines whether parallel workloads scale linearly or stall at socket boundaries.

Memory allocation policies—local, interleaved, and explicit binding—should be chosen to match access patterns rather than applied uniformly.

Effective NUMA optimization composes topology awareness, thread placement, and allocation policy into a coherent strategy aligned with workload structure.

Modern multi-socket servers shatter a foundational assumption that programmers have relied upon for decades: that memory access cost is uniform. On a contemporary two-socket Xeon or EPYC system, a load from local DRAM completes in roughly 80 nanoseconds, while the same load targeting memory attached to a remote socket may require 130 to 150 nanoseconds. The ratio appears modest until you multiply it across billions of operations per second.

This asymmetry, known as Non-Uniform Memory Access, transforms performance engineering into a topological problem. The question is no longer merely "how fast is my algorithm?" but "where does my data live relative to the thread that consumes it?" Cache-conscious design, once sufficient for single-socket workloads, becomes insufficient when the cache miss path traverses an inter-socket interconnect such as UPI or Infinity Fabric.

What follows is a rigorous treatment of NUMA from three perspectives: the measurement of latency topology, the placement of threads relative to their working sets, and the policies that govern where allocated pages physically reside. Each dimension presents distinct trade-offs, and the optimal configuration depends on the access pattern, the working set size, and the degree of inter-thread sharing. Mastering these techniques is no longer optional for systems targeting throughput at scale; it is the difference between linear and sub-linear scaling as core counts climb.

Memory Latency Topology: Measuring the Cost of Distance

Every NUMA system exposes a topology graph in which each node groups a set of cores with the memory attached to them, and edges carry weighted access costs. On Linux, this graph is materialized through /sys/devices/system/node/nodeN/distance, which encodes ACPI SLIT values: each file holds one row of the matrix, the distances from node N to every node. A typical two-socket system reports a local distance of 10 and a remote distance of 21, a ratio that approximates but does not precisely equal the measured latency penalty.
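
As a rough illustration, the firmware-reported matrix can be read straight from sysfs. The sketch below assumes the standard Linux path named above and assumes node IDs are numbered contiguously from 0; the helper names (parse_distance_row, read_distance_matrix, remote_to_local_ratio) are hypothetical, not part of any library.

```python
from pathlib import Path

NODE_ROOT = Path("/sys/devices/system/node")  # standard Linux sysfs location


def parse_distance_row(text):
    """Parse one nodeN/distance file: space-separated integers, one per node."""
    return [int(tok) for tok in text.split()]


def read_distance_matrix(root=NODE_ROOT):
    """Return {node_id: [distance to each node]}, or {} if no NUMA info is exposed."""
    matrix = {}
    if not root.is_dir():
        return matrix
    for node_dir in root.glob("node[0-9]*"):
        dist_file = node_dir / "distance"
        if dist_file.is_file():
            node_id = int(node_dir.name[len("node"):])
            matrix[node_id] = parse_distance_row(dist_file.read_text())
    return matrix


def remote_to_local_ratio(row, node_id):
    """Largest remote distance divided by the local one.

    Assumes contiguous node IDs, so row[node_id] is the local entry.
    """
    local = row[node_id]
    remote = [d for i, d in enumerate(row) if i != node_id]
    return max(remote) / local if remote else 1.0


if __name__ == "__main__":
    for node, row in sorted(read_distance_matrix().items()):
        print(f"node{node}: {row}")
```

For the two-socket example above, the row [10, 21] gives a ratio of 2.1. That figure is only the firmware's advisory number and should be checked against measured latency.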

Empirical measurement is essential because firmware-reported distances are advisory. The canonical methodology uses a pointer-chasing microbenchmark over a randomized linked list much larger than the last-level cache, so that every step is a dependent load that misses in cache. By pinning the measuring thread to a specific CPU via sched_setaffinity and allocating the chase buffer on a specific node via numa_alloc_onnode, you can construct an N×N latency matrix that captures real interconnect behavior on your hardware.
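
The structure of such a pointer chase can be sketched in a few lines. This version uses Sattolo's algorithm (a choice made here, not something the methodology prescribes) to build a random permutation that forms one cycle, so the chase visits every slot before repeating. In Python the interpreter overhead dominates the timing, so the reported nanoseconds do not reflect DRAM latency. A real measurement would port the same loop to a compiled language and combine it with the pinning and node-local allocation described above.

```python
import random
import time


def build_chase_cycle(n, seed=0):
    """Sattolo's algorithm: a random permutation that forms one n-element cycle."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i guarantees a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain for `steps` dependent loads (steps > 0).

    Returns (final index, average nanoseconds per step).
    """
    idx = start
    t0 = time.perf_counter_ns()
    for _ in range(steps):
        idx = nxt[idx]  # each load depends on the previous one
    elapsed = time.perf_counter_ns() - t0
    return idx, elapsed / steps


if __name__ == "__main__":
    chain = build_chase_cycle(1 << 20)
    _, ns_per_step = chase(chain, 1 << 20)
    print(f"{ns_per_step:.1f} ns per step (interpreter overhead included)")
```

The dependency between consecutive loads is what matters: it prevents the hardware from overlapping or prefetching accesses, so each step exposes the full latency of one miss.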

The Intel Memory Latency Checker automates this characterization, producing both idle and loaded latency tables; the open-source lmbench suite covers the idle case with its lat_mem_rd benchmark. Loaded latency is the critical metric: under bandwidth saturation, remote access penalties can balloon to 3× or 4× their idle values as the interconnect queues fill. This non-linearity is invisible to first-order analysis and is the source of many production scaling surprises.

Beyond latency, bandwidth asymmetry matters equally. A single socket can typically sustain its full local DRAM bandwidth, but aggregate remote bandwidth is capped by interconnect width—often 30 to 40 GB/s per link versus 200+ GB/s of local memory bandwidth. Workloads that stream large datasets are therefore bandwidth-bound on remote access long before they become latency-bound.

Characterizing your specific platform with both micro and macro benchmarks before optimization is mandatory. Generic assumptions—that remote access is 1.5× slower, that all nodes are symmetric, that the topology is fully connected—frequently fail on modern hardware where chiplet architectures introduce intra-socket NUMA domains and asymmetric link topologies.

Takeaway
Treat NUMA topology as a measured property of your specific hardware under your specific load, not as a fixed constant. The interconnect is a queue, and its behavior under contention dominates the average case.

Thread Placement Strategies: Co-locating Computation with Data

Once topology is understood, the next discipline is placement: ensuring that threads execute on cores whose attached memory controller hosts their working set. The Linux kernel offers two primary mechanisms, one for CPUs and one for memory. sched_setaffinity binds a thread to a CPU set, constraining the scheduler. set_mempolicy sets the calling thread's memory policy, and its per-address-range cousin mbind controls where the physical pages of a specific region are allocated and, optionally, migrated.
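
A minimal sketch of the CPU half of this, using Python's os.sched_setaffinity wrapper around the same system call. It assumes Linux, where pid 0 refers to the calling thread. The sysfs file /sys/devices/system/node/nodeN/cpulist is not mentioned above but is the usual place to find a node's CPUs; the helper names are hypothetical.

```python
import os


def parse_cpulist(text):
    """Expand a sysfs CPU list such as '0-3,8,10-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty list
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def pin_current_thread(cpus):
    """Restrict the calling thread to `cpus`; return the previous affinity set."""
    previous = os.sched_getaffinity(0)  # pid 0: the calling thread on Linux
    allowed = set(cpus) & previous      # design choice: never widen the current mask
    if not allowed:
        raise ValueError("none of the requested CPUs are available to this thread")
    os.sched_setaffinity(0, allowed)
    return previous
```

A caller would typically read node0's cpulist, pass the parsed set to pin_current_thread, and only then allocate and initialize the data that thread will work on, so that affinity is in place before any page is touched.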

The principle of first-touch allocation underpins most NUMA optimization on Linux. When a thread writes to a page for the first time, the kernel allocates the physical page from the NUMA node local to the executing CPU. This means the initialization phase of your application dictates physical layout. Naive serial initialization, where one thread zeroes a large array, places the entire array on one node, which is catastrophic if subsequent work is partitioned across sockets.

The correct pattern is to parallelize initialization with the same partitioning used by the computation phase. If thread i will process array slice A[i*N/T : (i+1)*N/T], then thread i must also be the one to first-touch that slice. This single discipline often yields 30 to 50 percent throughput improvements on memory-bound parallel workloads.
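
The partitioning rule can be sketched as follows, with each worker writing only the slice it will later process. The function names are hypothetical, and the workers are left unpinned to keep the sketch short. For first-touch to place pages on the intended node, each worker would also need its CPU affinity set before it writes, and slice boundaries should fall on page boundaries, as they do in the example at the bottom.

```python
import mmap
import threading


def slice_bounds(i, n, t):
    """Half-open range [start, stop) owned by thread i of t over n elements."""
    return i * n // t, (i + 1) * n // t


def first_touch_init(buf, nthreads, fill=b"\x00"):
    """Each worker writes only its own slice, so it takes the page faults for it."""
    n = len(buf)

    def worker(i):
        start, stop = slice_bounds(i, n, nthreads)
        buf[start:stop] = fill * (stop - start)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(nthreads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()


if __name__ == "__main__":
    size = 64 * mmap.PAGESIZE
    buf = mmap.mmap(-1, size)  # anonymous mapping; pages are populated on first write
    first_touch_init(buf, nthreads=4)  # 16 whole pages per worker
```

The computation phase should then call slice_bounds with the same arguments, so that the thread that touched a slice is also the one that processes it.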

For workloads with cross-socket communication, placement becomes a graph partitioning problem. Threads that share data heavily should reside on the same socket; threads that operate independently can be spread to maximize aggregate bandwidth. Tools like numactl --cpunodebind provide coarse control, while libraries such as hwloc expose the topology programmatically for fine-grained placement logic.

Beware the scheduler's load balancer, which may migrate threads across sockets in pursuit of CPU utilization, defeating careful placement. Hard affinity via sched_setaffinity or cgroup cpuset constraints is necessary for latency-sensitive workloads. A single cross-socket migration leaves the thread with cold caches, a cold TLB, and a working set that is now remote, and the accumulated penalty can run to milliseconds of wall time.

Takeaway
Placement is not a hint; it is a contract. The thread that first touches a page owns its location, and the scheduler will violate your assumptions unless you explicitly forbid it.

NUMA-Aware Allocation: Choosing the Right Memory Policy

Memory allocation policy is the third lever, complementing topology awareness and thread placement. Linux supports four primary policies via set_mempolicy: MPOL_DEFAULT (first-touch local), MPOL_BIND (strict allocation from a node set), MPOL_PREFERRED (soft preference with fallback), and MPOL_INTERLEAVE (round-robin across nodes).

Interleaved allocation is the usual choice for read-mostly data structures accessed uniformly by all threads, such as configuration tables, lookup indexes, or shared caches. By distributing pages across nodes at page granularity, interleave averages access cost and balances bandwidth load across memory controllers. The trade-off is that locality is surrendered: with N nodes only about 1/N of accesses land on local memory, so every thread sees an averaged latency rather than a bimodal local/remote distribution.
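
The averaging effect can be made concrete with an idealized model. The sketch below assumes pages are spread evenly in strict round-robin order, that accesses are uniform across pages, and that every remote node costs the same; real systems deviate from all three. The function names are hypothetical, and the 140 ns remote figure is simply the midpoint of the 130 to 150 ns range quoted earlier.

```python
def node_for_page(page_index, nodes):
    """Idealized round-robin interleave: page k lands on nodes[k mod len(nodes)]."""
    return nodes[page_index % len(nodes)]


def expected_interleave_latency(local_ns, remote_ns, num_nodes):
    """Mean load latency when pages are spread evenly and accessed uniformly."""
    local_fraction = 1.0 / num_nodes
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    print([node_for_page(k, [0, 1]) for k in range(4)])  # [0, 1, 0, 1]
    print(expected_interleave_latency(80, 140, 2))       # 110.0
```

On two nodes the model gives 110 ns per load, midway between local and remote. The same averaging is what makes interleave attractive for uniformly shared data and wasteful for partitioned data.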

Local allocation, by contrast, is correct for partitioned workloads where each thread operates on a private working set. Database shards, parallel sort partitions, and per-thread arenas all benefit from strict locality. Combined with thread pinning, local allocation can deliver near-ideal scaling, with each socket operating as an effectively independent system.

For workloads that defy simple classification, explicit per-region placement via mbind gives surgical control. A typical pattern allocates hot, frequently-accessed structures locally while interleaving cold, infrequently-touched data. This requires profiling to identify access frequencies, but the resulting policy matches physical layout to logical access patterns precisely.
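
The decision step of that pattern, turning profiled access counts into a per-region policy label, might look like the sketch below. Everything in it is hypothetical: the function name, the region names, and the 80 percent coverage threshold are illustrative choices, and the actual binding (one mbind call per region with the chosen policy) is deliberately left out.

```python
def plan_region_policies(access_counts, hot_fraction=0.8):
    """Label each region 'local' (hot) or 'interleave' (cold).

    Regions are ranked by access count. The hottest ones that together cover
    `hot_fraction` of all accesses are marked local; the rest are interleaved.
    """
    total = sum(access_counts.values())
    plan, covered = {}, 0
    ranked = sorted(access_counts.items(), key=lambda kv: kv[1], reverse=True)
    for name, count in ranked:
        if total and covered < hot_fraction * total:
            plan[name] = "local"
        else:
            plan[name] = "interleave"
        covered += count
    return plan


if __name__ == "__main__":
    # Hypothetical profile: accesses observed per region.
    profile = {"index": 900, "log": 50, "archive": 50}
    print(plan_region_policies(profile))
```

With this profile, only the index region is marked local, since it alone accounts for more than 80 percent of accesses. Where to draw the threshold is a tuning decision that depends on how much local memory each node can spare.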

Modern allocators are increasingly NUMA-aware. jemalloc's arenas can serve as per-node heaps when each arena is tied to a node, tcmalloc offers per-CPU caches that align naturally with NUMA boundaries once threads are pinned, and the kernel's automatic NUMA balancing (AutoNUMA) attempts page migration based on observed access patterns. AutoNUMA is convenient but introduces overhead and migration latency; for predictable, high-performance workloads, explicit policy remains superior to automatic mechanisms.

Takeaway
Allocation policy should mirror the access pattern: interleave what is shared uniformly, localize what is partitioned, and place explicitly what defies simple description.

NUMA is not an exotic concern of supercomputing—it is the default reality of every multi-socket server and increasingly of single-socket chiplet designs. The performance engineer who treats memory as uniform pays a tax measured in tens of percent of achievable throughput, often more.

The three disciplines compose. Topology measurement reveals the cost structure. Thread placement co-locates computation with the data it consumes. Allocation policy ensures that data physically resides where it is needed. Omitting any one undermines the others; a perfectly placed thread accessing interleaved data still pays remote latency on half or more of its loads.

The deeper principle is that locality is a multi-dimensional property. Cache locality, page locality, and socket locality form a hierarchy, each level with its own measurement tools, optimization techniques, and failure modes. Mastery of NUMA is mastery of the outermost level—the one most often overlooked, and the one whose neglect most decisively constrains scaling at the high end.