Every high-performance network server confronts the same fundamental constraint: the kernel boundary. Each system call to check socket readiness, accept a connection, or transfer data represents a context switch—a cost that compounds rapidly when a server manages tens of thousands of concurrent connections. The history of asynchronous I/O is a history of minimizing these boundary crossings while maximizing the useful work returned per crossing.
The progression from select to poll to epoll to io_uring on Linux—and the parallel evolution of I/O Completion Ports on Windows—represents increasingly sophisticated answers to this problem. Each model makes different assumptions about how the kernel and user space should coordinate, how file descriptors should be tracked, and where actual I/O execution belongs. These are not interchangeable wrappers around the same abstraction. They are fundamentally different architectures with distinct performance envelopes and complexity profiles.
Understanding these models requires more than familiarity with their APIs. It demands analyzing their algorithmic complexity, their memory access patterns, and their implications for application-level concurrency design. What follows is a rigorous comparison of the dominant async I/O approaches—examining where each pays its costs, what those costs buy at scale, and which architectural decisions explain the performance gaps between them.
Event Loop Mechanics
The reactor pattern provides the architectural foundation for most Unix-based async I/O. In this model, an event loop monitors a set of file descriptors for readiness—the ability to perform a non-blocking read or write—and dispatches registered callbacks when a descriptor transitions to a ready state. The loop is typically single-threaded, eliminating synchronization overhead but placing an upper bound on dispatch throughput equal to what one core can sustain.
The earliest implementations, select and poll, require the application to pass the entire monitored descriptor set to the kernel on every invocation. select encodes this set as a fixed-size bitmap, typically bounded at 1024 descriptors by FD_SETSIZE. poll uses a dynamically sized array of pollfd structures, lifting the hard limit but not the underlying cost. Both exhibit O(n) scanning behavior: the kernel iterates every descriptor to determine readiness, and the application iterates again to find which entries were marked. When n reaches tens of thousands, this redundant scanning dominates CPU time.
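A minimal sketch makes the double scan concrete. This is hypothetical application code, assuming fds[] holds the monitored descriptors and handle_readable() is the application's callback; note that the bitmap must be rebuilt and rescanned on every iteration.

```c
#include <sys/select.h>

/* Sketch of the select() loop, showing why its cost is O(n) per call. */
void select_loop(int *fds, int nfds, void (*handle_readable)(int)) {
    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);
        int maxfd = -1;
        for (int i = 0; i < nfds; i++) {   /* O(n): rebuild the bitmap every call */
            FD_SET(fds[i], &readfds);
            if (fds[i] > maxfd) maxfd = fds[i];
        }
        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) < 0)
            break;
        for (int i = 0; i < nfds; i++)     /* O(n): rescan to find the ready entries */
            if (FD_ISSET(fds[i], &readfds))
                handle_readable(fds[i]);
    }
}
```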
epoll eliminated this redundancy by maintaining a persistent interest set inside kernel memory. The epoll_ctl call modifies this set incrementally—adding, modifying, or removing individual descriptors—while epoll_wait returns only the descriptors that are actually ready. Per-call cost shifts from O(n) over the total monitored set to O(k) over the ready set, a critical distinction when n reaches hundreds of thousands but k remains small per iteration.
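The epoll equivalent, sketched below under the same assumptions (a fds[] array and a handle_readable() callback), pays the registration cost once and then touches only the ready set on each wakeup.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 64

/* Sketch of the epoll pattern: register once, then reap only ready fds. */
void epoll_loop(int *fds, int nfds, void (*handle_readable)(int)) {
    int epfd = epoll_create1(0);
    for (int i = 0; i < nfds; i++) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[i] };
        epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev);  /* one-time registration */
    }
    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        /* O(k): returns only the descriptors that became ready */
        int k = epoll_wait(epfd, events, MAX_EVENTS, -1);
        if (k < 0) break;
        for (int i = 0; i < k; i++)
            handle_readable(events[i].data.fd);
    }
    close(epfd);
}
```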
Thread pool sizing in reactor-based systems introduces a classic queuing theory challenge. A pool too small starves I/O-bound callbacks and blocks dispatch of newly ready events. A pool too large wastes memory and introduces context-switching overhead that erodes the event-driven advantage. The practical heuristic approximates N × (1 + W/S), where N is available cores, W is average wait time per task, and S is average service time. For CPU-bound dispatch this collapses to N. For mixed workloads, profiling under representative load remains the only reliable calibration.
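As a worked illustration of the heuristic: on a 16-core machine where callbacks average 9 ms of downstream wait for every 1 ms of CPU time, the formula suggests roughly 16 × (1 + 9/1) = 160 threads, while a purely CPU-bound dispatch loop (W ≈ 0) collapses to 16. These numbers are starting points for the profiling the heuristic cannot replace, not final answers.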
A subtlety often missed is edge-triggered versus level-triggered notification. Level-triggered mode, the epoll default, re-reports a ready descriptor on every epoll_wait call until the condition clears. Edge-triggered mode reports only the transition to readiness, requiring the application to fully drain the descriptor or risk missing subsequent events. Edge-triggered semantics reduce redundant kernel-to-user wakeups but demand more disciplined application code—a direct trade-off between kernel efficiency and implementation complexity that shapes every reactor-based framework's internal architecture.
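Below is a sketch of the drain discipline edge-triggered mode imposes, assuming a non-blocking descriptor and a hypothetical consume() handler: the loop must read until EAGAIN, because epoll will not re-report data that was already present at the last transition.

```c
#include <errno.h>
#include <unistd.h>

/* After an EPOLLET notification, drain the descriptor completely or
   risk never being woken for the bytes left behind. */
void drain_edge_triggered(int fd, void (*consume)(const char *, ssize_t)) {
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            consume(buf, n);   /* keep reading: more data may remain */
        } else if (n == 0) {
            close(fd);         /* peer closed the connection */
            return;
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return;            /* fully drained; epoll re-arms on the next transition */
        } else {
            close(fd);         /* hard error */
            return;
        }
    }
}
```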
Takeaway: The reactor pattern's performance ceiling is not the event loop itself but the interaction between monitoring granularity, notification semantics, and thread pool calibration—each a separate lever with compounding effects at scale.
io_uring Architecture
Linux's io_uring, introduced in kernel 5.1, represents a paradigm shift in the relationship between user space and the kernel. Where epoll reduced the cost of monitoring descriptors, io_uring attacks the cost of performing the I/O itself. Its core innovation is a pair of lock-free ring buffers shared between user space and the kernel: the submission queue and the completion queue, mapped into the same physical memory pages accessible from both sides of the privilege boundary.
The application writes I/O requests—submission queue entries, or SQEs—into the submission ring. The kernel consumes these entries asynchronously, executes the requested operations, and writes completion queue entries into the completion ring. Because both rings reside in shared memory, this exchange can occur with zero system calls in the hot path when the kernel polls the submission queue directly. Enabling IORING_SETUP_SQPOLL dedicates a kernel thread to continuously drain submissions, eliminating even the io_uring_enter call that would otherwise be needed to notify the kernel of new work.
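A minimal liburing sketch of this submit-and-complete cycle, assuming liburing is available and with error handling trimmed for brevity: one read is described in an SQE, submitted, and its result reaped from the CQ ring.

```c
#include <liburing.h>

/* Submit a single read through the SQ ring and reap its completion. */
int read_via_uring(int fd, char *buf, unsigned len) {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0)   /* 8-entry SQ/CQ rings */
        return -1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, len, 0);   /* describe the operation */
    io_uring_submit(&ring);                     /* one syscall submits it */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);             /* block for the completion */
    int res = cqe->res;                         /* bytes read, or -errno */
    io_uring_cqe_seen(&ring, cqe);              /* advance the CQ ring */

    io_uring_queue_exit(&ring);
    return res;
}
```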
The efficiency gains over epoll are structural, not incremental. A typical epoll-based read requires at minimum two system calls: epoll_wait to detect readiness, then read to transfer data. io_uring collapses this into a single SQE. At 100,000 operations per second, eliminating one or two syscalls per operation removes between 100,000 and 200,000 user-kernel boundary crossings per second. This is not a micro-optimization—it is a measurable reduction in CPU overhead that directly translates to higher throughput or lower tail latency under load.
io_uring also supports operation chaining through linked SQEs. A linked sequence of accept, read, and write can execute an entire request-response cycle without returning to user space between steps. Combined with fixed buffers—pre-registered memory regions that bypass per-operation page mapping—and fixed files—pre-registered descriptors that skip per-operation file table lookups—io_uring systematically eliminates the per-operation kernel bookkeeping that epoll-based architectures cannot avoid.
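A hedged sketch of SQE linking with liburing: a read chained to a write, so an echo step never returns to user space between operations. A production version would also handle the short-read problem, since the linked write's length is fixed at submission time; fixed buffers and files would layer on via io_uring_register_buffers() and io_uring_register_files().

```c
#include <liburing.h>

/* Chain a read to a write with IOSQE_IO_LINK: the kernel runs the write
   only after the read completes, with no user-space round trip between. */
void submit_echo_chain(struct io_uring *ring, int fd, char *buf, unsigned len) {
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, fd, buf, len, 0);
    sqe->flags |= IOSQE_IO_LINK;   /* write below runs only after this read */

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, fd, buf, len, 0);  /* caveat: len is fixed here,
                                                   even if the read was short */

    io_uring_submit(ring);         /* both operations in one submission */
}
```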
The trade-off is complexity and attack surface. io_uring grants user space a direct channel into kernel I/O paths, and this power has produced a series of privilege escalation vulnerabilities since its introduction. Several container runtimes and security-hardened deployments disable io_uring via seccomp filters. The engineering calculus is explicit: io_uring delivers the lowest-overhead async I/O available on Linux today, but accessing that performance envelope requires accepting its current security posture and investing in the operational discipline to mitigate it.
Takeaway: io_uring's fundamental insight is that the cheapest system call is the one you never make—shared-memory ring buffers between user space and the kernel eliminate the per-operation boundary crossing that every previous model accepted as unavoidable cost.
Proactor vs Reactor
The distinction between the reactor and proactor patterns is the most consequential architectural divide in async I/O design. In a reactor, the operating system notifies the application that a descriptor is ready for an operation—the application then performs the operation itself. In a proactor, the application submits the operation to the operating system, and the system notifies the application when the operation is complete. This single difference determines who owns I/O execution, who manages buffers during the operation, and where partial completion logic resides.
Windows I/O Completion Ports implement the proactor model natively. The application issues an overlapped I/O call—WSARecv, ReadFile—providing a pre-allocated buffer and a completion key. The kernel performs the I/O asynchronously and posts a completion packet to the port when the transfer finishes. Worker threads dequeue completions via GetQueuedCompletionStatus. Critically, the kernel manages thread scheduling directly, waking only enough threads to maintain a configurable target concurrency level. This solves the thread pool sizing problem at the OS layer rather than leaving it to application heuristics.
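A sketch of an IOCP worker loop, assuming the socket has already been associated with the port via CreateIoCompletionPort and an overlapped WSARecv is in flight; handle_completion() is a hypothetical dispatch function.

```c
#include <winsock2.h>
#include <windows.h>

/* Hypothetical application-level dispatch for a finished operation. */
extern void handle_completion(ULONG_PTR key, OVERLAPPED *ov, DWORD bytes);

DWORD WINAPI iocp_worker(LPVOID arg) {
    HANDLE port = (HANDLE)arg;
    for (;;) {
        DWORD bytes;
        ULONG_PTR key;        /* completion key set at association time */
        OVERLAPPED *ov;
        /* Blocks until the kernel posts a completion packet; the kernel
           wakes only enough workers to hold the target concurrency. */
        if (!GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE)) {
            if (ov == NULL) continue;   /* timeout or port error */
            /* ov != NULL: the I/O itself failed; fall through to dispatch */
        }
        handle_completion(key, ov, bytes);
    }
}
```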
Traditional Unix async I/O—select, poll, epoll—follows the reactor model. The application receives readiness notifications, performs I/O with standard system calls, and handles partial reads or writes itself. This provides fine-grained control over buffer management and error recovery but pushes significant complexity into user space. Every reactor-based server must implement its own logic for EAGAIN handling, for managing read and write buffers across partial completions, and for deciding when to re-arm descriptors in edge-triggered mode.
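As a concrete example of that user-space burden, the sketch below shows the partial-write bookkeeping a reactor server carries per connection; conn_t and its fields are hypothetical.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical per-connection output state a reactor server must track. */
typedef struct {
    char  *out;        /* pending output buffer */
    size_t off, len;   /* bytes already sent / total bytes queued */
} conn_t;

/* Returns 1 when fully flushed, 0 if EAGAIN forces a retry on the next
   writability notification, -1 on a hard error. */
int flush_pending(int fd, conn_t *c) {
    while (c->off < c->len) {
        ssize_t n = write(fd, c->out + c->off, c->len - c->off);
        if (n >= 0) {
            c->off += (size_t)n;        /* partial progress is normal */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return 0;                   /* re-arm for EPOLLOUT, resume later */
        } else {
            return -1;
        }
    }
    return 1;
}
```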
io_uring blurs this historical boundary. While it runs on Linux alongside reactor-era tools, its submission-and-completion queue architecture is structurally a proactor. The application submits operations and the kernel completes them, posting results asynchronously. This makes io_uring closer in design philosophy to IOCP than to epoll. The implication is practical and immediate: frameworks designed around reactor semantics—early libuv, initial Tokio implementations—require non-trivial adaptation layers to exploit io_uring's completion-based model without abandoning their existing APIs.
The choice between these models is not abstract preference. Reactor architectures offer simpler mental models for applications needing step-by-step control over each I/O phase. Proactor architectures yield higher throughput ceilings by delegating I/O execution to the kernel, at the cost of pre-allocated buffers and out-of-order completion handling. The convergence of both major platforms toward completion-based designs—io_uring on Linux, IOCP on Windows—suggests an industry verdict: the kernel should own the I/O mechanics, and user space should own the application logic built on top.
Takeaway: The reactor asks 'is it ready?' while the proactor asks 'is it done?'—that single question about who performs the I/O reshapes buffer ownership, error handling strategy, and the achievable upper bound on throughput.
The progression from select to io_uring is not a story of linear improvement. It is a series of deliberate architectural decisions about where to place complexity—in user space or kernel space, in per-call scanning or persistent state, in readiness notification or completion notification. Each model occupies a distinct position in the design space, and the correct choice depends on connection count, operation mix, latency budget, and security constraints.
The unifying principle is amortization. epoll amortizes descriptor registration across calls. io_uring amortizes syscall overhead across batched submissions. IOCP amortizes thread management across completion dispatches. Every generation of async I/O found a new cost to spread across operations, and the performance gains followed directly from that insight.
If your system handles fewer than ten thousand connections, the differences between these models may be invisible in profiling data. If it handles millions, they determine whether the system works at all. Know where your system sits on that spectrum, and choose the model whose trade-offs align with your actual constraints.