Language is the most computationally demanding feat the human brain routinely performs. In roughly 600 milliseconds, the cortical language network transforms a pressure wave into a phoneme, a word, a syntactic structure, and a meaning, all while predicting what comes next. No engineered system replicates this with comparable efficiency, and no complete theoretical account yet explains how some 86 billion neurons conspire to make it happen.

What we do have is a rapidly maturing set of computational frameworks that are beginning to constrain the problem. From oscillatory models of hierarchical temporal integration to predictive coding architectures that cast comprehension as inference, theoretical neuroscience is converging on a picture of language processing that is both mathematically precise and neurobiologically plausible. These frameworks do not merely redescribe behavioral data—they generate falsifiable predictions about neural dynamics, cortical connectivity, and the information-theoretic signatures of linguistic computation.

This article examines three foundational pillars of this emerging picture. First, we consider how auditory cortex solves the temporal binding problem—integrating information across timescales spanning tens of milliseconds to several seconds, each corresponding to a distinct linguistic unit. Second, we explore how predictive processing transforms comprehension from passive decoding into active inference. Third, we confront what may be the hardest open question in the field: how neural circuits implement compositional semantics, combining discrete word meanings into structured propositional content. Together, these frameworks outline not just how the brain processes language, but what kind of computational system the brain must be to process language at all.

Hierarchical Temporal Processing

The speech signal is inherently multi-scale. Phonemic features unfold over 20–50 milliseconds, syllables over 100–300 milliseconds, words over roughly 300–500 milliseconds, and prosodic phrases over 500 milliseconds to several seconds. The brain does not simply sample this signal at one rate and reconstruct higher-order structure algorithmically. Instead, a compelling body of evidence suggests that cortical oscillations at distinct frequency bands—gamma (~30–80 Hz), theta (~4–8 Hz), and delta (~1–3 Hz)—selectively entrain to these different temporal granularities, creating a nested hierarchy of integration windows.
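
To make the decomposition concrete, here is a minimal sketch, in Python with NumPy and SciPy, of splitting an amplitude envelope into delta, theta, and gamma components. The synthetic envelope, sampling rate, and filter order are illustrative assumptions, not a model of cortical filtering.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(signal, low_hz, high_hz, fs):
    """Zero-phase Butterworth bandpass isolating one oscillatory band."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

fs = 1000  # Hz; assume a speech amplitude envelope sampled at 1 kHz
t = np.arange(0, 5, 1 / fs)
# Toy envelope: a 2 Hz phrasal rhythm and a 5 Hz syllabic rhythm,
# with fast 40 Hz fluctuations standing in for phonemic detail.
envelope = (1 + np.sin(2 * np.pi * 2 * t)) * (1 + np.sin(2 * np.pi * 5 * t)) \
           + 0.3 * np.sin(2 * np.pi * 40 * t)

bands = {
    "delta (1-3 Hz, phrases)":    bandpass(envelope, 1, 3, fs),
    "theta (4-8 Hz, syllables)":  bandpass(envelope, 4, 8, fs),
    "gamma (30-80 Hz, phonemes)": bandpass(envelope, 30, 80, fs),
}
for name, component in bands.items():
    print(f"{name}: RMS = {np.sqrt(np.mean(component**2)):.3f}")
```

In stimulus-tracking analyses, the same decomposition is applied to recorded neural signals, and each band-limited component is compared against the corresponding timescale of the speech envelope.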

The theoretical elegance of this oscillatory framework lies in its natural implementation of temporal receptive fields. A gamma-entrained neuronal population in primary auditory cortex effectively computes over phoneme-length segments, while theta-band oscillations in superior temporal cortex integrate across syllabic units. Delta-band activity in higher-order regions—including anterior temporal lobe and inferior frontal gyrus—tracks phrase- and sentence-level boundaries. This nesting is not accidental; cross-frequency coupling, particularly theta-gamma coupling, provides the mechanism by which fine-grained phonemic representations are chunked into syllabic packets and passed upstream.
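
One standard way to quantify theta-gamma coupling is a mean-vector-length modulation index in the style of Canolty and colleagues: extract theta phase and gamma amplitude with the Hilbert transform, then ask whether gamma amplitude concentrates at particular theta phases. The sketch below uses synthetic signals with coupling built in by construction; it illustrates the measure, not a cortical mechanism.

```python
import numpy as np
from scipy.signal import hilbert

def coupling_index(theta_band, gamma_band):
    """Mean vector length of gamma amplitude over theta phase.
    Near zero when amplitude and phase are unrelated."""
    theta_phase = np.angle(hilbert(theta_band))
    gamma_amp = np.abs(hilbert(gamma_band))
    return np.abs(np.mean(gamma_amp * np.exp(1j * theta_phase)))

fs = 1000
t = np.arange(0, 10, 1 / fs)
theta = np.sin(2 * np.pi * 6 * t)                    # 6 Hz theta carrier
coupled = (1 + theta) * np.sin(2 * np.pi * 40 * t)   # gamma amplitude rides theta
uncoupled = np.sin(2 * np.pi * 40 * t)               # flat gamma amplitude

print(f"coupled:   {coupling_index(theta, coupled):.3f}")    # ~0.5
print(f"uncoupled: {coupling_index(theta, uncoupled):.3f}")  # ~0.0
```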

Formally, this can be modeled as a cascade of temporal integration kernels with progressively longer time constants. David Poeppel's Asymmetric Sampling in Time hypothesis captures this intuition: left hemisphere auditory cortex preferentially samples at short temporal windows (gamma), while right hemisphere cortex favors longer windows (theta). This hemispheric asymmetry maps neatly onto the well-established left-lateralization of phonological processing and the right hemisphere's role in prosodic and melodic integration.
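
The kernel cascade can be written down directly. In this sketch, a fast feature stream is passed through successive causal exponential integrators whose time constants are chosen, purely for illustration, to approximate phonemic, syllabic, and phrasal windows; each stage sees only the slower structure that survives the previous one.

```python
import numpy as np

def exp_kernel(tau_ms, fs):
    """Causal exponential integration kernel with time constant tau."""
    t = np.arange(0, 5 * tau_ms / 1000, 1 / fs)  # truncate at 5 * tau
    k = np.exp(-t / (tau_ms / 1000))
    return k / k.sum()

fs = 1000
rng = np.random.default_rng(0)
stage = rng.standard_normal(3000)  # stand-in for a fast acoustic feature stream

# Progressively longer windows: ~25 ms (phonemic), ~150 ms (syllabic),
# ~600 ms (phrasal). Variance shrinks as each stage averages more history.
for tau_ms in (25, 150, 600):
    stage = np.convolve(stage, exp_kernel(tau_ms, fs), mode="same")
    print(f"tau = {tau_ms:3d} ms -> output std = {stage.std():.4f}")
```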

What makes this framework computationally significant is that it solves the segmentation problem without requiring an explicit segmentation algorithm. Speech is continuous—there are no reliable acoustic markers between words, and coarticulation smears phoneme boundaries. By entraining oscillatory cycles to the quasi-rhythmic structure of speech, the brain imposes segmentation as an emergent property of its temporal dynamics. The oscillatory hierarchy does not merely reflect linguistic structure; it actively constructs the temporal scaffolding upon which higher-order linguistic computation depends.

Recent magnetoencephalography and electrocorticography studies have provided converging support: neural tracking of linguistic units at each hierarchical level can be dissociated experimentally, and the strength of tracking at the phrase and sentence levels correlates with comprehension accuracy. This is not simply an auditory phenomenon; it is a computational architecture in which the brain's intrinsic oscillatory repertoire has been co-opted, presumably through evolution and development, to parse the temporal statistics of human language.

Takeaway

The brain does not decode language from a single temporal stream. Instead, nested cortical oscillations create a hierarchy of integration windows, each tuned to a different linguistic timescale, effectively deriving the temporal scaffolding of grammatical structure from the physics of neural rhythms.

Predictive Language Models

Comprehension is not a passive process of decoding incoming signals. A growing theoretical consensus holds that the brain processes language through predictive coding—a framework in which cortical circuits continuously generate top-down predictions about forthcoming input and compute prediction errors when those expectations are violated. In the context of language, this means the brain is not merely recognizing words; it is anticipating them, leveraging syntactic, semantic, and pragmatic context to narrow the space of probable continuations before acoustic evidence arrives.

The mathematical formulation is rooted in hierarchical Bayesian inference. At each level of the cortical hierarchy, generative models encode prior probabilities over linguistic representations—phonemes, words, syntactic structures. These priors are propagated downward as predictions, while bottom-up signals carry only the residual error between predicted and actual input. The computational consequence is profound: when prediction is accurate, neural responses to expected words are suppressed, reducing metabolic cost and freeing processing resources. When prediction fails—as with semantically anomalous or syntactically surprising words—large prediction error signals propagate upward, triggering model revision.
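
A minimal linear version of this message passing fits in a few lines. The weights, dimensions, and learning rate below are arbitrary illustrative choices; the point is the division of labor, with predictions flowing down, residual errors flowing up, and beliefs settling by gradient descent on the error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Generative model: a higher level holds a belief mu over hidden causes
# and predicts the input as W @ mu. Only the residual error is sent up.
W = 0.5 * rng.standard_normal((8, 3))               # hypothetical generative weights
true_cause = np.array([1.0, -0.5, 0.2])
x = W @ true_cause + 0.05 * rng.standard_normal(8)  # noisy "sensory" input

mu = np.zeros(3)   # prior belief: nothing in particular expected
lr = 0.1
for _ in range(200):
    prediction = W @ mu             # top-down prediction
    error = x - prediction          # bottom-up prediction error
    mu += lr * (W.T @ error)        # belief update driven only by the error

print("recovered cause:", np.round(mu, 2))   # ~ [1.0, -0.5, 0.2]
print("residual error:", round(float(np.linalg.norm(x - W @ mu)), 4))
```

Stacking such levels, with each level's belief serving as input to the one above, yields the hierarchy described here, and response suppression falls out directly: when the belief is accurate, the upward-flowing error is small.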

This framework provides a principled explanation for some of the most robust findings in psycholinguistics and cognitive electrophysiology. The N400, an event-related potential peaking around 400 milliseconds post-stimulus, grows larger as a word's contextual predictability falls; the predictive coding account interprets it as a semantic prediction error signal. The P600, traditionally associated with syntactic anomalies, may reflect the cost of revising structural predictions. Under this interpretation, the classic distinction between semantic and syntactic processing dissolves into a unified architecture of hierarchical prediction and error correction.

Crucially, the brain's predictive models are not limited to simple word-level cloze probabilities. They encode structured predictions—anticipating not just which word is likely, but its syntactic category, its thematic role, and even its phonological form. Evidence from pre-activation studies shows that the brain begins retrieving the phonological properties of a predicted word before that word is encountered, implying that prediction operates over rich, multi-feature representations. This aligns with active inference formulations in which perception and action share a common generative model optimized to minimize free energy.

The computational advantages are substantial. Prediction effectively compresses the information bandwidth required for language comprehension. Rather than processing each word de novo, the system need only encode deviations from expectation, a principle information theory formalizes as minimum description length. This may explain why language comprehension can proceed at rates of 150–250 words per minute with remarkably low error rates: the brain is not processing all of the information in the signal, only the information that matters, namely the surprisal.
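
Surprisal, the negative log probability of a word given its context, makes this concrete. The toy bigram model below, built from a hypothetical few-clause corpus with add-alpha smoothing (real estimates would come from large corpora or neural language models), shows a predictable continuation carrying few bits and an unexpected one carrying many; this is the graded quantity the N400 is argued to track.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; purely illustrative.
corpus = "the dog bit the man . the man saw the dog . the dog ran".split()

bigrams = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    bigrams[prev][word] += 1

def surprisal(prev, word, alpha=0.1):
    """Surprisal -log2 P(word | prev) under add-alpha smoothing."""
    vocab_size = len(set(corpus))
    counts = bigrams[prev]
    p = (counts[word] + alpha) / (sum(counts.values()) + alpha * vocab_size)
    return -math.log2(p)

# A frequent continuation carries little information; a rare one carries
# a lot -- mirroring graded N400 amplitudes.
print(f"surprisal('the' -> 'dog') = {surprisal('the', 'dog'):.2f} bits")
print(f"surprisal('the' -> 'ran') = {surprisal('the', 'ran'):.2f} bits")
```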

Takeaway

The brain comprehends language not by passively decoding each word, but by continuously predicting what comes next and processing mainly the errors. Comprehension, at its core, is inference—and the signal that drives understanding is surprise.

Compositional Semantics Implementation

Perhaps the deepest computational puzzle in the neuroscience of language is compositionality: the ability to combine a finite set of word meanings into a potentially infinite set of sentence meanings according to systematic structural rules. We understand the dog bit the man differently from the man bit the dog, despite identical word sets. Any neural account of language must explain how the brain implements structure-sensitive semantic composition—and this remains the field's most formidable open problem.

Several theoretical proposals have been advanced. One influential framework draws on tensor product representations, originally developed by Paul Smolensky. In this formalism, compositional structures are encoded as the outer product of role vectors and filler vectors in a high-dimensional neural state space. The word dog bound to the agent role produces a different distributed representation than dog bound to the patient role, and the sentence meaning is the superposition of these role-filler bindings. This scheme preserves both constituent identity and structural position—the two requirements for genuine compositionality—while remaining implementable in neural population codes.
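
A toy tensor product encoder makes the structure-sensitivity explicit. In the sketch below, random vectors stand in for neural population codes (the dimensionality and vocabulary are arbitrary assumptions); the two word orders of the example sentence yield distinct matrices, and role-specific fillers can be approximately recovered by projection.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64  # hypothetical dimensionality of the neural state space

# Random filler vectors (word meanings) and role vectors (structural slots).
fillers = {w: rng.standard_normal(d) for w in ("dog", "man", "bit")}
roles = {r: rng.standard_normal(d) for r in ("agent", "action", "patient")}

def encode(bindings):
    """Sentence meaning: superposition of role-filler outer products."""
    return sum(np.outer(roles[r], fillers[w]) for r, w in bindings)

s1 = encode([("agent", "dog"), ("action", "bit"), ("patient", "man")])
s2 = encode([("agent", "man"), ("action", "bit"), ("patient", "dog")])

def decode(sentence, role):
    """Unbind a role by projecting onto its role vector (approximate)."""
    est = roles[role] @ sentence / (roles[role] @ roles[role])
    sims = {w: np.dot(est, v) / (np.linalg.norm(est) * np.linalg.norm(v))
            for w, v in fillers.items()}
    return max(sims, key=sims.get)

print(decode(s1, "agent"), decode(s2, "agent"))  # expected: dog man
```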

An alternative approach draws on vector symbolic architectures, including holographic reduced representations and multiply-add-permute models. These frameworks use high-dimensional vectors with specific algebraic operations—circular convolution, element-wise multiplication, permutation—to bind and unbind semantic constituents. The appeal is their neurobiological plausibility: high-dimensional distributed representations are ubiquitous in cortex, and the binding operations can be approximated by known neural mechanisms including phase-synchrony and burst-timing codes.
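
A holographic-reduced-representation sketch shows the appeal: binding and unbinding reduce to Fourier-domain operations, and, unlike the tensor product, the bound representation keeps the same dimensionality as its parts. All parameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 1024  # HRRs need high dimensionality for clean unbinding

def cconv(a, b):
    """Circular convolution: the HRR binding operation."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def ccorr(a, b):
    """Circular correlation: the approximate inverse (unbinding)."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

vec = lambda: rng.standard_normal(d) / np.sqrt(d)  # ~unit-norm random vectors
roles = {r: vec() for r in ("agent", "patient")}
fillers = {w: vec() for w in ("dog", "man")}

# "the dog bit the man": superpose the bound role-filler pairs.
sentence = cconv(roles["agent"], fillers["dog"]) + \
           cconv(roles["patient"], fillers["man"])

# Unbinding the agent role yields a noisy copy of 'dog'.
probe = ccorr(roles["agent"], sentence)
for w, v in fillers.items():
    cos = np.dot(probe, v) / (np.linalg.norm(probe) * np.linalg.norm(v))
    print(f"similarity to {w}: {cos:+.2f}")
```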

Empirically, the neural correlates of semantic composition are beginning to be localized. Functional imaging studies consistently implicate the left anterior temporal lobe and angular gyrus in compositional operations—regions that activate more strongly for meaningful two-word phrases (red boat) than for non-compositional controls (xkq boat). Intracranial recordings suggest that composition involves a transient increase in inter-regional information transfer, potentially mediated by theta-band coherence between temporal and frontal sites. This fits the picture of composition as a dynamic binding operation coordinated across distributed cortical networks.

The theoretical challenge that persists is variable binding—the brain's apparent ability to assign arbitrary fillers to arbitrary roles in real time, without dedicated hardware for each possible combination. Classical symbolic architectures handle this trivially with registers and pointers, but neural systems have no obvious equivalent. Solving the variable binding problem in a neurally realistic framework would represent a major advance not just for language neuroscience, but for understanding how the brain implements any form of systematic, productive thought. It is, in a deep sense, the problem of explaining how neural matter thinks in propositions.

Takeaway

Compositionality—the ability to build infinite meanings from finite parts according to structural rules—is what makes human language truly powerful. Explaining how neural circuits implement this remains the field's hardest open problem, and solving it would illuminate not just language, but the nature of thought itself.

The three frameworks examined here—hierarchical temporal processing, predictive coding, and compositional semantics—are not independent modules. They are interlocking components of what must ultimately be a unified computational architecture for language. Oscillatory hierarchies provide the temporal scaffold. Predictive inference supplies the computational logic. Compositional operations generate the structured meanings that predictions operate over.

What emerges is a picture of language processing as perhaps the brain's most revealing computational signature—a system that demands multi-scale temporal integration, real-time Bayesian inference, and dynamic variable binding, all executed within the constraints of biological neural networks operating on millisecond timescales.

The theoretical gaps remain substantial. We lack a complete account of how these three mechanisms are neurally coordinated, and the variable binding problem continues to resist neurobiologically satisfying solutions. But the convergence of computational theory, neuroimaging, and electrophysiology is narrowing the space of viable architectures. The question is no longer whether the brain computes language—but precisely how a system built from neurons achieves the most compositional, predictive, and hierarchically structured computation in the known universe.