Deep learning's empirical success has long outpaced our theoretical understanding. Networks with millions of parameters somehow generalize despite classical theory predicting catastrophic overfitting. Optimization landscapes that should trap gradient descent in local minima instead yield global solutions with surprising regularity. The neural tangent kernel framework offers a partial resolution to these puzzles—by revealing conditions under which neural networks become mathematically equivalent to well-understood kernel methods.
The key insight emerged from studying networks in the infinite-width limit. When hidden layer widths grow without bound, something remarkable happens: the network's behavior becomes governed by a fixed kernel function determined entirely by the architecture at initialization. Training dynamics reduce to kernel regression, and the mysteries of deep learning give way to the tractable mathematics of reproducing kernel Hilbert spaces.
This linearization isn't merely a mathematical curiosity. It explains why overparameterized networks train so easily, why certain architectures learn faster than others, and why networks can interpolate training data while maintaining reasonable generalization. Yet the NTK regime also reveals its own limitations—the very properties that make infinite-width networks analyzable may explain why practical networks, operating far from this limit, achieve capabilities that kernel methods cannot match. Understanding when neural networks behave like linear models illuminates both the power and the boundaries of this theoretical framework.
Infinite Width Limit
Consider a fully-connected network with hidden layers of width n. Initialize weights with variance scaling inversely proportional to layer width—the standard He or Xavier initialization. As n approaches infinity, the Central Limit Theorem applies at each layer: pre-activations become Gaussian processes. The network output, viewed as a function of its inputs, converges to a Gaussian process whose covariance structure depends only on the architecture.
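To make this concrete, here is a minimal numpy sketch under assumptions not stated above (a one-hidden-layer ReLU network f(x) = a·σ(Wx)/√n with standard-normal weights, and two arbitrary toy inputs): across many random initializations, the empirical covariance of the network outputs approaches the analytic arc-cosine expression for the limiting Gaussian process.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def nngp_relu(x1, x2):
    """Analytic infinite-width covariance for one ReLU hidden layer
    (the order-1 arc-cosine kernel of Cho & Saul, 2009)."""
    n1, n2 = np.linalg.norm(x1), np.linalg.norm(x2)
    angle = np.arccos(np.clip(x1 @ x2 / (n1 * n2), -1.0, 1.0))
    return n1 * n2 * (np.sin(angle) + (np.pi - angle) * np.cos(angle)) / (2 * np.pi)

d, width, draws = 3, 1024, 4000
x1 = np.array([1.0, 0.5, -0.3])
x2 = np.array([-0.2, 1.0, 0.8])

# f(x) = a . relu(W x) / sqrt(width), with standard-normal a and W.
outputs = np.empty((draws, 2))
for k in range(draws):
    W = rng.standard_normal((width, d))
    a = rng.standard_normal(width)
    outputs[k, 0] = a @ relu(W @ x1) / np.sqrt(width)
    outputs[k, 1] = a @ relu(W @ x2) / np.sqrt(width)

print("empirical covariance :", outputs[:, 0] @ outputs[:, 1] / draws)
print("analytic GP kernel   :", nngp_relu(x1, x2))
```

Agreement tightens as the width and the number of draws grow, which is the sense in which the network output converges to a Gaussian process at initialization.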
The neural tangent kernel captures how this function changes during training. Define Θ(x, x') as the inner product of gradients with respect to all parameters, evaluated at inputs x and x'. In the infinite-width limit, this kernel converges to a deterministic function at initialization and—crucially—remains constant throughout training. The NTK depends on activation functions, depth, and architectural choices like skip connections, but not on the particular random initialization.
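The definition translates directly into code. Continuing the same hypothetical one-hidden-layer toy model (all names and sizes invented for illustration), the kernel is the inner product of the full parameter-gradient vectors at two inputs, and its value visibly stabilizes as the width grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def param_gradient(W, a, x):
    """Gradient of f(x) = a . relu(W x) / sqrt(n) with respect to all
    parameters (a, W), flattened into a single vector."""
    n = a.shape[0]
    pre = W @ x
    act = np.maximum(pre, 0.0)
    grad_a = act / np.sqrt(n)                                    # df/da_i
    grad_W = ((pre > 0) * a)[:, None] * x[None, :] / np.sqrt(n)  # df/dW_ij
    return np.concatenate([grad_a, grad_W.ravel()])

def empirical_ntk(W, a, x1, x2):
    """Theta(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>."""
    return param_gradient(W, a, x1) @ param_gradient(W, a, x2)

x1 = np.array([0.6, -1.0, 0.4])
x2 = np.array([1.0, 0.3, -0.5])
for n in (100, 1_000, 10_000, 100_000):
    W = rng.standard_normal((n, x1.size))
    a = rng.standard_normal(n)
    print(f"width {n:>7}: Theta(x1, x2) = {empirical_ntk(W, a, x1, x2):.4f}")
```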
For a single hidden layer with ReLU activation, the NTK takes a closed form involving the arc-cosine kernel. Deeper networks yield kernels computed recursively, where each layer's contribution builds on the previous. This compositional structure explains why depth matters: deeper NTKs capture increasingly complex input relationships, with each layer performing a specific transformation of the similarity measure inherited from below.
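A hedged sketch of that recursion for fully-connected ReLU networks, in the form popularized by Jacot et al. (2018) and Arora et al. (2019). Note that it uses the He-style normalization c_σ = 2, so its constants differ from the unnormalized toy model above, and `depth` here counts hidden layers.

```python
import numpy as np

def relu_ntk(x1, x2, depth):
    """Analytic NTK of a fully-connected ReLU network with `depth` hidden
    layers, NTK parameterization with c_sigma = 2 (He-style scaling)."""
    s11, s22, s12 = x1 @ x1, x2 @ x2, x1 @ x2   # layer-0 covariances
    ntk = s12                                   # Theta^(0)
    for _ in range(depth):
        norm = np.sqrt(s11 * s22)
        angle = np.arccos(np.clip(s12 / norm, -1.0, 1.0))
        # Closed forms for c_sigma * E[relu(u) relu(v)] and
        # c_sigma * E[relu'(u) relu'(v)], with (u, v) Gaussian under the
        # previous layer's covariance.
        s12 = norm * (np.sin(angle) + (np.pi - angle) * np.cos(angle)) / np.pi
        s12_dot = (np.pi - angle) / np.pi
        # c_sigma = 2 keeps the diagonal entries s11, s22 unchanged layer to layer.
        ntk = s12 + ntk * s12_dot   # Theta^(h) = Sigma^(h) + Theta^(h-1) * Sigma_dot^(h)
    return ntk

x1 = np.array([0.6, -1.0, 0.4])
x2 = np.array([1.0, 0.3, -0.5])
for L in (1, 2, 4, 8):
    print(f"{L} hidden layer(s): NTK(x1, x2) = {relu_ntk(x1, x2, L):.4f}")
```

Each pass through the loop applies the same arc-cosine transformation to the similarity inherited from the layer below, which is the compositional structure described above.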
The correspondence with kernel methods becomes exact under gradient flow dynamics. If we train with an infinitesimal learning rate, the network output evolves as f(x, t) = f(x, 0) + Σᵢ αᵢ(t) Θ(x, xᵢ), where the coefficients αᵢ(t) follow linear dynamics determined by the kernel matrix and the training residuals. This is precisely kernel regression with the NTK as the kernel. Convergence guarantees follow from kernel theory: if the NTK matrix is positive definite, gradient descent drives the training loss to zero at an exponential rate.
The NTK's spectrum determines how quickly different functions are learned. Eigencomponents with large eigenvalues are fitted quickly; a component with eigenvalue λ decays on a timescale proportional to 1/λ, so directions with small eigenvalues take far longer to learn. This spectral bias toward smooth functions explains why networks learn low-frequency patterns before high-frequency details, a phenomenon observed empirically long before the theoretical explanation emerged. Architecture choices that modify the NTK's spectral properties directly shape what the network can efficiently learn.
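The following sketch ties the dynamics and the spectrum together under assumptions of my own choosing (a toy regression set, a single wide ReLU layer whose empirical NTK stands in for the limiting kernel, unit learning rate, and f(·, 0) taken to be zero for simplicity). The training residual is rotated into the kernel's eigenbasis, where each component decays as e^(−λₖt):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data and a wide one-hidden-layer ReLU net at initialization.
m, d, width = 20, 3, 4096
X = rng.standard_normal((m, d))
y = np.sin(X @ np.array([1.0, -2.0, 0.5]))
W = rng.standard_normal((width, d))
a = rng.standard_normal(width)

# Jacobian of f(x_j) = a . relu(W x_j) / sqrt(width) w.r.t. (a, W); the
# empirical NTK Gram matrix is J @ J.T.
pre = X @ W.T
J = np.concatenate(
    [np.maximum(pre, 0.0) / np.sqrt(width),
     (((pre > 0) * a)[:, :, None] * X[:, None, :]).reshape(m, -1) / np.sqrt(width)],
    axis=1)
K = J @ J.T

# Linearized gradient flow: with f(X, 0) = 0, the residual decays
# component-wise as exp(-lambda_k * t) in the kernel eigenbasis.
lam, U = np.linalg.eigh(K)
coeffs = U.T @ (-y)                     # initial residual f(X, 0) - y, rotated
for t in (0.0, 0.1, 1.0, 10.0, 100.0):
    res = np.linalg.norm(np.exp(-lam * t) * coeffs)
    print(f"t = {t:6.1f}: ||f(t) - y|| = {res:.4f}")

# Spectral bias: the slowest direction takes roughly lambda_max / lambda_min
# times longer to fit than the fastest one.
print("eigenvalue range:", lam.min(), "to", lam.max())
```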
Takeaway: At infinite width, a neural network's evolution during training is entirely characterized by a fixed kernel function determined at initialization—transforming the mysteries of deep learning into the tractable mathematics of kernel methods.
Lazy Training Regime
The NTK framework reveals a counterintuitive phenomenon: in sufficiently wide networks, weights barely move from their initialized values, yet the network achieves zero training error. This lazy training regime occurs because individual weight changes scale as O(1/√n) with width, while their collective effect on the output remains O(1). The network function changes substantially even as each parameter stays nearly fixed.
Quantifying this: if θ₀ denotes initial parameters and θ* the trained parameters, the relative change ||θ* - θ₀|| / ||θ₀|| vanishes as width increases. Yet the function-space distance ||f(·, θ*) - f(·, θ₀)|| remains finite: the network moves to the kernel solution that fits the residual between the targets and its initial outputs. The network explores a tiny neighborhood in parameter space while traversing a meaningful trajectory in function space.
This separation of scales explains several empirical observations. Wide networks train faster because their NTK matrices have larger minimum eigenvalues—the condition number improves with width. Optimization becomes easier not because the loss landscape simplifies geometrically, but because the effective dynamics become linear. Gradient descent cannot get stuck in spurious local minima because, in the linearized regime, there are none.
The lazy regime also clarifies why different parameterizations matter. The standard parameterization (scaling output by 1/√n) yields the lazy regime where NTK stays constant. The mean field parameterization (scaling by 1/n) allows weights to change substantially, enabling feature learning beyond what any fixed kernel can represent. The choice is not merely technical—it determines whether your network operates as a kernel method or something fundamentally more powerful.
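A minimal sketch of what the two scalings do already at initialization (toy one-hidden-layer model, my own naming): with 1/√n the output has order-one variance, the regime in which the linearization around initialization is accurate, while with 1/n the output concentrates at zero, so producing any nonzero function requires individual weights to move appreciably, which is the opening for feature learning.

```python
import numpy as np

rng = np.random.default_rng(4)

def init_output(x, width, scaling):
    """One-hidden-layer ReLU output at initialization under two scalings:
    'ntk'        -> f(x) = a . relu(W x) / sqrt(n)   (lazy / kernel regime)
    'mean_field' -> f(x) = a . relu(W x) / n         (feature-learning regime)"""
    W = rng.standard_normal((width, x.size))
    a = rng.standard_normal(width)
    h = np.maximum(W @ x, 0.0)
    return a @ h / np.sqrt(width) if scaling == "ntk" else a @ h / width

x = np.array([1.0, -0.5, 0.25])
for n in (100, 10_000, 1_000_000):
    print(f"width {n:>9}: ntk {init_output(x, n, 'ntk'):+.4f}   "
          f"mean-field {init_output(x, n, 'mean_field'):+.4f}")
```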
Detecting the lazy regime in practice requires examining weight movement relative to initialization. If trained weights remain within a small ball around initialization—small meaning O(1/√n) in normalized distance—the NTK approximation holds. Modern architectures operating at practical widths often violate this condition, suggesting they benefit from phenomena the NTK framework cannot capture. The lazy regime provides a baseline: understanding what networks do when constrained to linear dynamics illuminates what additional capabilities emerge when they escape this constraint.
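A hedged sketch of this diagnostic on a toy problem (one-hidden-layer ReLU net in the 1/√n parameterization, full-batch gradient descent, every hyperparameter invented for illustration): as width grows, the relative parameter movement shrinks roughly like 1/√n even as the fit improves, which is the signature of the lazy regime. A network whose measured ratio stays large is operating outside it.

```python
import numpy as np

rng = np.random.default_rng(3)

X = rng.standard_normal((16, 4))
y = np.sin(X @ np.array([1.0, -0.5, 0.3, 0.8]))

def train(width, X, y, steps=5000, lr=0.2):
    """Full-batch gradient descent on f(x) = a . relu(W x) / sqrt(n);
    returns relative parameter movement and final training MSE."""
    n, m = width, len(y)
    W = rng.standard_normal((n, X.shape[1]))
    a = rng.standard_normal(n)
    W0, a0 = W.copy(), a.copy()
    for _ in range(steps):
        pre = X @ W.T                              # (m, n) pre-activations
        act = np.maximum(pre, 0.0)
        r = act @ a / np.sqrt(n) - y               # residuals f(X) - y
        grad_a = act.T @ r / (np.sqrt(n) * m)
        grad_W = ((r[:, None] * (pre > 0) * a[None, :]).T @ X) / (np.sqrt(n) * m)
        a -= lr * grad_a
        W -= lr * grad_W
    moved = np.sqrt(np.sum((W - W0) ** 2) + np.sum((a - a0) ** 2))
    start = np.sqrt(np.sum(W0 ** 2) + np.sum(a0 ** 2))
    mse = np.mean((np.maximum(X @ W.T, 0.0) @ a / np.sqrt(n) - y) ** 2)
    return moved / start, mse

for width in (64, 256, 1024, 4096):
    rel, mse = train(width, X, y)
    print(f"width {width:>5}: ||theta-theta0||/||theta0|| = {rel:.4f}   mse = {mse:.5f}")
```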
Takeaway: In the lazy regime, optimization succeeds not because networks navigate complex landscapes, but because the dynamics become linear—revealing that sufficiently wide networks solve a fundamentally different, simpler problem than their finite-width counterparts.
Limitations and Departures
The NTK framework's elegance comes at a cost: it cannot explain many phenomena central to deep learning's success. Most fundamentally, kernel methods—including infinite-width networks—suffer from a curse of dimensionality in their sample complexity. They require exponentially many samples to learn functions that finite-width networks learn from modest data. Something beyond the kernel regime must explain practical networks' data efficiency.
That something is feature learning. Finite-width networks adapt their internal representations during training, effectively learning a data-dependent kernel rather than relying on a fixed one. The NTK stays constant by construction; real networks develop increasingly useful features as training progresses. This representational adaptation—precisely what the lazy regime prohibits—appears essential for strong performance on complex tasks.
Empirical evidence for departure from the NTK regime is extensive. Networks trained to low loss show weight movement that shrinks with width far more slowly than the lazy regime predicts. Intermediate-layer representations change qualitatively during training, developing structure absent at initialization. Transfer learning works: features learned on one task improve performance on others, which is hard to reconcile with the NTK picture, in which learning amounts to fitting a linear readout on features fixed at initialization.
The gap between NTK predictions and practical performance widens with depth and task complexity. Shallow networks on simple tasks often match NTK predictions closely. Deep networks on vision or language tasks show systematic deviations: faster learning, better generalization, and qualitatively different solutions than any kernel method achieves. Recent theoretical work on rich regimes and maximal update parameterizations attempts to characterize these departures, though a complete theory remains elusive.
Understanding the NTK's limitations sharpens our questions. If kernel methods cannot explain deep learning's success, what can? The answer likely involves understanding how finite-width effects enable efficient feature learning, how depth creates useful representational structure, and why particular architectures discover transferable features. The NTK provides a precisely characterized null hypothesis—what networks would do if they were merely high-dimensional linear models. Explaining what they do beyond this baseline remains the central challenge of deep learning theory.
Takeaway: The NTK framework precisely characterizes what it cannot explain: feature learning, data efficiency, and representation adaptation. In doing so, it clarifies that the essence of deep learning's power lies exactly in departing from the kernel regime.
The neural tangent kernel framework achieved something rare in deep learning theory: an exact characterization of network behavior under limiting conditions. By showing that infinite-width networks reduce to kernel methods, it revealed why overparameterized networks train easily, explained the spectral bias toward smooth functions, and established convergence guarantees rooted in classical mathematics.
Yet the framework's very success highlights what remains unexplained. Practical networks operate outside the lazy regime, learning features that no fixed kernel can represent. The gap between kernel predictions and deep learning performance grows with the complexity of tasks where neural networks excel most dramatically.
The NTK thus serves as both foundation and foil—a baseline showing what networks achieve through linear dynamics, against which we measure everything they accomplish by escaping it. The next generation of theory must characterize the mechanisms of feature learning that make finite-width networks fundamentally more powerful than their infinite-width shadows.