Classical deep learning architectures assume regular, grid-like data: pixels on a lattice, tokens in a sequence. Yet the most interesting structures in nature and society refuse such regularity. Molecules, social networks, citation graphs, and protein interactomes live on irregular domains where neighborhoods vary in size and permutation symmetries must be respected.
Graph neural networks emerged to fill this gap, but their early formulations often felt like heuristic extensions of convolutional intuitions. The modern view is more principled. By treating graphs as first-class mathematical objects and demanding equivariance to node permutations, we can derive a remarkably general family of architectures from a small set of axioms.
In this article, we examine three complementary lenses on GNN theory. We begin with the message passing framework, which provides the canonical algorithmic skeleton. We then connect spatial aggregation to spectral graph theory, revealing how convolutions on graphs correspond to filtering on the Laplacian spectrum. Finally, we confront the expressivity question through the Weisfeiler-Lehman hierarchy, which delineates precisely which graph structures a GNN can and cannot distinguish. Together, these perspectives transform graph learning from a collection of tricks into a coherent mathematical discipline.
Message Passing as First Principles
Consider a graph G = (V, E) with node features x_v and optional edge features e_uv. We seek a function that maps this structure to node representations h_v while respecting the permutation symmetry of the vertex set. Any such function, under mild regularity assumptions, admits a decomposition into local aggregation and update operations.
The canonical message passing neural network (MPNN) formalism, introduced by Gilmer and collaborators, captures this decomposition in two equations. At layer k, each node computes m_v^(k) = AGG({ψ(h_u^(k), h_v^(k), e_uv) : u ∈ N(v)}), then updates via h_v^(k+1) = φ(h_v^(k), m_v^(k)). The aggregator AGG must be permutation-invariant; sums, means, and maxima are the usual suspects.
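As a concrete sketch, a single message-passing round can be written in a few lines of NumPy. The function name and weight shapes here are illustrative assumptions: ψ is taken to be a linear map, AGG a sum, and φ a one-layer network over the concatenation of the old state and the aggregated message.

```python
import numpy as np

def mpnn_layer(h, edges, W_msg, W_upd):
    """One round of message passing with sum aggregation.

    h      : (n, d) node features
    edges  : list of directed (u, v) pairs; the message flows u -> v
    W_msg  : (d, d) weights for the linear message function psi
    W_upd  : (2d, d) weights for the update phi on [h_v, m_v]
    """
    m = np.zeros_like(h)
    for u, v in edges:
        m[v] += h[u] @ W_msg                      # permutation-invariant sum
    return np.tanh(np.concatenate([h, m], axis=1) @ W_upd)
```

Because the sum ignores the order in which edges are visited, relabeling the nodes and permuting the rows of h permutes the output rows identically, which is exactly the equivariance the framework demands.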
This framework subsumes a surprising range of architectures. Graph Convolutional Networks correspond to a symmetric-normalized sum aggregator with linear messages. GraphSAGE generalizes the aggregator to include concatenation-based combinations. Graph Attention Networks introduce learned, input-dependent weights α_uv that modulate messages before aggregation.
The elegance of the formulation lies in what it enforces and what it leaves free. Permutation equivariance is baked in structurally. The choice of ψ and φ, however, determines inductive bias, expressivity, and computational cost—decisions that must be made with the target task in mind.
Crucially, k rounds of message passing propagate information across k-hop neighborhoods. This locality is both a strength and a limitation. It mirrors the receptive field logic of CNNs but exposes GNNs to pathologies such as over-smoothing and over-squashing when depth or graph diameter grows.
Takeaway: Message passing is not a design choice but a mathematical consequence of demanding permutation equivariance on local neighborhoods. Every GNN architecture is ultimately a selection of aggregation and update functions within this inescapable template.
The Spectral Perspective
A complementary view emerges when we analyze graph signals through the lens of spectral graph theory. For an undirected graph, the combinatorial Laplacian L = D - A, or its normalized variant L = I - D^(-1/2) A D^(-1/2), is symmetric and positive semi-definite, admitting an eigendecomposition L = UΛU^T.
The eigenvectors U form a graph Fourier basis. A signal x on vertices transforms to x̂ = U^T x, and convolution with filter g becomes pointwise multiplication in the spectral domain: g * x = U g(Λ) U^T x. Small eigenvalues encode smooth, low-frequency structure; large eigenvalues capture oscillatory, high-frequency variation across edges.
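The definitions above translate directly into code. The sketch below (helper names are my own) builds the normalized Laplacian, diagonalizes it, and applies an arbitrary spectral filter g(λ) to a vertex signal.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for an undirected adjacency matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))  # guard isolated nodes
    return np.eye(len(A)) - (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

def spectral_filter(A, x, g):
    """Apply the filter g(lambda) to signal x: U g(Lambda) U^T x."""
    L = normalized_laplacian(A)
    lam, U = np.linalg.eigh(L)        # symmetric PSD, so eigh is appropriate
    x_hat = U.T @ x                   # graph Fourier transform
    return U @ (g(lam) * x_hat)       # filter, then inverse transform
```

Choosing a decaying g, say g(λ) = exp(-5λ), attenuates the high-frequency components and smooths the signal along edges; g ≡ 1 recovers the input exactly, confirming that U is an orthonormal basis.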
Direct spectral filtering is computationally prohibitive because it requires the full eigendecomposition. Defferrard and collaborators circumvented this with Chebyshev polynomial approximations g(L) ≈ Σ θ_k T_k(L̃), where L̃ = 2L/λ_max − I rescales the spectrum into [−1, 1]. The resulting filters are K-localized in the vertex domain while remaining spectrally meaningful.
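The key computational point is that the Chebyshev recurrence T_k(x) = 2x T_{k−1}(x) − T_{k−2}(x) lets the filtered signal be accumulated with sparse matrix-vector products only, never touching the eigendecomposition. A minimal sketch, with an assumed function name and a caller-supplied spectral bound lmax:

```python
import numpy as np

def chebyshev_filter(L, x, theta, lmax=2.0):
    """Compute sum_k theta_k T_k(L_tilde) x via the Chebyshev recurrence.

    L_tilde = 2L/lmax - I rescales the Laplacian spectrum into [-1, 1].
    Only matrix-vector products are needed: no eigendecomposition.
    """
    n = L.shape[0]
    L_t = 2.0 * L / lmax - np.eye(n)
    t_prev, t_curr = x, L_t @ x          # T_0(L_t) x and T_1(L_t) x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_prev, t_curr = t_curr, 2.0 * L_t @ t_curr - t_prev
        out = out + theta[k] * t_curr
    return out
```

With K coefficients, T_k(L̃) involves only powers of L̃ up to k, so the output at node v depends only on its K-hop neighborhood, which is the vertex-domain locality claimed above.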
Kipf and Welling's GCN is the first-order truncation of this construction, with additional renormalization for numerical stability. What appears as a simple spatial aggregator is, viewed spectrally, a low-pass filter that attenuates high-frequency components of the graph signal.
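The resulting propagation rule is compact enough to state in full. The sketch below implements H' = σ(D̃^(−1/2) Ã D̃^(−1/2) H W) with Ã = A + I, the renormalization mentioned above; the function name and the ReLU choice for σ are illustrative.

```python
import numpy as np

def gcn_layer(A, H, W):
    """Kipf-Welling propagation: H' = relu(D~^{-1/2} A~ D~^{-1/2} H W).

    A~ = A + I adds self-loops (the renormalization trick), keeping the
    eigenvalues of the propagation matrix bounded for numerical stability.
    """
    A_t = A + np.eye(len(A))
    d_inv_sqrt = 1.0 / np.sqrt(A_t.sum(axis=1))   # A~ has no zero degrees
    A_hat = (A_t * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ W, 0.0)
```

The symmetric normalization makes D̃^(1/2)·1 an eigenvector of the propagation matrix with eigenvalue 1: a degree-scaled constant signal passes through unchanged, while rougher components are damped, which is the low-pass behavior described above.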
This duality illuminates phenomena otherwise mysterious. Over-smoothing—the collapse of node representations toward a common value under repeated message passing—is precisely the progressive dominance of the eigenspace attached to the smallest Laplacian eigenvalue: each application of a low-pass filter attenuates every other spectral component, until only the smoothest one survives. Understanding this invites architectural remedies: residual connections, spectral regularization, and filters designed to preserve mid-frequency information.
Takeaway: Spatial and spectral formulations are two projections of a single object. Reasoning about GNN behavior often requires switching perspectives, using whichever domain renders the pathology or opportunity most legible.
Expressivity and the Weisfeiler-Lehman Ceiling
A fundamental question: given unbounded parameters and depth, what graphs can a GNN distinguish? The answer, established by Xu et al. and Morris et al., is surprisingly precise. Standard message-passing GNNs are at most as expressive as the one-dimensional Weisfeiler-Lehman (1-WL) color refinement algorithm.
The 1-WL test iteratively refines node colors by hashing each node's color together with the multiset of neighbor colors. Two graphs are deemed potentially isomorphic if their color multisets remain identical across iterations. The test famously fails on non-isomorphic regular graphs of the same degree: it cannot, for instance, distinguish two disjoint triangles from a single 6-cycle, since every node in both graphs sees an identical local structure at every refinement round.
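The refinement loop is short enough to implement directly. In this sketch (function name mine), the "hash" is realized by canonically renumbering the (color, sorted neighbor colors) signatures at each round, and the returned histogram is the color multiset the test compares.

```python
from collections import Counter

def wl_colors(adj, iters=3):
    """1-WL color refinement on a graph given as {node: [neighbors]}.

    Each round, a node's new color is determined by its old color together
    with the multiset of its neighbors' colors; signatures are canonically
    renumbered, which plays the role of the hash. Returns the final color
    histogram, the multiset two graphs are compared on.
    """
    color = {v: 0 for v in adj}
    for _ in range(iters):
        sig = {v: (color[v], tuple(sorted(color[u] for u in adj[v])))
               for v in adj}
        canon = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        color = {v: canon[sig[v]] for v in adj}
    return Counter(color.values())
```

Running this on two disjoint triangles and on a 6-cycle yields identical histograms, exhibiting the failure mode described above, while genuinely different local structures (a star versus a path, say) are separated after one round.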
This bound is not incidental. Any MPNN computes, at each layer, a function of a node's feature and the multiset of its neighbors' features—exactly the information available to 1-WL. If the aggregation or update function is not injective on multisets, the bound tightens further. The Graph Isomorphism Network (GIN) achieves the 1-WL ceiling by using sum aggregation and an MLP update, since sums over countable multisets admit injective encodings.
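A GIN layer is correspondingly simple to write down. The sketch below (function name and the two-layer ReLU MLP are illustrative choices) computes h_v' = MLP((1 + ε) h_v + Σ_{u ∈ N(v)} h_u) for all nodes at once via the adjacency matrix.

```python
import numpy as np

def gin_layer(A, H, W1, W2, eps=0.0):
    """GIN update: h_v' = MLP((1 + eps) h_v + sum_{u in N(v)} h_u).

    The sum (rather than mean or max) preserves multiset multiplicities,
    which is what lets GIN reach the 1-WL expressivity ceiling.
    """
    agg = (1.0 + eps) * H + A @ H            # (1 + eps) h_v + neighbor sum
    return np.maximum(agg @ W1, 0.0) @ W2    # two-layer MLP with ReLU
```

Note that a mean aggregator would map the neighbor multisets {x} and {x, x} to the same value, whereas the sum keeps them distinct; this is precisely the injectivity-on-multisets requirement in the argument above.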
To exceed 1-WL expressivity, architectural innovations are required. Higher-order GNNs operate on k-tuples of nodes, climbing the k-WL hierarchy at exponential computational cost. Subgraph GNNs encode each node by the representation of its surrounding subgraph. Positional and structural encodings, such as Laplacian eigenvectors or random walk features, inject information that is invisible to pure 1-WL.
These results reframe GNN design as a principled trade-off on the expressivity-efficiency frontier. The question is no longer whether an architecture works empirically, but precisely where it sits in the combinatorial hierarchy of graph-distinguishing power.
Takeaway: Expressivity is not monolithic but stratified. Knowing which isomorphism classes your architecture can separate is as essential as knowing the function classes a feedforward network can approximate.
Graph neural networks are best understood not as a single architecture but as a principled response to the geometry of relational data. The message passing framework provides algorithmic structure, spectral theory illuminates their filtering behavior, and the Weisfeiler-Lehman hierarchy bounds their discriminative power.
These three perspectives are mutually reinforcing. A spectral critique may reveal over-smoothing; a WL analysis may diagnose structural blindness; a message passing view may suggest architectural repairs. Mastery of graph learning requires fluency in all three languages, not allegiance to any one.
As the field advances toward equivariant architectures on manifolds, simplicial complexes, and higher categorical structures, the lessons of GNN theory generalize. The core discipline—derive your architecture from symmetry, analyze it through multiple mathematical lenses, and characterize its expressivity rigorously—remains the blueprint for principled geometric deep learning.