Comparing probability distributions sits at the heart of modern machine learning, yet most classical divergences—Kullback-Leibler, total variation, Jensen-Shannon—exhibit pathological behavior when distributions share little support. They saturate, vanish, or explode precisely where we need gradients to flow. This failure mode has motivated a return to a much older mathematical framework, one whose origins predate information theory by nearly two centuries.
Optimal transport, formulated by Gaspard Monge in 1781 and reformulated by Leonid Kantorovich in 1942, asks a deceptively simple question: what is the minimum cost of rearranging one distribution into another? The resulting geometry—the Wasserstein space of probability measures—endows the manifold of distributions with a metric that respects the underlying geometry of the sample space itself.
For machine learning, this geometric sensitivity is transformative. Where KL divergence is blind to the metric structure of the data, Wasserstein distances measure how far mass must travel. This single property unlocks principled approaches to generative modeling, domain adaptation, and structured prediction. In what follows, we examine the Kantorovich formulation, the entropic regularization that made optimal transport computationally tractable, and the algorithmic innovations that have placed it at the center of contemporary deep learning research.
The Kantorovich Formulation and Wasserstein Geometry
Monge's original problem sought a deterministic transport map T pushing a source measure μ onto a target ν while minimizing total transportation cost. The formulation is geometrically elegant but mathematically fragile—such a map need not exist when μ contains atoms that must be split across the target.
Kantorovich's relaxation resolves this by replacing maps with couplings: joint distributions π on the product space whose marginals recover μ and ν. The problem becomes a linear program: minimize the expected cost ∫ c(x,y) dπ(x,y) over the polytope Π(μ,ν) of admissible couplings. Existence follows from weak compactness, and duality yields the celebrated Kantorovich-Rubinstein theorem.
When the cost is a metric raised to power p, the optimal value defines the p-Wasserstein distance Wp(μ,ν). This is a genuine metric on probability measures, and crucially, it metrizes weak convergence on compact spaces. Unlike f-divergences, Wp remains finite and continuous even when supports are disjoint.
The dual formulation reveals deeper structure. The Kantorovich-Rubinstein duality expresses W1 as a supremum over 1-Lipschitz functions—an observation that would later prove central to Wasserstein GANs. For quadratic cost, Brenier's theorem identifies the optimal coupling as the gradient of a convex potential, connecting optimal transport to Monge-Ampère equations and convex analysis.
This geometric perspective transforms how we reason about distributions. The space of measures becomes a Riemannian-like manifold where geodesics correspond to displacement interpolations—continuous deformations that preserve mass while moving it along optimal paths.
TakeawayWhen distributions live in metric spaces, the geometry of their differences matters as much as their statistical separation. Wasserstein distance honors this by measuring transport, not overlap.
Sinkhorn's Algorithm and Entropic Regularization
The linear program defining Wasserstein distance scales as O(n³ log n) for discrete measures of size n—prohibitive for the high-dimensional, large-batch regime of modern deep learning. Cuturi's 2013 reformulation introduced entropic regularization: add ε·H(π) to the objective, where H is the negative entropy of the coupling.
This seemingly minor modification transforms the problem. The regularized objective becomes strictly convex, and its optimum admits a closed-form structure: the optimal coupling is π* = diag(u) K diag(v), where K = exp(-C/ε) is the Gibbs kernel and u, v are scaling vectors enforcing the marginal constraints.
Sinkhorn's algorithm computes these scalings by alternating projections: u ← a / (Kv) and v ← b / (KTu). Each iteration costs O(n²), and the entire procedure is differentiable—a critical property for end-to-end learning. Convergence is geometric, with rate governed by the Hilbert projective metric contraction of the linear maps.
The regularization parameter ε mediates a fundamental tradeoff. Small ε approximates true optimal transport but yields ill-conditioned kernels and slow convergence. Large ε produces blurred, diffuse couplings that approach independent products. Practical applications typically use moderate ε with log-domain stabilization to avoid numerical overflow.
Recent extensions have pushed boundaries further: unbalanced optimal transport relaxes the marginal constraints with KL penalties, multi-marginal formulations handle barycenters and Wasserstein generalizations, and stochastic Sinkhorn variants scale to streaming data through mini-batch estimators.
TakeawayRegularization is not merely a numerical convenience—it can fundamentally restructure a problem's geometry, transforming intractable combinatorics into smooth, differentiable optimization.
Optimal Transport in Generative Modeling and Domain Adaptation
Wasserstein GANs exploit the Kantorovich-Rubinstein duality to replace the Jensen-Shannon divergence of classical GANs with W1. The discriminator becomes a Lipschitz-constrained critic estimating the dual potential, yielding gradients that remain informative even when generator and data distributions are far apart. This eliminates mode collapse mechanisms tied to vanishing gradients and provides a meaningful training signal correlated with sample quality.
Enforcing the Lipschitz constraint has driven substantial algorithmic innovation. Weight clipping (the original WGAN) is crude; gradient penalty methods (WGAN-GP) penalize deviations of the critic's gradient norm from unity; spectral normalization constrains layer-wise operator norms. Each represents a different compromise between fidelity to the dual problem and optimization tractability.
Beyond GANs, optimal transport underlies normalizing flows through Brenier's theorem—the convex potential whose gradient maps source to target is, in principle, a continuous flow learnable through neural parameterizations. Recent work on flow matching and rectified flows operationalizes this connection, training neural networks to approximate the displacement interpolation between distributions.
Domain adaptation problems—where source and target distributions differ but share a learning task—find a natural formulation through optimal transport. By transporting source samples to align with target geometry, classifiers trained on labeled source data generalize to unlabeled target domains. Joint distribution optimal transport extends this to align both features and conditional label distributions simultaneously.
These applications share a common insight: many learning problems can be recast as distribution alignment, and the choice of geometry governing that alignment determines what the model can learn. Optimal transport provides a principled, geometry-aware default.
TakeawayThe metric you choose to compare distributions implicitly defines what your model considers similar. Optimal transport asks the geometry of your data to participate in that judgment.
Optimal transport has matured from a problem in eighteenth-century civil engineering into a foundational toolkit for modern machine learning. Its rise reflects a broader shift: the recognition that geometric structure on the space of distributions is not a luxury but a prerequisite for principled inference.
The Wasserstein framework offers something rare—a mathematically deep theory whose computational machinery, after Sinkhorn's regularization, scales to the problems practitioners actually face. The bridge from Kantorovich duality to GAN training, from Brenier maps to normalizing flows, demonstrates that abstract theorems can yield concrete algorithms when the right computational lens is applied.
What remains open is substantial: optimal transport in high dimensions still suffers from the curse of dimensionality, sliced and projected variants offer partial remedies, and the interplay between regularization and statistical estimation continues to generate active research. For the methodologically curious, this is fertile ground where deep mathematics meets immediate practical consequence.