The Mathematics of Optimal Transport in Machine Learning

6 min read

Classical divergences fail when distributions share little support, motivating geometric alternatives rooted in optimal transport theory.

The Kantorovich formulation relaxes Monge's deterministic problem into a linear program over couplings, yielding the Wasserstein distance as a true metric on probability measures.

Sinkhorn's algorithm uses entropic regularization to convert the cubic linear program into a differentiable matrix-scaling iteration with geometric convergence.

Wasserstein GANs exploit Kantorovich-Rubinstein duality to deliver stable, informative gradients, while Brenier's theorem connects optimal transport to normalizing flows.

Domain adaptation, generative modeling, and structured prediction all benefit from geometry-aware distribution matching that respects the metric structure of the underlying data.

Comparing probability distributions sits at the heart of modern machine learning, yet most classical divergences—Kullback-Leibler, total variation, Jensen-Shannon—exhibit pathological behavior when distributions share little support. They saturate, vanish, or explode precisely where we need gradients to flow. This failure mode has motivated a return to a much older mathematical framework, one whose origins predate information theory by nearly two centuries.

Optimal transport, formulated by Gaspard Monge in 1781 and reformulated by Leonid Kantorovich in 1942, asks a deceptively simple question: what is the minimum cost of rearranging one distribution into another? The resulting geometry—the Wasserstein space of probability measures—endows the manifold of distributions with a metric that respects the underlying geometry of the sample space itself.

For machine learning, this geometric sensitivity is transformative. Where KL divergence is blind to the metric structure of the data, Wasserstein distances measure how far mass must travel. This single property unlocks principled approaches to generative modeling, domain adaptation, and structured prediction. In what follows, we examine the Kantorovich formulation, the entropic regularization that made optimal transport computationally tractable, and the algorithmic innovations that have placed it at the center of contemporary deep learning research.

The Kantorovich Formulation and Wasserstein Geometry

Monge's original problem sought a deterministic transport map T pushing a source measure μ onto a target ν while minimizing total transportation cost. The formulation is geometrically elegant but mathematically fragile—such a map need not exist when μ contains atoms that must be split across the target.

Kantorovich's relaxation resolves this by replacing maps with couplings: joint distributions π on the product space whose marginals recover μ and ν. The problem becomes a linear program: minimize the expected cost ∫ c(x,y) dπ(x,y) over the polytope Π(μ,ν) of admissible couplings. Existence follows from weak compactness, and duality yields the celebrated Kantorovich-Rubinstein theorem.

When the cost is a metric raised to power p, the optimal value defines the p-Wasserstein distance W_p(μ,ν). This is a genuine metric on probability measures, and crucially, it metrizes weak convergence on compact spaces. Unlike f-divergences, W_p remains finite and continuous even when supports are disjoint.

The dual formulation reveals deeper structure. The Kantorovich-Rubinstein duality expresses W₁ as a supremum over 1-Lipschitz functions—an observation that would later prove central to Wasserstein GANs. For quadratic cost, Brenier's theorem identifies the optimal coupling as the gradient of a convex potential, connecting optimal transport to Monge-Ampère equations and convex analysis.

This geometric perspective transforms how we reason about distributions. The space of measures becomes a Riemannian-like manifold where geodesics correspond to displacement interpolations—continuous deformations that preserve mass while moving it along optimal paths.

Takeaway
When distributions live in metric spaces, the geometry of their differences matters as much as their statistical separation. Wasserstein distance honors this by measuring transport, not overlap.

Sinkhorn's Algorithm and Entropic Regularization

The linear program defining Wasserstein distance scales as O(n³ log n) for discrete measures of size n—prohibitive for the high-dimensional, large-batch regime of modern deep learning. Cuturi's 2013 reformulation introduced entropic regularization: add ε·H(π) to the objective, where H is the negative entropy of the coupling.

This seemingly minor modification transforms the problem. The regularized objective becomes strictly convex, and its optimum admits a closed-form structure: the optimal coupling is π* = diag(u) K diag(v), where K = exp(-C/ε) is the Gibbs kernel and u, v are scaling vectors enforcing the marginal constraints.

Sinkhorn's algorithm computes these scalings by alternating projections: u ← a / (Kv) and v ← b / (K^Tu). Each iteration costs O(n²), and the entire procedure is differentiable—a critical property for end-to-end learning. Convergence is geometric, with rate governed by the Hilbert projective metric contraction of the linear maps.

The regularization parameter ε mediates a fundamental tradeoff. Small ε approximates true optimal transport but yields ill-conditioned kernels and slow convergence. Large ε produces blurred, diffuse couplings that approach independent products. Practical applications typically use moderate ε with log-domain stabilization to avoid numerical overflow.

Recent extensions have pushed boundaries further: unbalanced optimal transport relaxes the marginal constraints with KL penalties, multi-marginal formulations handle barycenters and Wasserstein generalizations, and stochastic Sinkhorn variants scale to streaming data through mini-batch estimators.

Takeaway
Regularization is not merely a numerical convenience—it can fundamentally restructure a problem's geometry, transforming intractable combinatorics into smooth, differentiable optimization.

Optimal Transport in Generative Modeling and Domain Adaptation

Wasserstein GANs exploit the Kantorovich-Rubinstein duality to replace the Jensen-Shannon divergence of classical GANs with W₁. The discriminator becomes a Lipschitz-constrained critic estimating the dual potential, yielding gradients that remain informative even when generator and data distributions are far apart. This eliminates mode collapse mechanisms tied to vanishing gradients and provides a meaningful training signal correlated with sample quality.

Enforcing the Lipschitz constraint has driven substantial algorithmic innovation. Weight clipping (the original WGAN) is crude; gradient penalty methods (WGAN-GP) penalize deviations of the critic's gradient norm from unity; spectral normalization constrains layer-wise operator norms. Each represents a different compromise between fidelity to the dual problem and optimization tractability.

Beyond GANs, optimal transport underlies normalizing flows through Brenier's theorem—the convex potential whose gradient maps source to target is, in principle, a continuous flow learnable through neural parameterizations. Recent work on flow matching and rectified flows operationalizes this connection, training neural networks to approximate the displacement interpolation between distributions.

Domain adaptation problems—where source and target distributions differ but share a learning task—find a natural formulation through optimal transport. By transporting source samples to align with target geometry, classifiers trained on labeled source data generalize to unlabeled target domains. Joint distribution optimal transport extends this to align both features and conditional label distributions simultaneously.

These applications share a common insight: many learning problems can be recast as distribution alignment, and the choice of geometry governing that alignment determines what the model can learn. Optimal transport provides a principled, geometry-aware default.

Takeaway
The metric you choose to compare distributions implicitly defines what your model considers similar. Optimal transport asks the geometry of your data to participate in that judgment.

Optimal transport has matured from a problem in eighteenth-century civil engineering into a foundational toolkit for modern machine learning. Its rise reflects a broader shift: the recognition that geometric structure on the space of distributions is not a luxury but a prerequisite for principled inference.

The Wasserstein framework offers something rare—a mathematically deep theory whose computational machinery, after Sinkhorn's regularization, scales to the problems practitioners actually face. The bridge from Kantorovich duality to GAN training, from Brenier maps to normalizing flows, demonstrates that abstract theorems can yield concrete algorithms when the right computational lens is applied.

What remains open is substantial: optimal transport in high dimensions still suffers from the curse of dimensionality, sliced and projected variants offer partial remedies, and the interplay between regularization and statistical estimation continues to generate active research. For the methodologically curious, this is fertile ground where deep mathematics meets immediate practical consequence.