Generative adversarial networks arrived with a deceptively simple premise: pit two neural networks against each other and let competition drive learning. The generator fabricates synthetic samples, the discriminator tries to distinguish real from fake, and through this adversarial game, the generator learns to produce data indistinguishable from the true distribution. The elegance of this framing obscured, for a time, the deep mathematical machinery operating beneath it.

But training GANs has always been notoriously difficult. Mode collapse, oscillating losses, vanishing gradients—these aren't merely engineering inconveniences. They are symptoms of fundamental properties of the optimization landscape GANs inhabit. Understanding why requires moving beyond the intuitive two-player metaphor and into the formal language of divergence measures, Nash equilibria, and optimal transport theory.

This article examines the mathematical foundations of GAN training as minimax optimization over probability distributions. We will derive the original objective's connection to the Jensen-Shannon divergence, analyze the equilibrium conditions that govern convergence, and explore how the Wasserstein distance offers a principled alternative that resolves several pathologies of the original formulation. The goal is not a survey of GAN architectures but a rigorous examination of why the mathematics works when it does—and why it fails when it doesn't.

Jensen-Shannon Divergence

The original GAN objective, as formulated by Goodfellow et al. (2014), is a minimax game: the discriminator D maximizes the expected log-probability of correctly classifying real and generated samples, while the generator G minimizes the same quantity. Formally, the value function is V(D, G) = 𝔼ₓ∼p_data[log D(x)] + 𝔼ₓ∼p_g[log(1 − D(x))]. This expression appears straightforward, but its theoretical implications run deep.

For a fixed generator, the optimal discriminator takes the form D*(x) = p_data(x) / (p_data(x) + p_g(x)). Substituting this optimal discriminator back into the value function yields a quantity that depends solely on the two distributions. After algebraic manipulation, the resulting expression is −log 4 + 2 · JSD(p_data ‖ p_g), where JSD denotes the Jensen-Shannon divergence. The minimax game, at optimality, is equivalent to minimizing the JS divergence between the real and generated distributions.
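This identity is easy to verify numerically. The sketch below is a toy check with NumPy: two arbitrary discrete distributions stand in for p_data and p_g, the optimal discriminator is formed from the density ratio, and the resulting value is compared against −log 4 + 2 · JSD.

```python
import numpy as np

def kl(a, b):
    """KL divergence between discrete distributions (natural log)."""
    mask = a > 0
    return np.sum(a[mask] * np.log(a[mask] / b[mask]))

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q)/2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two arbitrary toy distributions over four outcomes.
p_data = np.array([0.4, 0.3, 0.2, 0.1])
p_g    = np.array([0.1, 0.2, 0.3, 0.4])

# Optimal discriminator for this fixed generator.
d_star = p_data / (p_data + p_g)

# Value function evaluated at the optimal discriminator.
v = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

print(np.isclose(v, -np.log(4) + 2 * jsd(p_data, p_g)))  # True
```

The agreement is exact up to floating point, for any choice of the two distributions.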

The Jensen-Shannon divergence has an appealing geometric interpretation. It is the average of the two KL divergences from each distribution to their mixture m = (p_data + p_g)/2, so JSD(p_data ‖ p_g) = ½ KL(p_data ‖ m) + ½ KL(p_g ‖ m). Unlike the asymmetric KL divergence, JSD is symmetric and bounded between 0 and log 2. This boundedness is both a strength and a weakness—it ensures the divergence is always finite, but it also means that when the supports of p_data and p_g are disjoint, JSD saturates at its maximum value, providing zero useful gradient information.
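The saturation can be seen directly in a small NumPy sketch: two distributions on disjoint bins of a toy grid hit log 2, and moving the second distribution farther away changes nothing.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence for discrete distributions (natural log)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a[a > 0] * np.log(a[a > 0] / b[a > 0]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Distributions on a 10-bin grid with disjoint supports.
p = np.zeros(10); p[0:2] = 0.5           # mass on bins 0 and 1
q = np.zeros(10); q[5:7] = 0.5           # mass on bins 5 and 6
q_far = np.zeros(10); q_far[8:10] = 0.5  # same mass, even farther away

print(np.isclose(jsd(p, q), np.log(2)))      # True: saturated at log 2
print(np.isclose(jsd(p, q_far), np.log(2)))  # True: distance changes nothing
```

Once the supports are disjoint, the divergence is constant in the separation, which is exactly the flatness discussed next.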

This saturation is the root cause of the vanishing gradient problem in early GAN training. When the generator is poorly initialized and produces samples far from the data manifold, the discriminator achieves near-perfect classification with high confidence. The JS divergence is effectively maxed out across a wide region of parameter space, and the generator receives no meaningful signal for improvement. The loss landscape becomes flat precisely where learning needs to begin.

The geometric picture is clarifying: the JS divergence measures overlap between distributions, but real-world data concentrates on low-dimensional manifolds embedded in high-dimensional spaces. Two such manifolds generically have measure-zero intersection. In this regime, the JS divergence is locally constant and its gradient vanishes. The original GAN objective is blind to the distance between non-overlapping distributions—it can tell you they're different, but not how to make them closer.

Takeaway

The original GAN objective is secretly a Jensen-Shannon divergence minimization, and this divergence goes flat when distributions don't overlap—which is exactly the regime where training begins. The objective can detect a problem but cannot point toward the solution.

Equilibrium and Stability

GAN training seeks a Nash equilibrium of the minimax game: a pair (G*, D*) where neither player can improve unilaterally. At the theoretical optimum, the generator perfectly reproduces the data distribution (p_g = p_data) and the discriminator outputs 1/2 everywhere, unable to distinguish real from fake. This equilibrium exists and is unique under mild regularity conditions. The question is whether gradient-based optimization can reach it.
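The equilibrium conditions can be checked in a few lines, using the optimal-discriminator formula from the previous section on a toy discrete distribution.

```python
import numpy as np

p_data = np.array([0.4, 0.3, 0.2, 0.1])
p_g = p_data.copy()   # the generator has matched the data exactly

# The optimal discriminator can no longer do better than chance...
d_star = p_data / (p_data + p_g)
print(d_star)   # [0.5 0.5 0.5 0.5]

# ...and the value function sits at its equilibrium value of -log 4.
v = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))
print(np.isclose(v, -np.log(4)))   # True
```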

Standard GAN training alternates gradient descent steps on the generator with gradient ascent steps on the discriminator; theoretical analyses usually idealize this as simultaneous gradient descent-ascent on the minimax objective. For convex-concave games, such procedures converge reliably. But the GAN game is emphatically not convex-concave. The generator's loss landscape, parameterized by neural network weights, is highly non-convex, and the coupling between the two players introduces rotational dynamics in the joint parameter space.

These rotational dynamics are well-characterized in simple bilinear games, where simultaneous gradient descent orbits the equilibrium rather than converging to it. In the GAN setting, this manifests as oscillatory training behavior: the discriminator learns to exploit a weakness in the generator, the generator adjusts, the discriminator finds a new weakness, and the cycle continues without convergence. The eigenvalues of the Jacobian of the gradient vector field at equilibrium have significant imaginary components, driving these oscillations.
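The orbiting behavior is easy to reproduce in the simplest bilinear game, V(x, y) = x·y, whose unique equilibrium is the origin. A short NumPy simulation (with an arbitrarily chosen step size) shows simultaneous gradient descent-ascent spiraling outward rather than converging:

```python
import numpy as np

# Bilinear game: min over x, max over y of V(x, y) = x * y.
# The unique equilibrium is the origin.
eta = 0.1
x, y = 1.0, 1.0
radii = []
for _ in range(100):
    gx, gy = y, x                        # dV/dx = y, dV/dy = x
    x, y = x - eta * gx, y + eta * gy    # simultaneous descent / ascent
    radii.append(np.hypot(x, y))

# Each step multiplies the distance to equilibrium by sqrt(1 + eta^2),
# so the iterates spiral outward instead of converging.
print(radii[-1] > radii[0])   # True
```

The Jacobian of this game's gradient field has purely imaginary eigenvalues at the origin, which is the rotational structure the text describes in its simplest form.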

Mode collapse is another equilibrium failure, and arguably the most pernicious. Rather than spreading probability mass across the full data distribution, the generator concentrates on a small number of modes that reliably fool the discriminator. This is a local equilibrium strategy for the generator—it minimizes the adversarial loss by being very good at producing a few types of samples rather than mediocre at producing all of them. The minimax formulation provides no penalty for missing modes, only for producing unconvincing ones.

Several stabilization techniques target these dynamics directly. Spectral normalization constrains the discriminator's Lipschitz constant, dampening the oscillatory modes. Two-timescale update rules give the discriminator more optimization steps per generator update, approximating the inner-loop optimality assumption of the theoretical analysis. Gradient penalties regularize the discriminator's behavior in interpolated regions between real and fake samples. Each of these interventions addresses a specific pathology of the underlying optimization geometry, but none eliminates the fundamental tension: GANs require solving a non-convex saddle-point problem, and no general convergence guarantees exist for such problems.
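As one concrete illustration, spectral normalization of a single weight matrix can be sketched with power iteration. This is a simplified NumPy version of the idea; real implementations carry the power-iteration vectors across training steps rather than iterating to convergence each time.

```python
import numpy as np

def spectral_normalize(W, n_iters=100, seed=0):
    """Estimate the largest singular value of W by power iteration and
    divide it out, making the linear map approximately 1-Lipschitz."""
    u = np.random.default_rng(seed).standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v   # Rayleigh-quotient estimate of the top singular value
    return W / sigma

W = np.random.default_rng(1).standard_normal((64, 32))
W_sn = spectral_normalize(W)
print(np.linalg.norm(W_sn, 2))   # ≈ 1.0: spectral norm divided out
```

Bounding every layer's spectral norm bounds the Lipschitz constant of the whole network, which is what dampens the discriminator's most aggressive directions.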

Takeaway

GAN training is a non-convex saddle-point problem, and the Nash equilibrium it seeks is surrounded by rotational dynamics that resist convergence. Stabilization tricks work because they dampen specific pathological modes, not because they resolve the underlying mathematical difficulty.

Wasserstein Distance Alternative

The Wasserstein-1 distance, also called the Earth Mover's distance, offers a fundamentally different way to measure distributional discrepancy. Rather than comparing density ratios, it asks: what is the minimum cost of transporting mass from one distribution to the other? Formally, W₁(p_data, p_g) = inf_{γ ∈ Π(p_data, p_g)} 𝔼_(x,y)∼γ[‖x − y‖], where the infimum ranges over all joint distributions with the correct marginals. This is the Kantorovich formulation of optimal transport.
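In one dimension the infimum has a closed form: the optimal coupling matches sorted samples in order. A minimal empirical estimator, on toy Gaussians with an assumed mean shift of 3:

```python
import numpy as np

def wasserstein_1d(xs, ys):
    """W1 between equal-size empirical 1-D samples: in one dimension the
    optimal transport plan simply matches sorted samples in order."""
    return np.mean(np.abs(np.sort(xs) - np.sort(ys)))

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=10_000)   # stand-in for data samples
ys = rng.normal(3.0, 1.0, size=10_000)   # stand-in for generated samples

print(wasserstein_1d(xs, ys))   # ≈ 3.0: every unit of mass moves distance 3
```

For two equal-variance Gaussians the true W₁ is exactly the difference of means, so the estimate concentrates near 3.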

The critical advantage of the Wasserstein distance is its behavior when distributions have non-overlapping supports. Consider two point masses separated by distance d. The JS divergence is log 2 regardless of d—the divergence sees only that the supports are disjoint. The Wasserstein distance, by contrast, equals d itself. It varies smoothly with the geometric distance between distributions, providing informative gradients everywhere in parameter space, including the early training regime where the generator's output is far from the data manifold.
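The contrast can be made concrete on a discretized line: JSD stays pinned at log 2 as the separation grows, while W₁ (which in 1-D is the L1 distance between the two CDFs) tracks it exactly. A toy NumPy sketch:

```python
import numpy as np

grid = np.linspace(0.0, 10.0, 1001)   # 1-D grid with spacing 0.01
dx = grid[1] - grid[0]

def point_mass(at):
    p = np.zeros_like(grid)
    p[np.argmin(np.abs(grid - at))] = 1.0
    return p

def jsd(p, q):
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a[a > 0] * np.log(a[a > 0] / b[a > 0]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1(p, q):
    # In 1-D, W1 is the L1 distance between the two CDFs.
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * dx

p = point_mass(0.0)
for d in [1.0, 2.0, 5.0]:
    q = point_mass(d)
    print(f"d={d}: JSD={jsd(p, q):.4f}, W1={w1(p, q):.2f}")
# JSD is log 2 ≈ 0.6931 for every d; W1 equals d itself.
```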

Computing the Wasserstein distance directly is intractable for high-dimensional distributions, but the Kantorovich-Rubinstein duality provides a workable alternative. Under this duality, W₁(p_data, p_g) = sup_{‖f‖_L ≤ 1} 𝔼ₓ∼p_data[f(x)] − 𝔼ₓ∼p_g[f(x)], where the supremum is over all 1-Lipschitz functions. The discriminator (now called the critic) is trained to approximate this supremum, and the Lipschitz constraint replaces the sigmoid-bounded classification of the original GAN.
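The dual form suggests a simple sanity check: any 1-Lipschitz function gives a lower bound on W₁, and a good critic tightens it. The sketch below uses toy Gaussians separated by a mean shift of 3, for which W₁ = 3; the critic names and choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, 10_000)   # "data": mean 0
ys = rng.normal(3.0, 1.0, 10_000)   # "generated": mean 3, so W1 = 3 here

# Every 1-Lipschitz critic f yields a lower bound E_data[f] - E_g[f] <= W1.
critics = {"tanh": np.tanh, "sin": np.sin, "-t": lambda t: -t}
gaps = {name: np.mean(f(xs)) - np.mean(f(ys)) for name, f in critics.items()}
for name, gap in gaps.items():
    print(name, gap)

# tanh and sin are 1-Lipschitz but loose critics; f(t) = -t attains the
# supremum for these two equal-variance Gaussians: its gap is ≈ 3.0 = W1.
```

Training the critic network amounts to searching this family of 1-Lipschitz functions for the one that maximizes the gap.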

Enforcing the Lipschitz constraint is the central technical challenge of the Wasserstein GAN (WGAN). The original approach of weight clipping is crude and can bias the critic toward simple functions. Gradient penalty methods, introduced in WGAN-GP, regularize the critic by penalizing deviations from unit gradient norm along interpolations between real and fake samples. Spectral normalization offers an alternative by directly constraining the operator norm of each layer. Both gradient penalties and spectral normalization approximate the Lipschitz condition rather than enforcing it exactly, and the quality of this approximation significantly affects training stability.

The deeper lesson of the Wasserstein formulation extends beyond GANs. It demonstrates that the choice of divergence measure determines the topology of the optimization landscape. The JS divergence induces a topology in which nearby distributions (in the geometric sense) can appear maximally distant. The Wasserstein distance respects the underlying metric structure of the sample space, and this alignment between geometry and optimization is what produces better gradients. Optimal transport provides not just a better loss function but a better way of thinking about distributional comparison in generative modeling.

Takeaway

The Wasserstein distance succeeds where JS divergence fails because it respects the geometry of the sample space. Choosing a divergence measure is choosing the topology of your optimization landscape—and that choice determines whether gradients point somewhere useful or vanish into flatness.

The mathematical arc of GANs traces a recurring theme in machine learning theory: the loss function is not merely a training signal but a geometric statement about what similarity means. The original GAN's implicit minimization of Jensen-Shannon divergence works beautifully when distributions overlap and breaks down exactly when they don't.

The instability of GAN training is not a failure of engineering but a consequence of non-convex minimax optimization and the rotational dynamics inherent to adversarial games. Stabilization methods are geometric interventions—constraining Lipschitz constants, adjusting timescales, regularizing gradient fields.

The shift to Wasserstein distance illustrates a broader principle: when optimization fails, question the metric, not just the optimizer. Optimal transport succeeded because it aligned the mathematical structure of the objective with the geometric structure of the problem. This insight extends far beyond GANs to any setting where we compare, interpolate, or generate probability distributions.