Deep learning optimization faces a fundamental tension. Gradient descent requires a learning rate, but no single rate works everywhere. Too large, and you overshoot minima. Too small, and training crawls. Worse, different parameters need different rates—sparse gradients demand aggressive steps while dense gradients require restraint.
The Adam optimizer, introduced by Kingma and Ba in 2014, addressed this through adaptive moment estimation. Rather than applying uniform updates, Adam maintains per-parameter learning rates that automatically adjust based on gradient history. It combines momentum's acceleration with RMSprop's adaptive scaling, creating an optimizer that handles diverse optimization landscapes remarkably well.
Understanding Adam requires engaging with its mathematical foundations. The algorithm tracks exponential moving averages of gradients and squared gradients, uses these to normalize updates, and applies bias corrections that prove essential for convergence. Each component serves a precise purpose, and their interaction produces the robustness that made Adam the default optimizer for much of modern deep learning.
Moment Estimation: The Dual Tracking System
Adam maintains two exponential moving averages at each step. The first moment estimate tracks the mean of gradients, while the second moment estimate tracks the uncentered variance. These aren't arbitrary choices—they capture complementary information about the optimization landscape.
The first moment update follows the rule: m_t = β₁ · m_{t-1} + (1 - β₁) · g_t, where g_t is the current gradient and β₁ typically equals 0.9. This is momentum—it smooths gradient estimates by weighting recent history, reducing noise and accelerating progress along consistent directions.
The second moment update uses: v_t = β₂ · v_{t-1} + (1 - β₂) · g_t², with β₂ typically 0.999. This tracks gradient magnitude, measuring how large updates have been for each parameter. Parameters receiving consistently large gradients accumulate high second moment estimates.
The final update divides the first moment by the square root of the second moment: θ_t = θ_{t-1} - α · m_t / (√v_t + ε). This normalization is Adam's key insight. Parameters with large gradients get their updates scaled down, while parameters with small gradients get proportionally larger steps.
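Putting the three rules together, here is a minimal NumPy sketch of a single Adam step as described so far. The function name and the toy parameter vector are illustrative, and bias correction is deliberately left out because the next section covers it.

```python
import numpy as np

def adam_step(theta, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update as written above; bias correction is added in the next section."""
    m = beta1 * m + (1 - beta1) * g               # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * g ** 2          # second moment: EMA of squared gradients
    theta = theta - lr * m / (np.sqrt(v) + eps)   # normalized per-parameter update
    return theta, m, v

# One step on a toy two-parameter problem, with state initialized at zero.
theta = np.array([0.5, -1.2])
m, v = np.zeros_like(theta), np.zeros_like(theta)
g = np.array([0.3, -0.1])                         # gradient from the current minibatch
theta, m, v = adam_step(theta, g, m, v)
```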
The mathematical elegance lies in how this handles different regimes. Sparse features that rarely appear receive large updates when they do—their second moment remains small, so the denominator doesn't suppress their learning. Dense features appearing constantly have accumulated second moments that automatically reduce their effective learning rate, preventing oscillation.
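A toy comparison makes this concrete. The gradient streams below are synthetic, chosen only to contrast a parameter that receives a large gradient every step with one that receives the same gradient once every hundred steps; the second-moment denominator damps the first far more than the second.

```python
import numpy as np

def effective_scale(grads, beta2=0.999, lr=1e-3, eps=1e-8):
    """Track lr / (sqrt(v_t) + eps), the per-parameter step scale, over a gradient stream."""
    v = 0.0
    scales = []
    for g in grads:
        v = beta2 * v + (1 - beta2) * g ** 2
        scales.append(lr / (np.sqrt(v) + eps))
    return scales

steps = 1000
dense = np.full(steps, 2.0)                               # the same large gradient every step
sparse = np.where(np.arange(steps) % 100 == 0, 2.0, 0.0)  # the same gradient, once per 100 steps

print("dense  final step scale:", effective_scale(dense)[-1])   # strongly damped by a large v_t
print("sparse final step scale:", effective_scale(sparse)[-1])  # v_t stays small, scale stays large
```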
Takeaway: Adam's power comes from tracking two distinct statistics—gradient direction and gradient magnitude—then using their ratio to automatically calibrate per-parameter learning rates.
Bias Correction Necessity: Fixing the Cold Start
A subtle but critical issue arises from initialization. Both moment estimates start at zero: m₀ = 0 and v₀ = 0. In early training, this biases estimates toward zero, causing problems that the original RMSprop algorithm ignored.
Consider the first step. After one gradient g₁, the first moment becomes m₁ = (1 - β₁) · g₁ = 0.1 · g₁. We've scaled the gradient by 0.1 simply because of initialization, not because of any meaningful signal. With β₂ = 0.999, the second moment is even more suppressed: v₁ = 0.001 · g₁².
The bias correction terms restore proper scaling. The corrected estimates are: m̂_t = m_t / (1 - β₁ᵗ) and v̂_t = v_t / (1 - β₂ᵗ). At t=1, the first moment correction divides by (1 - 0.9) = 0.1, exactly compensating for the initial suppression.
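A sketch of the fully corrected update, again in NumPy with illustrative names, verifies the t = 1 arithmetic above: dividing by (1 - β₁) and (1 - β₂) recovers g₁ and g₁² exactly.

```python
import numpy as np

def adam_step_corrected(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update using the bias-corrected estimates m_hat and v_hat (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)                  # undo the shrinkage from m_0 = 0
    v_hat = v / (1 - beta2 ** t)                  # undo the shrinkage from v_0 = 0
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# At t = 1: m_1 = 0.1 * g_1 and v_1 = 0.001 * g_1**2, and the corrections
# divide by exactly 0.1 and 0.001, recovering g_1 and g_1**2.
g = np.array([0.3])
theta, m, v = adam_step_corrected(np.zeros(1), g, np.zeros(1), np.zeros(1), t=1)
print(m / (1 - 0.9), v / (1 - 0.999))             # -> [0.3] [0.09]
```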
Mathematically, this correction ensures the expected value of the estimates equals the expected value of the true moments. Without correction, E[m_t] = (1 - β₁ᵗ) · E[g], which underestimates by exactly the factor we're correcting. The derivation follows from the geometric series expansion of the exponential moving average.
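Unrolling the recursion makes the claim explicit, assuming the gradient distribution is stationary so that E[g_i] = E[g] at every step:

```latex
% Unrolled EMA and its expectation (stationary gradients assumed)
\begin{aligned}
m_t &= (1-\beta_1)\sum_{i=1}^{t}\beta_1^{\,t-i}\,g_i
    && (\text{unrolling the recursion with } m_0 = 0)\\
\mathbb{E}[m_t] &= (1-\beta_1)\sum_{i=1}^{t}\beta_1^{\,t-i}\,\mathbb{E}[g]
    = (1-\beta_1)\cdot\frac{1-\beta_1^{t}}{1-\beta_1}\cdot\mathbb{E}[g]
    = (1-\beta_1^{t})\,\mathbb{E}[g]
\end{aligned}
```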
As t grows large, (1 - β₁ᵗ) approaches 1, and the correction becomes negligible. But in early training—precisely when optimization decisions matter most—the correction prevents artificially suppressed or inflated steps. This seemingly minor fix often makes the difference between successful training and divergence.
Takeaway: Exponential moving averages initialized at zero systematically underestimate true moments in early training; bias correction compensates exactly for this initialization artifact.
Convergence Properties: When Adam Succeeds and Struggles
Adam's convergence guarantees are more nuanced than often appreciated. The original paper proved convergence for convex objectives under specific assumptions, but subsequent work revealed surprising failure modes that practitioners should understand.
For convex optimization with bounded gradients, the original paper claimed a regret bound of O(√T), matching the best known rates for online convex optimization—suggesting that adaptive learning rates need not hurt worst-case performance while providing practical benefits on structured problems.
However, Reddi et al. (2018) demonstrated that Adam can diverge on simple convex problems. The issue stems from the exponential moving average of second moments: past gradient magnitudes can dominate current estimates, preventing necessary adaptation. They proposed AMSGrad, which maintains the maximum of past second moment estimates, restoring convergence guarantees.
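The AMSGrad modification is small when written out. The sketch below follows the paper's description of the variant (bias correction omitted, matching the original algorithm statement); the names are illustrative rather than taken from any library.

```python
import numpy as np

def amsgrad_step(theta, g, m, v, v_max, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """AMSGrad: normalize by the running maximum of second-moment estimates,
    so a parameter's effective learning rate can never increase over time."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_max = np.maximum(v_max, v)                  # the only change relative to Adam
    theta = theta - lr * m / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```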
In practice, Adam struggles with several identifiable patterns. Highly non-stationary objectives, where the loss landscape changes dramatically, can leave the moment estimates anchored to outdated gradient information. Sharp, narrow valleys can drive the second moment estimate up, suppressing updates precisely when aggressive moves are needed.
Adam also shows sensitivity to hyperparameters in ways that contradict its "works out of the box" reputation. The default β₂ = 0.999 provides long memory for second moments, which stabilizes training but can prevent adaptation to sudden landscape changes. Reducing β₂ toward 0.99 helps in some settings. Understanding these failure modes transforms Adam from a black-box default to a tool whose behavior you can predict and tune.
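In PyTorch, for example, both adjustments are constructor arguments; the linear model below is only a placeholder.

```python
import torch

model = torch.nn.Linear(10, 1)   # placeholder model

# Shorter second-moment memory (beta2 = 0.99) lets v_t track a shifting landscape faster;
# amsgrad=True switches the denominator to the max-of-past-second-moments variant.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.99), amsgrad=True)
```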
Takeaway: Adam provides strong practical performance and theoretical guarantees for convex problems, but its reliance on gradient history means it can struggle when optimization landscapes shift faster than its moment estimates can track.
Adam's success stems from principled design rather than empirical accident. Each component—momentum for acceleration, adaptive scaling for per-parameter rates, bias correction for proper initialization—addresses a specific optimization challenge. Their combination handles the heterogeneous landscapes of deep learning remarkably well.
Understanding the mathematics reveals when to trust Adam and when to look elsewhere. Stationary problems with diverse parameter scales suit Adam perfectly. Highly non-stationary objectives or those requiring rapid adaptation may benefit from variants like AdamW, AMSGrad, or carefully tuned SGD with momentum.
The deeper lesson extends beyond any single optimizer. Effective optimization requires matching algorithm properties to problem structure. Adam encodes particular assumptions about gradient statistics—assumptions that hold broadly but not universally. Knowing what those assumptions are lets you predict when they'll break.