Volatility forecasting remains one of the most consequential problems in quantitative finance, yet the gap between theoretical elegance and practical predictive power has never been wider. For decades, the GARCH family of models served as the workhorse of conditional variance estimation, providing a parsimonious framework that captured key stylized facts—volatility clustering, mean reversion, and leverage effects. These models earned their place in institutional risk management systems and remain embedded in regulatory frameworks like Basel's internal models approach.
But the landscape has shifted dramatically. The proliferation of high-frequency data has enabled realized volatility measures that bypass the latent-variable problem entirely, offering ex-post measurements of quadratic variation with unprecedented precision. Simultaneously, deep learning architectures—particularly Long Short-Term Memory networks—have entered the volatility forecasting arena, promising nonlinear pattern recognition that econometric models cannot replicate by construction.
The practitioner now faces a genuine model selection dilemma. Each approach embeds fundamentally different assumptions about the data-generating process, demands different infrastructure, and fails in different regimes. This article develops a comparative framework across three paradigms—parametric GARCH specifications, realized volatility estimators, and neural network architectures—evaluating not just point forecast accuracy but the entire predictive distribution. The goal is not to crown a winner but to articulate when each approach delivers genuine informational value for risk management and options trading, and when its apparent superiority is merely an artifact of evaluation methodology.
GARCH Family Models: Parsimonious Power and Its Boundaries
The standard GARCH(1,1) specification remains remarkably difficult to beat in many empirical settings, a finding that has persisted since Bollerslev's original 1986 formulation. The model's persistence parameter α + β governs the rate of mean reversion in conditional variance, and for most equity indices this sum hovers near 0.98, implying a half-life of volatility shocks measured in weeks. This single structural insight—that volatility is highly persistent but ultimately mean-reverting—captures the dominant feature of return dynamics and explains why GARCH(1,1) frequently outperforms more complex specifications in out-of-sample forecasting.
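The half-life claim follows directly from the persistence parameter: a deviation of conditional variance from its long-run level decays geometrically at rate α + β per period. A minimal sketch (the parameter values are illustrative, not estimates):

```python
import math

def garch_half_life(alpha: float, beta: float) -> float:
    """Half-life (in periods) of a volatility shock under GARCH(1,1).

    A deviation of conditional variance from its long-run level decays
    geometrically at rate (alpha + beta) per period, so the half-life h
    solves (alpha + beta) ** h = 0.5.
    """
    persistence = alpha + beta
    if not 0.0 < persistence < 1.0:
        raise ValueError("requires 0 < alpha + beta < 1 (covariance stationarity)")
    return math.log(0.5) / math.log(persistence)

# Persistence of 0.98, the ballpark cited for equity indices
print(round(garch_half_life(0.08, 0.90), 1))  # ~34.3 trading days, i.e. about seven weeks
```

At a persistence of 0.98 the half-life is roughly 34 trading days, which is the sense in which shocks decay "in weeks" rather than days.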
Yet the symmetric response embedded in standard GARCH is empirically inadequate. The leverage effect—the well-documented asymmetry whereby negative returns produce larger subsequent volatility than positive returns of equal magnitude—requires extensions. The EGARCH model of Nelson (1991) handles this through an asymmetric response function in log-variance space, offering the additional advantage of ensuring positivity without parameter constraints. The GJR-GARCH model of Glosten, Jagannathan, and Runkle (1993) achieves similar asymmetry through an indicator function on negative innovations. Specification testing between these alternatives demands careful application of likelihood ratio tests, information criteria, and—critically—out-of-sample evaluation, since in-sample fit improvements from added parameters frequently vanish when forecasting.
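The GJR mechanism can be written down compactly: the ARCH coefficient is boosted by an extra term after negative returns. A minimal filter sketch, with illustrative (not estimated) parameter values:

```python
import numpy as np

def gjr_garch_variance(returns, omega, alpha, gamma, beta):
    """One-pass filter of GJR-GARCH(1,1) conditional variance.

    sigma2[t] = omega + (alpha + gamma * I[r[t-1] < 0]) * r[t-1]**2
                      + beta * sigma2[t-1]

    The indicator term adds gamma to the ARCH coefficient after negative
    returns, which is how the specification captures the leverage effect.
    """
    r = np.asarray(returns, dtype=float)
    sigma2 = np.empty(len(r))
    # Initialize at the model-implied unconditional variance
    # (valid for symmetric innovations: E[I(r<0)] = 1/2)
    sigma2[0] = omega / (1.0 - alpha - 0.5 * gamma - beta)
    for t in range(1, len(r)):
        leverage = gamma if r[t - 1] < 0.0 else 0.0
        sigma2[t] = omega + (alpha + leverage) * r[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2

# Simulated daily returns, purely for illustration
rng = np.random.default_rng(0)
r = rng.standard_normal(500) * 0.01
sig2 = gjr_garch_variance(r, omega=1e-6, alpha=0.03, gamma=0.10, beta=0.90)
```

Setting gamma to zero recovers symmetric GARCH(1,1), which makes the likelihood-ratio comparison between the two specifications straightforward.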
Estimation itself introduces subtle challenges that practitioners often underestimate. Maximum likelihood under the assumption of Gaussian innovations produces quasi-maximum likelihood estimators that remain consistent even when the true innovation distribution is non-normal, but efficiency losses can be substantial when tail behavior departs significantly from Gaussianity. Student-t or GED innovations improve efficiency and are standard in modern implementations, but the choice of error distribution directly affects Value-at-Risk and Expected Shortfall calculations—a point where model risk becomes operationally consequential.
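The operational consequence of the error-distribution choice is easiest to see in the VaR formula itself, where the innovation quantile enters directly. A stdlib-only sketch for the Gaussian case (a Student-t version would substitute the variance-scaled t quantile, e.g. via scipy.stats.t.ppf, which is materially larger in the far tail):

```python
from statistics import NormalDist

def gaussian_var(sigma: float, level: float = 0.01) -> float:
    """One-period Value-at-Risk, reported as a positive loss, for a
    zero-mean return with conditional standard deviation sigma and
    Gaussian innovations: VaR = -z(level) * sigma.

    Under Student-t innovations the quantile would be heavier-tailed,
    raising VaR and Expected Shortfall at the same conditional variance.
    """
    z = NormalDist().inv_cdf(level)  # approx. -2.326 at the 1% level
    return -z * sigma

# 1% one-day VaR at a 1.5% conditional daily volatility
print(round(gaussian_var(0.015), 4))
```

The point is that two models agreeing on sigma can still disagree on VaR; the disagreement lives entirely in the innovation quantile.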
The multivariate extension of GARCH models introduces the curse of dimensionality in its most acute form. A full BEKK specification for even 50 assets requires estimating thousands of parameters, rendering direct estimation infeasible. The DCC (Dynamic Conditional Correlation) framework of Engle (2002) addresses this through a two-step procedure—univariate GARCH estimation followed by correlation dynamics—but the computational convenience comes at the cost of efficiency loss and the assumption that correlation dynamics share a common structure across all pairs.
Perhaps the most important limitation of GARCH models for institutional applications is their information set. These models condition exclusively on past returns, ignoring the rich information embedded in options markets, high-frequency data, and cross-sectional signals. The implied volatility surface contains forward-looking information that GARCH cannot incorporate by construction, creating a systematic informational disadvantage relative to approaches that exploit broader data sources. This is not a failure of the framework per se—it is a design boundary that practitioners must understand when positioning GARCH forecasts within a broader risk management architecture.
Takeaway: GARCH models earn their enduring relevance not through sophistication but through parsimony—the discipline of extracting maximum forecasting power from minimal assumptions. Before reaching for complexity, exhaust what a well-specified GARCH(1,1) with asymmetric extensions can deliver.
Realized Volatility: When High-Frequency Data Changes the Game
The fundamental innovation of realized volatility is conceptual before it is statistical: it transforms volatility from a latent variable requiring model-based extraction into a directly observable quantity. By summing squared intraday returns at sufficiently high frequency, we obtain a consistent estimator of the integrated variance over any given interval. This shift—from filtering to measurement—has profound implications for both forecasting and model evaluation, because we can now assess forecast accuracy against a precise benchmark rather than against another model's estimate.
The canonical realized variance estimator, computed as the sum of squared five-minute returns, represents a pragmatic compromise in the bias-variance tradeoff introduced by market microstructure noise. At tick frequency, bid-ask bounce and other microstructure effects inflate the estimator dramatically. At lower frequencies, we sacrifice information and precision. The five-minute convention, while widely adopted, is not optimal in any formal sense. Kernel-based estimators such as the realized kernel of Barndorff-Nielsen, Hansen, Lunde, and Shephard (2008) and the pre-averaging approach of Jacod, Li, Mykland, Podolskij, and Vetter (2009) provide noise-robust alternatives that achieve better rates of convergence while remaining operational at the highest available frequencies.
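The estimator itself is a one-liner once the sampling grid is fixed; the bias-variance tradeoff lives in the choice of sampling step. A minimal sketch on simulated prices (the data and grid are illustrative):

```python
import numpy as np

def realized_variance(prices, step=1):
    """Realized variance from one day of regularly sampled intraday prices.

    `prices` is a 1-D array on a regular grid (here assumed 1-minute);
    `step=5` subsamples to the standard five-minute convention. Sparser
    sampling discards data but attenuates microstructure noise, which is
    the compromise discussed in the text.
    """
    log_p = np.log(np.asarray(prices, dtype=float)[::step])
    returns = np.diff(log_p)
    return float(np.sum(returns ** 2))

# Simulated 1-minute prices for a 390-minute trading day (illustrative only)
rng = np.random.default_rng(1)
prices = 100.0 * np.exp(np.cumsum(rng.standard_normal(391) * 5e-4))
rv_5min = realized_variance(prices, step=5)
rv_1min = realized_variance(prices, step=1)
```

On real tick data the step=1 estimate would typically sit well above the five-minute one because of bid-ask bounce; on this frictionless simulation the two differ only by sampling noise.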
The decomposition of realized variance into continuous and jump components represents a further analytical refinement with direct risk management applications. Bipower variation, which replaces squared returns with products of adjacent absolute returns, converges to integrated variance even in the presence of jumps, enabling the separation of diffusive and discontinuous price movements. Empirical evidence consistently shows that the jump component, while contributing meaningfully to total variation, has limited predictive power for future volatility—it is the continuous component that drives forecastability. This finding has important implications for options pricing models that treat jumps as a separate risk factor.
The HAR (Heterogeneous Autoregressive) model of Corsi (2009) has become the benchmark forecasting framework in the realized volatility literature. By regressing future realized volatility on daily, weekly, and monthly components of past realized volatility, the HAR model captures the long-memory property of volatility through a simple cascade structure inspired by the heterogeneous market hypothesis. Despite its apparent simplicity—it is a restricted AR model—it consistently delivers competitive or superior forecasting performance relative to far more complex specifications, including fractionally integrated models and affine stochastic volatility models.
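The cascade structure is just an OLS regression on three moving averages of past realized volatility, which is why the model is so easy to maintain in production. A minimal sketch using the conventional 1/5/22-day horizons (the simulated series is illustrative):

```python
import numpy as np

def har_features(rv):
    """Build HAR regressors: daily, weekly (5-day), and monthly (22-day)
    averages of past realized volatility, aligned so row t predicts rv[t]."""
    rv = np.asarray(rv, dtype=float)
    rows, targets = [], []
    for t in range(22, len(rv)):
        daily = rv[t - 1]
        weekly = rv[t - 5:t].mean()
        monthly = rv[t - 22:t].mean()
        rows.append([1.0, daily, weekly, monthly])
        targets.append(rv[t])
    return np.array(rows), np.array(targets)

def fit_har(rv):
    """OLS fit of Corsi's HAR-RV: rv[t] = b0 + b1*RV_d + b2*RV_w + b3*RV_m."""
    X, y = har_features(rv)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Illustrative positive realized-volatility path
rng = np.random.default_rng(2)
rv_series = np.exp(rng.standard_normal(300) * 0.3 - 9.0)
coef = fit_har(rv_series)  # [intercept, daily, weekly, monthly]
```

Because the three regressors are overlapping averages of the same series, the model is literally a restricted AR(22)—the restriction being that the 22 lag coefficients collapse onto three cascade weights.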
The infrastructure requirements for realized volatility computation are non-trivial and represent a genuine barrier to adoption outside of well-resourced institutions. Reliable high-frequency data with accurate timestamps, proper handling of overnight returns, exchange holidays, and early closes, and robust procedures for outlier detection and cleaning are prerequisites. For less liquid assets—corporate bonds, emerging market equities, many commodity contracts—the data simply does not support meaningful intraday measurement, constraining the applicability of these methods to the most actively traded instruments. This is not a minor caveat; it defines the practical boundary of the entire paradigm.
Takeaway: Realized volatility's deepest contribution is epistemological—it lets us measure what we previously had to model. But measurement precision means nothing without data quality, and the infrastructure demands mean this approach rewards institutional commitment, not casual implementation.
ML Forecasting: Neural Networks as Nonlinear Volatility Filters
Long Short-Term Memory networks have attracted substantial attention in volatility forecasting precisely because their architecture addresses a structural limitation of traditional econometric models: the inability to capture arbitrary nonlinear dependencies in sequential data. The gating mechanism in LSTM cells—input, forget, and output gates—allows the network to selectively retain or discard information across long time horizons, making it theoretically well-suited to the persistent, regime-dependent dynamics that characterize financial volatility. Empirical studies, including Kim and Won (2018) and Bucci (2020), have documented meaningful improvements over HAR and GARCH benchmarks, particularly during high-volatility episodes.
However, the apparent superiority of neural network forecasts often dissolves under rigorous backtesting. The critical issue is look-ahead bias in hyperparameter selection. Many published studies tune network architecture, learning rates, dropout rates, and training window lengths using information that would not have been available in real time. A proper expanding-window or rolling-window evaluation protocol, where all model decisions—including architecture selection—are made using only past data, frequently narrows or eliminates the performance gap relative to well-specified econometric alternatives. The model complexity that enables nonlinear fitting also enables overfitting, and distinguishing between the two requires discipline in evaluation design.
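The evaluation discipline described above is easy to state as code: at every step, everything the model knows—including, in a full study, its hyperparameters—must come from data strictly before the observation being scored. A skeleton sketch with a deliberately trivial persistence baseline standing in for the fitted model:

```python
import numpy as np

def expanding_window_mse(series, fit, predict, min_train=100):
    """Expanding-window evaluation: at each step t, fit on series[:t]
    only, forecast series[t], and score against the unseen realization.

    `fit` maps a history array to a model object; `predict` maps
    (model, history) to a one-step-ahead forecast. In a real study the
    same restriction must also govern architecture and hyperparameter
    choices, which is the look-ahead channel the text warns about.
    """
    series = np.asarray(series, dtype=float)
    forecasts, realized = [], []
    for t in range(min_train, len(series)):
        model = fit(series[:t])                 # past data only
        forecasts.append(predict(model, series[:t]))
        realized.append(series[t])
    errors = np.array(forecasts) - np.array(realized)
    return float(np.mean(errors ** 2))

# Toy baseline: random-walk ("persistence") forecast of a volatility proxy
rng = np.random.default_rng(3)
proxy = np.abs(rng.standard_normal(200))
mse = expanding_window_mse(
    proxy,
    fit=lambda hist: None,                      # nothing to estimate
    predict=lambda model, hist: hist[-1],       # forecast = last value
)
```

Any neural network evaluated this way must have its architecture search inside the loop (or frozen before it), not tuned on the full sample—that is the protocol that so often shrinks the reported performance gap.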
The interpretability deficit of neural network volatility forecasts presents a genuine operational challenge in institutional settings. When a GARCH model's conditional variance estimate spikes, the source is traceable to specific return innovations and their propagation through estimated parameters. When an LSTM forecast spikes, the causal mechanism is opaque—distributed across thousands of weights with no closed-form expression linking inputs to outputs. For risk management applications where regulatory scrutiny demands model explainability, and for trading applications where understanding why a forecast changed is essential to position sizing, this opacity carries real costs that pure forecast accuracy metrics do not capture.
Hybrid architectures offer a pragmatic synthesis. Feeding realized volatility components, GARCH residuals, or implied volatility features as engineered inputs to neural networks leverages domain knowledge while allowing the network to discover nonlinear interactions that econometric specifications miss. The HAR-LSTM approach—using HAR components as input features rather than raw returns—consistently outperforms both pure HAR and pure LSTM specifications in recent comparative studies. This approach respects the principle that financial time series exhibit known structural regularities that should be encoded, not rediscovered, by the learning algorithm.
The deployment pipeline for neural network volatility models requires infrastructure that extends well beyond the training loop. Model monitoring—detecting concept drift, data distribution shifts, and silent failures—is essential because neural networks degrade unpredictably when the data-generating process changes, as it inevitably does during regime transitions. Ensemble methods that combine neural network forecasts with econometric baselines provide robustness against individual model failure while capturing the complementary strengths of each paradigm. The most sophisticated institutional implementations treat the forecasting problem not as model selection but as model combination, allocating weight dynamically based on recent predictive performance.
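The dynamic-allocation idea admits many weighting schemes; one simple, commonly used choice is weights proportional to inverse recent mean squared error, so models that have forecast well lately earn more weight. A minimal sketch (the window length is an illustrative choice, not a recommendation):

```python
import numpy as np

def inverse_mse_weights(forecasts, realized, window=60):
    """Combination weights proportional to inverse trailing MSE.

    `forecasts` has shape (n_models, n_periods) and `realized` has
    shape (n_periods,). Only the last `window` observations enter the
    score, so the allocation adapts as regimes shift. Weights sum to 1.
    """
    F = np.asarray(forecasts, dtype=float)[:, -window:]
    y = np.asarray(realized, dtype=float)[-window:]
    mse = np.mean((F - y) ** 2, axis=1)
    w = 1.0 / np.maximum(mse, 1e-12)  # guard against a perfect-fit model
    return w / w.sum()

def combined_forecast(next_forecasts, weights):
    """Weighted combination of each model's next one-step forecast."""
    return float(np.dot(weights, np.asarray(next_forecasts, dtype=float)))
```

In practice the GARCH, HAR, and neural forecasts would each supply one row of `forecasts`, and the weights would be recomputed each period before combining.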
Takeaway: Neural networks don't replace financial theory—they extend it. The most effective ML volatility models are those that encode known structure as features and let the network discover what remains, not those that attempt to learn everything from raw data.
The volatility forecasting landscape is not a linear progression from simple to complex—it is a portfolio of complementary approaches, each with distinct informational advantages and operational constraints. GARCH models deliver robust, interpretable baselines requiring minimal infrastructure. Realized volatility methods convert high-frequency data into precise measurements that sharpen both forecasts and forecast evaluation. Neural networks capture nonlinear dynamics that parametric models miss by construction.
The practical insight for institutional implementation is that forecast combination dominates model selection. Allocating weight across paradigms based on recent predictive performance—and adjusting that allocation as market regimes shift—delivers more stable risk management outcomes than commitment to any single methodology. The infrastructure investment is substantial, but the diversification benefit is real and measurable.
Ultimately, the sophistication of the forecasting model matters less than the rigor of the evaluation framework. A mediocre model with honest backtesting outperforms a brilliant model validated on contaminated results. The discipline is in the testing, not the architecture.