Machine learning algorithms produce models that seem to work—until they don't. We train on data, test on held-out samples, and hope generalization holds. But hope is not a foundation for theory. The probably approximately correct (PAC) framework replaces intuition with mathematical guarantees, answering a question that haunted early researchers: when can we prove that learning is possible?
Leslie Valiant introduced PAC learning in 1984, establishing the first rigorous framework for computational learning theory. The brilliance lies in accepting two types of uncertainty explicitly. We cannot guarantee perfect accuracy—hence approximately correct. We cannot guarantee success every time—hence probably. By parameterizing both failure modes, PAC theory transforms vague notions of learnability into precise mathematical statements with quantifiable sample requirements.
This framework does more than satisfy theoretical curiosity. It reveals fundamental limits that no algorithm can overcome, regardless of computational resources or clever engineering. It separates problems that are learnable in principle from those learnable in practice. And it provides the conceptual vocabulary—VC dimension, sample complexity, computational tractability—that underpins modern understanding of generalization. For researchers pushing algorithmic boundaries, PAC theory offers both the tools to analyze new methods and the humility to recognize when problems are provably hard.
Learnability Definitions: Formalizing the Learning Contract
A concept class is PAC-learnable if there exists an algorithm that, given enough examples, produces a hypothesis close to the true concept with high probability. Formally, for any target concept c from class C, any distribution D over the instance space, and any accuracy parameter ε and confidence parameter δ, the learner must output hypothesis h such that the probability of misclassification under D is at most ε, with probability at least 1-δ over the random draw of training examples.
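In symbols, the contract reads as follows (a standard rendering of the definition above, writing err_D(h) for the probability of misclassification under D):

```latex
% Generalization error of hypothesis h against target c under distribution D:
\operatorname{err}_D(h) = \Pr_{x \sim D}\big[h(x) \neq c(x)\big]
% PAC guarantee: for all c in C, all D, and all epsilon, delta in (0,1),
% given m = poly(1/epsilon, 1/delta) i.i.d. examples S ~ D^m, the output h_S satisfies
\Pr_{S \sim D^m}\big[\operatorname{err}_D(h_S) \le \varepsilon\big] \ge 1 - \delta
```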
The definition's power comes from its distribution-free nature. The guarantee must hold for any distribution the adversary chooses—not just well-behaved ones. This adversarial framing explains why PAC bounds often seem pessimistic compared to empirical performance. Real-world distributions have structure that benevolent nature provides but malicious adversaries would deny. The gap between worst-case theory and average-case practice remains a central tension in learning theory.
Two parameters govern the learning contract. The accuracy parameter ε bounds generalization error—how wrong the learned hypothesis can be on unseen data drawn from the same distribution. The confidence parameter δ bounds the probability that learning fails entirely, producing a hypothesis worse than the ε guarantee. Crucially, the sample complexity must be polynomial in 1/ε and 1/δ for efficient learnability. Requiring exponentially many samples renders learning impractical even if theoretically possible.
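A minimal simulation makes the two parameters concrete. The sketch below learns a threshold concept on [0, 1]; the concept class, the uniform distribution, and all constants are illustrative choices, not part of the framework itself:

```python
import numpy as np

def pac_learn_threshold(m, target=0.3, rng=None):
    """Consistent learner for thresholds on [0, 1] (label 1 iff x >= target):
    return the smallest positive example seen."""
    rng = rng if rng is not None else np.random.default_rng()
    x = rng.uniform(0, 1, size=m)              # D = uniform on [0, 1]
    positives = x[x >= target]
    return positives.min() if positives.size else 1.0

# Under the uniform distribution, the error of the learned hypothesis is
# exactly the gap between the learned and true thresholds.  Check the
# (eps, delta) contract empirically over many independent runs:
eps, delta, m = 0.05, 0.05, 200
runs = 1000
failures = sum(pac_learn_threshold(m) - 0.3 > eps for _ in range(runs))
# Theory: failure probability is (1 - eps)^m, about 3.5e-5 here, far below delta.
print(f"runs with error > {eps}: {failures}/{runs}")
```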
The framework distinguishes proper from improper learning. Proper learners must output hypotheses from the same concept class they're trying to learn—if learning decision trees, output a decision tree. Improper learners may output any efficiently evaluable hypothesis, potentially from a more expressive class. This distinction matters profoundly: some concept classes are improperly learnable but not properly learnable, revealing that representational constraints can obstruct learning even when the target is representable.
Realizable versus agnostic settings capture another crucial distinction. Realizable PAC learning assumes the target concept lies exactly in the hypothesis class—a strong assumption rarely satisfied in practice. Agnostic PAC learning drops this assumption, requiring only that the learner compete with the best hypothesis in the class, even if that best hypothesis has nonzero error. Agnostic learning is strictly harder, with different sample complexity bounds that depend on uniform convergence over the hypothesis class.
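The contrast is easiest to see in the guarantees themselves (standard formulations; err_D as before, and both hold with probability at least 1-δ over the sample):

```latex
% Realizable: some hypothesis in H has zero error, and the learner must achieve
\operatorname{err}_D(h) \le \varepsilon
% Agnostic: no perfect hypothesis is assumed; the learner competes with the best in H
\operatorname{err}_D(h) \le \min_{h' \in H} \operatorname{err}_D(h') + \varepsilon
```

The relaxation is what changes the rates: agnostic sample complexity scales with 1/ε² rather than 1/ε.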
Takeaway: PAC learnability requires that for any distribution and any desired accuracy, a polynomial number of samples suffices to probably find an approximately correct hypothesis—making the learning guarantee both distribution-free and computationally meaningful.
Sample Complexity Bounds: Counting the Cost of Certainty
For finite hypothesis classes, sample complexity admits elegant closed-form bounds. To achieve error at most ε with probability at least 1-δ, it suffices to draw m ≥ (1/ε)(ln|H| + ln(1/δ)) samples. The logarithmic dependence on hypothesis class size |H| is remarkable: doubling the number of hypotheses adds only ln(2)/ε additional samples. This explains why massive ensemble methods remain practical despite considering combinatorially many base learners.
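The bound is simple enough to compute directly; a small calculator (function name and example numbers are illustrative):

```python
import math

def finite_class_sample_bound(class_size, eps, delta):
    """Samples sufficient for a consistent learner over a finite class:
    m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(class_size) + math.log(1 / delta)) / eps)

# Doubling |H| adds only ln(2)/eps (about 69 samples at eps = 0.01):
print(finite_class_sample_bound(10**6, 0.01, 0.01))      # 1843
print(finite_class_sample_bound(2 * 10**6, 0.01, 0.01))  # 1912
```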
The derivation follows from a union bound over bad hypotheses: those consistent with the sample yet having true error above ε. For any single bad hypothesis, the probability it survives m independent samples decreases exponentially, at most (1-ε)^m ≤ e^(-εm). Union bounding over all hypotheses gives |H|(1-ε)^m as the probability that any bad hypothesis survives. Setting this below δ and solving yields the sample complexity bound. (Hoeffding's inequality plays the analogous role in the agnostic setting, where no hypothesis is perfectly consistent with the sample.)
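Written out, the argument takes three lines:

```latex
% A hypothesis with true error > epsilon is consistent with one random example
% with probability at most 1 - epsilon, hence with all m examples:
\Pr[\text{a fixed bad } h \text{ survives}] \le (1-\varepsilon)^m \le e^{-\varepsilon m}
% Union bound over at most |H| bad hypotheses, then force the total below delta:
|H| \, e^{-\varepsilon m} \le \delta
\quad\Longleftrightarrow\quad
m \ge \frac{1}{\varepsilon}\Big(\ln|H| + \ln\frac{1}{\delta}\Big)
```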
Infinite hypothesis classes require the VC dimension—the maximum number of points that can be shattered (labeled in all 2^n possible ways) by hypotheses from the class. The fundamental theorem of statistical learning states that a class is PAC-learnable if and only if it has finite VC dimension. Sample complexity scales as O((d/ε)log(1/ε) + (1/ε)log(1/δ)) where d is the VC dimension, replacing the logarithm of class size with a combinatorial measure of complexity.
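Shattering is concrete enough to verify by brute force on small cases. A sketch, with the threshold class and test points as illustrative choices:

```python
def shatters(hypotheses, points):
    """True iff the hypotheses realize all 2^k labelings of the k points."""
    labelings = {tuple(h(x) for x in points) for h in hypotheses}
    return len(labelings) == 2 ** len(points)

# Thresholds on the line (h_t(x) = 1 iff x >= t) have VC dimension 1:
# any single point is shattered, but no pair can receive the labeling (1, 0).
thresholds = [lambda x, t=t: int(x >= t) for t in (-1.0, 0.25, 0.75, 2.0)]
print(shatters(thresholds, [0.5]))        # True
print(shatters(thresholds, [0.1, 0.9]))   # False: (1, 0) is unreachable
```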
Lower bounds prove these dependencies are essentially tight. Any algorithm learning a class of VC dimension d requires Ω(d/ε + log(1/δ)/ε) samples. The matching upper and lower bounds mean that sample complexity is characterized by VC dimension—no clever algorithm can fundamentally improve on these rates. This is a profound statement: geometry of the hypothesis class, not algorithmic ingenuity, determines sample requirements.
Rademacher complexity provides distribution-dependent alternatives to worst-case VC analysis. By measuring how well the hypothesis class correlates with random noise on the actual data distribution, Rademacher bounds can be tighter when the distribution has favorable structure. This connects PAC theory to empirical risk minimization: generalization bounds decompose into empirical error plus a complexity penalty, formalizing the bias-variance tradeoff with precise constants.
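A Monte Carlo estimate shows what the quantity measures: draw random sign vectors σ and record how well the best hypothesis in the class correlates with them on the fixed sample (the class, sample size, and trial count below are illustrative):

```python
import numpy as np

def empirical_rademacher(outputs, n_trials=2000, rng=None):
    """Monte Carlo estimate of empirical Rademacher complexity,
    E_sigma[max_h (1/m) sum_i sigma_i h(x_i)], given a (num_hypotheses, m)
    matrix of +/-1 hypothesis outputs on the sample."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n_hyp, m = outputs.shape
    sigma = rng.choice([-1, 1], size=(n_trials, m))   # random noise labelings
    correlations = sigma @ outputs.T / m              # (n_trials, n_hyp)
    return correlations.max(axis=1).mean()

# Threshold hypotheses evaluated on a fixed sample of m points:
m = 50
x = np.sort(np.random.default_rng(1).uniform(0, 1, m))
outputs = np.array([np.where(x >= t, 1, -1) for t in np.linspace(0, 1, 25)])
print(f"empirical Rademacher complexity ~ {empirical_rademacher(outputs):.3f}")
```

A richer class would track the random signs more closely and score higher, which is exactly the complexity penalty that appears in the generalization bound.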
Takeaway: VC dimension characterizes learnability for infinite hypothesis classes—sample complexity scales linearly with VC dimension, and no algorithm can fundamentally beat these bounds, making hypothesis class geometry the true determinant of learning difficulty.
Computational Tractability: When Information Isn't Enough
Information-theoretic learnability asks whether enough samples exist to learn; computational learnability asks whether we can find the good hypothesis efficiently. These questions have different answers. Some concept classes have polynomial sample complexity but no known polynomial-time algorithm—they are learnable in principle but not in practice. This gap between statistical and computational efficiency defines much of modern learning theory's frontier.
The class of polynomial-size DNF formulas (disjunctions of conjunctions) exemplifies this separation. DNF has polynomial VC dimension, so polynomially many samples suffice for learning. Yet the best known algorithms require superpolynomial time (quasi-polynomial under the uniform distribution, subexponential in general), and under plausible hardness assumptions no polynomial-time algorithm exists. Learning DNF is believed to be computationally hard despite being information-theoretically easy.
Cryptographic hardness assumptions provide the sharpest tools for proving computational lower bounds. If learning some concept class were easy, we could break certain encryption schemes. Since we believe those schemes secure, learning must be hard. This methodology transformed computational learning theory from a collection of upper bounds into a field with genuine impossibility results. The learning parity with noise problem, believed hard, underlies many negative results.
Improper learning sometimes rescues computational tractability. While proper DNF learning appears intractable, Linial, Mansour, and Nisan showed that DNF formulas (and constant-depth circuits generally) can be learned improperly under the uniform distribution in quasi-polynomial time, using low-degree Fourier approximations of the target. The learned hypothesis doesn't look like a DNF but achieves low error. This result sparked decades of work on representation-independent learning, where we seek the simplest representation that suffices for accurate prediction regardless of the target's form.
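A toy version of the low-degree idea fits in a few lines. The sketch below is not the LMN algorithm itself but its core move under the uniform distribution: estimate the Fourier coefficients of the target on small sets, then predict with the sign of the truncated expansion (the target DNF, the degree cutoff, and the sample sizes are illustrative assumptions):

```python
import itertools
import numpy as np

def learn_low_degree(f, n, degree, n_samples=50000, rng=None):
    """Estimate Fourier coefficients hat{f}(S) = E[f(x) * prod_{i in S} x_i]
    for all |S| <= degree from uniform {-1,+1}^n samples, then predict with
    the sign of the truncated Fourier expansion."""
    rng = rng if rng is not None else np.random.default_rng(0)
    X = rng.choice([-1, 1], size=(n_samples, n))
    fx = np.apply_along_axis(f, 1, X)
    coeffs = {}
    for d in range(degree + 1):
        for S in itertools.combinations(range(n), d):
            chi = X[:, list(S)].prod(axis=1) if S else np.ones(n_samples)
            coeffs[S] = float((fx * chi).mean())
    def h(x):
        total = sum(c * np.prod([x[i] for i in S]) for S, c in coeffs.items())
        return 1 if total >= 0 else -1
    return h

# Toy target: the DNF (x0 AND x1) OR x2, encoded over {-1, +1} with 1 = true.
f = lambda x: 1 if (x[0] == 1 and x[1] == 1) or x[2] == 1 else -1
h = learn_low_degree(f, n=6, degree=2)
test = np.random.default_rng(1).choice([-1, 1], size=(2000, 6))
print(np.mean([h(x) == f(x) for x in test]))  # close to 1.0
```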
The statistical query model captures algorithms that access the distribution only through noisy estimates of expectations. Any statistical query algorithm can be simulated from samples with polynomial overhead, but some problems learnable from samples are provably hard in the statistical query model: parity functions, for instance, are learnable from samples via Gaussian elimination yet require exponentially many statistical queries. This hierarchy, in which samples are strictly more powerful than statistical queries, illuminates which algorithmic techniques have fundamental limits and which might be circumvented with cleverer sample use.
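The simulation direction is straightforward to sketch: an SQ oracle answers expectation queries to within a tolerance τ, and samples suffice to implement it (names and constants below are illustrative):

```python
import numpy as np

def sq_oracle_from_samples(X, y, tau, rng=None):
    """Simulate a statistical query oracle from labeled samples: answer a
    query g(x, label) -> [-1, 1] with an estimate of E[g] accurate to ~tau,
    using O(1/tau^2) samples per query -- the polynomial overhead above."""
    rng = rng if rng is not None else np.random.default_rng(0)
    def STAT(g):
        m = int(np.ceil(4 / tau**2))            # Chernoff: enough for +-tau w.h.p.
        idx = rng.integers(0, len(X), size=m)   # resample from the pool (sketch only)
        return float(np.mean([g(X[i], y[i]) for i in idx]))
    return STAT

# A parity target is invisible to simple correlational queries:
rng = np.random.default_rng(1)
X = rng.choice([-1, 1], size=(100000, 5))
y = X[:, 0] * X[:, 2]                           # parity of features 0 and 2
STAT = sq_oracle_from_samples(X, y, tau=0.05)
print(STAT(lambda x, label: x[0] * label))      # ~0: no single feature correlates
```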
Takeaway: Computational and information-theoretic learnability are distinct—some problems have polynomial sample complexity but no efficient algorithm, and proving computational hardness often requires cryptographic assumptions that formalize the gap between knowing enough and computing fast enough.
PAC learning provides the mathematical foundation for asking—and answering—when machine learning can provide guarantees. It separates wishful thinking from provable statements, replacing hope with theorems. The framework's parameterization of accuracy and confidence makes abstract learnability questions concrete and quantifiable.
Sample complexity bounds reveal that hypothesis class geometry, captured by VC dimension, determines how much data learning requires. These bounds are tight: no algorithm escapes the fundamental limits imposed by the class's combinatorial structure. Yet information-theoretic learnability is only half the story—computational tractability introduces a second barrier that can render problems practically unsolvable despite theoretical learnability.
For researchers developing new algorithms, PAC theory offers both validation and warning. It provides tools to prove your method works with guarantees, but also reveals when you're fighting impossibility. Understanding these foundations isn't mere abstraction—it's the difference between pursuing hard problems and pursuing provably unsolvable ones.