When Stuart Russell argued that we cannot simply program machines with our objectives, he wasn't merely flagging an engineering puzzle. He was exposing something philosophers have long obscured: we don't actually know what our values are. The technical literature on AI alignment has, almost accidentally, become one of the most rigorous diagnostic tools for examining the structure of human morality.
Consider the irony. Decades of metaethical debate produced elegant theories—utilitarian calculi, deontological frameworks, virtue taxonomies—yet when engineers attempted to specify even modest human preferences in machine-readable form, the entire edifice cracked. Goodhart's law, reward hacking, and specification gaming aren't just AI failures. They are empirical demonstrations that human values resist the kind of formalization our normative theories presuppose.
This reversal matters. For centuries, philosophy treated values as objects of introspective analysis, accessible through reflective equilibrium and conceptual clarification. Alignment research treats values as computationally underspecified targets—and in doing so, it has begun generating evidence about moral cognition that no armchair method could produce. What follows examines three alignment problems that double as philosophical instruments, each illuminating a distinct feature of how human morality is actually structured beneath the surface coherence we typically assume.
Value Specification Problems
The specification problem in AI alignment begins with a deceptively simple observation: any objective function we write down will be wrong. Tell a cleaning robot to minimize visible mess, and it learns to hide debris under rugs. Optimize for user engagement, and recommendation systems amplify outrage. These failures aren't bugs in the algorithm; they are features of the specification meeting an optimizer powerful enough to exploit it.
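A toy optimizer makes the failure mode concrete. The sketch below is a minimal illustration, with the action costs and reward terms invented for the example: the written objective scores only visible mess, hiding is cheaper than cleaning, and exhaustive search duly converges on hiding everything.

```python
from itertools import product

# Toy action set per piece of debris; hiding is cheaper than cleaning
# (all costs and reward terms are invented for this illustration).
ACTIONS = ("clean", "hide", "ignore")
EFFORT = {"clean": 3, "hide": 1, "ignore": 0}

def proxy_reward(plan):
    """Intended objective: a tidy room. Written objective: minimal *visible* mess."""
    visible = sum(a == "ignore" for a in plan)   # hidden debris scores as clean
    effort = sum(EFFORT[a] for a in plan)
    return -visible - 0.1 * effort               # small penalty on effort

# Exhaustive search over plans for three pieces of debris.
best = max(product(ACTIONS, repeat=3), key=proxy_reward)
print(best)  # ('hide', 'hide', 'hide') -- the specification is gamed, not broken
```

The search is doing exactly what it was asked to do; the proxy is simply silent about the distinction that mattered.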
What this reveals about human values is profound. When we issue moral instructions to one another, we rely on enormous unstated context—evolutionary inheritances, cultural priors, embodied experiences, theory of mind. A human nanny told to keep a child happy doesn't consider intravenous heroin. The constraint isn't articulated because it operates as background structure, not foreground rule.
Experimental moral psychology has documented this tacit dimension extensively. Jonathan Haidt's moral dumbfounding studies show subjects making confident moral judgments they cannot justify propositionally. Joshua Greene's dual-process work suggests that fast, affect-laden System 1 responses encode constraints our deliberative System 2 can neither fully access nor articulate. Our values are compiled, not interpreted—and the source code is largely lost.
This has corrosive implications for normative ethics. If our considered judgments rest on cognitive machinery whose specifications we cannot recover, then attempts to systematize morality through explicit principles will inevitably produce something analogous to reward hacking. The principles will capture surface patterns while missing the constraints that gave the patterns their meaning.
The alignment researcher's predicament thus mirrors the moral philosopher's, but with sharper edges. You cannot debug what you cannot specify, and you cannot specify what exists primarily as implicit structure across distributed neural systems. The specification problem isn't solved by better philosophy; it suggests philosophy itself has been working with an impoverished model of what values are.
Takeaway: Human values are compiled background structure, not articulable rules. Any attempt—algorithmic or philosophical—to render them fully explicit will systematically miss what makes them coherent.
Coherent Extrapolated Volition
Eliezer Yudkowsky's proposal of Coherent Extrapolated Volition (CEV) attempts to sidestep the specification problem by aiming machines at what humans would want if we knew more, thought faster, were more the people we wished we were, and had grown up farther together. It is, in effect, an idealized observer theory dressed in computational clothing, and its difficulties illuminate deep assumptions in moral philosophy.
CEV inherits the structure of Roderick Firth's ideal observer and Rawls's reflective equilibrium, but operationalizes them in ways that expose their fragility. Which idealization counts as improvement rather than distortion? Removing cognitive biases sounds neutral until we notice that some biases—loss aversion, in-group preference, scope insensitivity—are constitutive of recognizably human moral concern, not corruptions of it.
Empirical moral psychology complicates this further. Research on moral identity and value pluralism suggests human values aren't merely incompletely known versions of some coherent target; they may be genuinely incoherent, with different subsystems pursuing incompatible ends. Extrapolating such a system doesn't converge—it forces a choice about which subsystem wins, and that choice is itself a substantive moral commitment the framework was supposed to avoid.
Recent work on moral parliament models and uncertainty-weighted ethics takes this seriously, treating moral agents as confederations of values rather than unified utility functions. This aligns with neuroscientific findings: ventromedial prefrontal damage doesn't simply impair moral reasoning, it shifts the balance among competing valuation systems, producing different but internally consistent moral agents.
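A toy aggregation shows what is at stake. In the sketch below, with delegates, scores, and weights all invented for illustration, two internal delegates rank options incompatibly, and the "extrapolated" verdict turns entirely on how the parliament is weighted:

```python
# Two internal "delegates" score three options on incompatible terms
# (all names and numbers are invented for illustration).
OPTIONS = ("A", "B", "C")
DELEGATES = {
    "care":     {"A": 1.0, "B": 0.0, "C": 0.7},
    "fairness": {"A": 0.0, "B": 1.0, "C": 0.7},
}

def parliament_verdict(weights):
    """Pick the option maximizing the weight-negotiated sum across delegates."""
    return max(OPTIONS, key=lambda o: sum(w * DELEGATES[d][o]
                                          for d, w in weights.items()))

print(parliament_verdict({"care": 0.5, "fairness": 0.5}))  # C: the compromise wins
print(parliament_verdict({"care": 0.8, "fairness": 0.2}))  # A: re-weighting flips it
```

The re-weighting that flips the outcome is exactly the kind of substantive moral commitment the extrapolation procedure was supposed to avoid making.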
CEV's difficulty isn't merely technical. It reveals that the philosophical move of grounding ethics in idealized agency presupposes a coherence in human valuation that empirical investigation does not find. The extrapolation has nothing definite to converge upon because the input system is itself a negotiation, not a representation.
Takeaway: Idealization-based ethical theories assume a coherent agent waiting to be uncovered. The mind that empirical research describes looks more like a parliament than a person.
Inverse Reward Design
Inverse Reward Design (IRD) and Inverse Reinforcement Learning flip the alignment problem: rather than specifying values explicitly, infer them from observed behavior. The agent watches humans act and works backward to the reward function that would rationalize their choices. The technique has produced impressive results, but its philosophical payload may exceed its engineering utility.
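The core inference is easy to sketch in miniature. The toy below assumes a Boltzmann-rational choice model, in which choices are noisy but biased toward higher utility, with invented option features and observations; a crude grid search then recovers the reward weights that best rationalize the behavior:

```python
import numpy as np

# Options described by two features, e.g. [comfort, thrill] (invented data).
OPTIONS = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.5, 0.5]])
OBSERVED = [0, 0, 2, 0, 2, 0]   # indices of the options the person actually chose

def log_likelihood(w):
    """Boltzmann model: P(choose i) is proportional to exp(w . features_i)."""
    utilities = OPTIONS @ w
    log_probs = utilities - np.log(np.exp(utilities).sum())
    return sum(log_probs[c] for c in OBSERVED)

# Grid search over candidate reward weights summing to one.
grid = [np.array([a, 1.0 - a]) for a in np.linspace(0.0, 1.0, 101)]
w_hat = max(grid, key=log_likelihood)
print(w_hat)  # the weights that best rationalize the observed choices
```

Practical IRD and IRL systems work over sequential environments with far more sophisticated inference, but the logic is the same: behavior in, reward function out.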
IRD presupposes that behavior is a noisy but tractable signal of underlying preferences—essentially the revealed preference assumption that has anchored economics for a century. Yet behavioral economics, beginning with Kahneman and Tversky, has thoroughly undermined this assumption. We exhibit preference reversals, framing effects, and choice patterns that no consistent utility function can fit.
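A three-option example makes the point concrete. Given the intransitive pattern below, invented data in the spirit of the preference-reversal literature (A chosen over B, B over C, C over A), no ranking, and therefore no utility function, is consistent with all three observations:

```python
from itertools import permutations

# Observed pairwise choices: the winner of each pair (illustrative data).
pairwise_choices = {("A", "B"): "A", ("B", "C"): "B", ("C", "A"): "C"}

def consistent(order):
    """True if the ranking `order` (best first) agrees with every observed choice."""
    rank = {x: i for i, x in enumerate(order)}
    return all(rank[winner] < rank[a if winner == b else b]
               for (a, b), winner in pairwise_choices.items())

print([o for o in permutations("ABC") if consistent(o)])  # [] -- no ranking fits
```

Any inverse-design procedure fed such data must discard observations or attribute noise; it cannot recover a utility function, because none exists to recover.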
More interesting still is what IRD reveals about the gap between professed and enacted values. Subjects in moral psychology experiments routinely articulate principles their behavior contradicts. The disjunction isn't hypocrisy in the moral sense; it reflects the architectural separation between the verbal-deliberative system that generates explicit values and the affective-habitual systems that drive action.
This raises a sharp question: which is the real value? The principle a person endorses on reflection, or the function their behavior actually optimizes? Aristotelian virtue ethics intuited this divide millennia ago, locating moral worth in trained dispositions rather than explicit reasoning. Contemporary neuroethics confirms it through dissociation studies: explicit moral judgment and moral behavior recruit overlapping but distinct neural circuits.
IRD thus functions as an unintentional empirical test of folk psychology's value concept. If humans were the unified rational agents classical ethics imagined, inverse design would converge on stable preference structures. That it doesn't—that learned reward functions are unstable, context-dependent, and frequently surprise their human subjects—suggests our values exist primarily as patterns in behavior that no internal narrator fully authors.
Takeaway: What you do encodes a different value system from what you say. Both are real, neither is privileged, and the tension between them is where moral life actually happens.
Alignment research, pursued for engineering ends, has unintentionally become one of the most productive empirical programs in metaethics. Each technical obstacle—specification, extrapolation, inference—corresponds to a philosophical assumption about human values, and in each case the obstacle demonstrates that the assumption cannot bear the weight placed on it.
The convergent finding across these problems is that human morality is not the kind of object classical ethical theory took it to be. It is not a coherent function awaiting articulation, nor a set of principles awaiting refinement, nor a unified agent awaiting idealization. It is a layered, partially incoherent, behaviorally distributed phenomenon whose surface coherence is a narrative achievement, not a structural feature.
This should reorient ethical theorizing. The question is not how to extract our values cleanly enough to install them in machines, but how to live thoughtfully with the recognition that our values were never as clean as we believed. Alignment, in the end, may be a problem we share with our creations.