The field of AI alignment rests upon a seemingly unassailable premise: that we can specify what humans want, and then construct systems that pursue those objectives faithfully. Billions of dollars and countless research hours flow toward this goal, treating human values as a fixed target awaiting precise characterization. Yet beneath this entire enterprise lies a question so fundamental it threatens to dissolve the problem as currently conceived—do humans actually possess the coherent, stable preferences that alignment research presupposes?

Consider the strange asymmetry at the heart of alignment discourse. We worry obsessively about whether AI systems will faithfully pursue human values, while largely taking for granted that we know what those values are. This assumption would strike any serious moral philosopher as extraordinary. Centuries of ethical inquiry have failed to produce consensus on fundamental questions of value, and cognitive science increasingly reveals that individual human preferences are far less stable and coherent than our introspective reports suggest.

What emerges when we interrogate this foundational assumption is not merely a technical complication for alignment research, but a reconceptualization of the entire problem space. If the 'target' of alignment is not a fixed point but a shifting, context-dependent, partially-constructed phenomenon, then the goal of building AI systems that reliably pursue 'human values' requires rethinking from first principles. The implications extend beyond engineering challenges into deep questions about the nature of agency, preference, and what it might mean for artificial systems to serve beings whose desires are fundamentally in flux.

Value Instability Problem

The assumption that human preferences constitute stable alignment targets collides dramatically with empirical reality. Decades of research in behavioral economics and cognitive psychology reveal that human choices are exquisitely sensitive to factors that, on any coherent theory of value, should be irrelevant. Amos Tversky and Daniel Kahneman's foundational work on framing effects demonstrated that identical outcomes elicit opposite preferences depending on whether they are described as gains or losses. This is not mere confusion—it reflects something structural about human cognition.
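To see how little separates the two framings formally, consider a minimal sketch in code. The numbers follow Tversky and Kahneman's well-known 'disease problem': every program below has the same expected outcome of 200 lives saved, yet majorities famously favor the sure option when it is framed as lives saved and the gamble when it is framed as deaths.

```python
# Expected outcomes in Tversky and Kahneman's "disease problem": 600 lives at
# stake, the same pair of options described once as gains and once as losses.

def expected_lives_saved(outcomes):
    """outcomes: list of (probability, lives_saved) pairs."""
    return sum(p * saved for p, saved in outcomes)

# Gain frame: a sure 200 saved vs. a 1/3 chance of saving all 600.
program_a = [(1.0, 200)]
program_b = [(1/3, 600), (2/3, 0)]

# Loss frame: a sure 400 deaths vs. a 2/3 chance that all 600 die.
program_c = [(1.0, 600 - 400)]
program_d = [(1/3, 600 - 0), (2/3, 600 - 600)]

for name, program in [("A", program_a), ("B", program_b),
                      ("C", program_c), ("D", program_d)]:
    print(f"Program {name}: {expected_lives_saved(program):.0f} lives saved on average")
# All four print 200: A and C are the same option, as are B and D, yet
# majorities choose A under the gain frame and D under the loss frame.
```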

The instability runs deeper than framing. Studies consistently show that preferences shift based on physiological states in ways subjects cannot detect or correct for. Judges grant far fewer favorable parole rulings in the sessions just before meal breaks. Sleep-deprived individuals exhibit markedly different risk preferences than their rested counterparts. Mood states induced by irrelevant stimuli—sunny weather, pleasant music, even the presence of attractive strangers—systematically alter the choices people make. These are not deviations from 'true' preferences; they are the preferences, varying continuously with internal and external conditions.

Temporal instability compounds the problem. The phenomenon of temporal preference reversal, in which a person ranks two future outcomes one way when both are distant and the opposite way as the nearer one approaches, is not an occasional glitch but a robust feature of human decision-making. We systematically discount future consequences hyperbolically rather than exponentially, producing predictable inconsistencies between what we want now and what we will want later. Which temporal self should alignment target?
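The arithmetic behind the reversal is simple enough to sketch. The toy comparison below assumes a hyperbolic discount function of the form value = reward / (1 + k * delay), with an illustrative rate k and made-up reward amounts: a smaller, sooner reward loses to a larger, later one when both are a month away, yet wins at the moment of choice.

```python
# Temporal preference reversal under hyperbolic discounting,
# value = reward / (1 + k * delay_in_days), with an illustrative k.

def hyperbolic_value(reward, delay_days, k=1.0):
    return reward / (1 + k * delay_days)

SMALL_SOON = 50    # smaller reward available on day t
LARGE_LATE = 100   # larger reward available on day t + 5

for days_until_t in (30, 0):
    v_small = hyperbolic_value(SMALL_SOON, days_until_t)
    v_large = hyperbolic_value(LARGE_LATE, days_until_t + 5)
    winner = "small-soon" if v_small > v_large else "large-late"
    print(f"choosing {days_until_t:2d} days in advance -> {winner}")

# Output: "large-late" when chosen a month ahead, "small-soon" at the moment
# of choice. Exponential discounting (reward * d ** delay) keeps the ratio of
# the two values constant over time, so it can never produce this flip.
```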

The construction of preferences through the act of measurement introduces further complications. Research on 'constructed preferences' shows that for many choices, people do not consult pre-existing values but rather build preferences on the spot using whatever information and heuristics are salient. The order of options presented, the default choice, the comparison set available—all systematically shape what people claim to want. Preferences, in this view, are less like fixed quantities waiting to be measured and more like quantum states that crystallize only upon observation.

For alignment research, these findings pose a fundamental challenge. If we attempt to learn human preferences through observation of behavior or stated choices, we inevitably capture not stable underlying values but context-dependent constructions. An AI system trained on such data would learn a moving target that changes based on how questions are framed, when they are asked, and what alternatives are presented. The 'alignment' achieved would be alignment with a particular measurement context, not with any coherent set of human values.
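A stylized simulation makes the worry concrete. In the sketch below, the 'preference' a learner extracts from choice data over one fixed pair of options is driven almost entirely by the mix of gain and loss framings in the training set; the choice rates are invented purely for illustration, not drawn from any study.

```python
# A stylized 'moving target' simulation: the same option pair is labeled
# differently depending on the framing used at data-collection time, so the
# learned preference tracks the framing mix, not a stable underlying value.
import random

random.seed(0)

def observed_choice(framing):
    # Stylized behavior: mostly risk-averse under a gain frame,
    # mostly risk-seeking under a loss frame.
    p_sure = 0.75 if framing == "gain" else 0.30
    return "sure" if random.random() < p_sure else "gamble"

def learned_preference_for_sure(gain_fraction, n=10_000):
    choices = [observed_choice("gain" if random.random() < gain_fraction
                               else "loss")
               for _ in range(n)]
    return choices.count("sure") / n

for gain_fraction in (0.9, 0.5, 0.1):
    estimate = learned_preference_for_sure(gain_fraction)
    print(f"{gain_fraction:.0%} gain-framed training data -> "
          f"estimated P(prefers sure option) ~ {estimate:.2f}")
# The 'learned preference' swings from favoring the sure option to favoring
# the gamble as the framing mix in the training data shifts.
```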

Takeaway

Human preferences are not fixed targets awaiting discovery but context-dependent constructions that shift with framing, mood, and circumstances—meaning alignment research may be aiming at a fundamentally unstable phenomenon.

Coherence Illusion

Despite overwhelming evidence of preference instability, humans maintain a powerful sense of possessing unified, coherent values. This felt coherence itself requires explanation, and cognitive science increasingly suggests it is largely an illusion—a narrative constructed after the fact to impose order on disparate, often contradictory impulses. The implications for alignment are profound: the 'human preferences' we seek to encode may be more confabulation than reality.

The phenomenon of choice blindness elegantly demonstrates this post-hoc construction. In striking experiments, subjects choose between two options—say, photographs of faces they find attractive—and are then handed the rejected photograph as though it were their pick and asked to explain their choice. Most subjects fail to notice the switch and fluently generate reasons for a choice they never made. This suggests that our explanations of our values and preferences are not reports from some inner sanctum of authentic desire but rather plausible stories generated to explain behavior that has already occurred.

Split-brain research provides even more dramatic evidence. Patients whose corpus callosum has been severed exhibit a revealing dissociation: their left hemisphere confabulates explanations for actions initiated by the right hemisphere, constructing coherent narratives for behaviors whose actual causes it cannot access. While most of us possess intact brains, the underlying architecture may be similar—multiple semi-autonomous systems generating behavior, with a narrative module spinning coherent stories after the fact.

The unity of the self that grounds our intuitive concept of 'what I want' may be more social construction than psychological reality. Cross-cultural research reveals that the Western conception of a unified, autonomous self with stable preferences is neither universal nor inevitable. Many cultures operate with more fluid, relational conceptions of selfhood where values are appropriately context-dependent rather than fixed. Which conception should guide alignment research?

These findings suggest that when we specify 'human values' as alignment targets, we may be reifying a folk psychological construct that does not correspond to any coherent phenomenon. The sense that we have unified preferences that an AI system could faithfully pursue may itself be a cognitive illusion—one that feels compelling precisely because the confabulation is so seamless. Alignment research that takes this illusion at face value risks optimizing for a fiction, building systems that pursue the stories we tell about our values rather than whatever complex, contradictory impulses actually drive human behavior.

Takeaway

Our sense of having unified, coherent values is largely a narrative illusion constructed after the fact—raising the question of whether 'human preferences' as an alignment target corresponds to any stable psychological reality.

Dynamic Value Architecture

If human values are neither stable nor coherent in the ways alignment research presupposes, we require fundamentally different frameworks for thinking about the relationship between AI systems and human flourishing. Rather than treating alignment as a targeting problem—finding and hitting a fixed objective—we might reconceive it as a relational problem involving ongoing negotiation between dynamic systems.

One promising direction draws from the philosophy of procedural rather than substantive values. Instead of attempting to specify what humans want, we might focus on processes that humans find valuable: being consulted, having options, maintaining agency, preserving the ability to revise choices. An AI system aligned with procedural values would not pursue fixed objectives but would maintain conditions under which humans can continue to explore and refine their evolving preferences. The goal shifts from satisfying desires to preserving the capacity for desire formation.
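One way to make this concrete, offered here only as a hypothetical sketch rather than a worked-out proposal, is to score candidate actions not just by how well they satisfy the currently inferred preference but also by how much of the human's option space remains open afterwards; the weighting, the satisfaction values, and the option sets below are assumptions made up for illustration.

```python
# A hypothetical 'procedural' scoring rule: blend how well an action satisfies
# the currently inferred preference with the fraction of the human's prior
# options that remain open afterwards. All values here are illustrative.

def procedural_score(satisfaction, options_after, options_before,
                     weight_openness=0.5):
    openness = len(options_after) / max(1, len(options_before))
    return (1 - weight_openness) * satisfaction + weight_openness * openness

options_before = {"travel", "save", "study", "invest"}

candidates = {
    # action: (satisfaction of today's stated preference, options left open)
    "lock_in_plan":  (1.0, set()),
    "keep_flexible": (0.7, {"travel", "save", "study"}),
}

for action, (satisfaction, options_after) in candidates.items():
    score = procedural_score(satisfaction, options_after, options_before)
    print(f"{action}: {score:.2f}")
# lock_in_plan scores 0.50 despite fully satisfying the stated preference;
# keep_flexible scores about 0.72 because it preserves most of the option space.
```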

Stuart Russell's proposal for AI systems that maintain uncertainty about human values and defer to human judgment represents a step toward this relational conception. Rather than confidently pursuing learned preferences, such systems would treat human values as fundamentally uncertain and seek to reduce that uncertainty through ongoing interaction. This acknowledges value instability not as a problem to be solved but as a feature to be accommodated.
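A toy sketch in the spirit of this proposal, and not a rendering of the formal assistance-game formulation, might keep a distribution over candidate models of what the human values and defer whenever those models disagree sharply about the action the system would otherwise take; the rewards, priors, and threshold below are illustrative assumptions.

```python
# A toy sketch of value uncertainty with deferral: the agent holds a
# distribution over candidate reward models and asks the human whenever those
# models disagree sharply about the action it would otherwise take.

ACTIONS = ["delete_old_files", "tidy_desk", "do_nothing"]

# (prior probability, hypothesized reward for each action)
hypotheses = [
    (0.5, {"delete_old_files": 3.0,  "tidy_desk": 1.0, "do_nothing": 0.0}),
    (0.5, {"delete_old_files": -1.0, "tidy_desk": 0.8, "do_nothing": 0.0}),
]

def act_or_defer(hypotheses, actions, disagreement_threshold=1.0):
    expected = {a: sum(p * rewards[a] for p, rewards in hypotheses)
                for a in actions}
    best = max(expected, key=expected.get)
    spread = (max(rewards[best] for _, rewards in hypotheses)
              - min(rewards[best] for _, rewards in hypotheses))
    # Act autonomously only when the hypotheses roughly agree about the value
    # of the tentatively best action; otherwise hand the decision back.
    return best if spread <= disagreement_threshold else "ask_human"

print(act_or_defer(hypotheses, ACTIONS))   # -> ask_human
```

The point of the sketch is only that autonomy is conditioned on agreement among the system's hypotheses about human values, which is the sense in which uncertainty becomes a feature to be accommodated rather than a nuisance to be eliminated.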

More radical reconceptions may be necessary. If the self is less unified than we imagine, perhaps alignment should target not individual preferences but the ecological conditions that allow diverse human impulses to find expression and negotiation. An aligned AI might function less like an agent pursuing objectives and more like an environment that facilitates human value exploration—not giving us what we want but helping us discover what we might come to want under conditions of greater reflection, information, and freedom.

These frameworks share a crucial insight: that alignment with dynamic beings cannot be achieved through static solutions. They suggest that the goal of AI alignment research might be not to solve the alignment problem once and permanently, but to develop systems capable of maintaining beneficial relationships with humans across the endless evolution of our values. This is a harder problem than aligning with fixed objectives—but it may be the actual problem we face, once we abandon the convenient fiction of stable human preferences.

Takeaway

Rather than treating alignment as hitting a fixed target, we might reconceive it as maintaining beneficial relationships with beings whose values are perpetually evolving—shifting focus from satisfying desires to preserving the capacity for desire formation.

The alignment problem, as conventionally conceived, may rest on a foundation of philosophical quicksand. The assumption that humans possess coherent, stable values awaiting faithful implementation by AI systems does not survive serious scrutiny. What we find instead are context-dependent constructions, post-hoc narratives, and perpetually shifting preferences that resist reduction to any fixed target.

This is not cause for despair but for intellectual recalibration. Recognizing the dynamic, constructed nature of human values does not make alignment impossible—it reveals what alignment must actually contend with. The challenge transforms from a targeting problem to a relational one: not how to hit a fixed objective, but how to maintain beneficial relationships with beings whose values are fundamentally in flux.

Perhaps the deepest implication is this: in building AI systems that must grapple with the instability and incoherence of human values, we may finally be forced to confront truths about ourselves that philosophy has long recognized but technology has allowed us to ignore. The alignment problem, properly understood, is not merely a technical challenge—it is an invitation to deeper self-knowledge.