When researchers, policymakers, and technologists invoke the phrase artificial general intelligence, what exactly are they pointing at? The term saturates contemporary discourse about AI's trajectory, yet it functions less as a precise scientific concept and more as a Rorschach blot — each observer projecting onto it a slightly different vision of what intelligence is, what it does, and what it would mean for a machine to possess it. This ambiguity is not merely academic. Billions of dollars in research funding, sweeping regulatory proposals, and existential safety frameworks all hinge on what we take AGI to mean.

The difficulty begins with the modifier general. Human cognition is often held up as the canonical example of general intelligence, yet our own generality is riddled with constraints — cognitive biases, domain-specific learning curves, catastrophic forgetting in certain contexts. If generality is a spectrum rather than a binary, then claiming a system has achieved AGI requires specifying how general, across what domains, and by whose metric. These are not incidental questions. They constitute the very substance of the claim.

What follows is a survey of the competing definitions, the measurement strategies proposed to operationalize them, and the argument that getting this right — achieving genuine conceptual clarity — is not a philosophical luxury but an engineering and policy necessity. The stakes of building something we cannot precisely define are, to put it mildly, considerable.

Generality Criteria: What Makes Intelligence 'General'?

The most intuitive definition of AGI appeals to task breadth: a system qualifies as generally intelligent if it can perform any intellectual task that a human being can. This formulation, often attributed to informal usage within DeepMind and OpenAI, has the virtue of simplicity. But it immediately encounters a boundary problem. Does any intellectual task include composing a sonnet, navigating social deception, experiencing grief, or understanding a joke that depends on embodied experience? The scope of 'intellectual task' is itself undefined, and different scoping choices yield radically different goalposts.

A more technically grounded approach focuses on task transfer and meta-learning. Under this view, generality is not about the catalogue of tasks a system can perform but about its capacity to adapt to novel tasks with minimal additional training. Marcus and Davis have argued that true generality requires robust transfer across domains — not merely interpolation within a training distribution but genuine extrapolation to unfamiliar problem structures. This criterion raises the bar considerably, since current large language models, despite their breadth, often struggle with out-of-distribution reasoning that even young children handle effortlessly.
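
One way to make this criterion concrete is to score a system by how well it performs on held-out task families after only a handful of adaptation examples. The sketch below is illustrative rather than a real evaluation harness: the adapt callable, the task-family dictionaries, and the toy memorizer baseline are hypothetical placeholders.

```python
# Sketch of a transfer-based generality score: mean held-out accuracy after
# adapting on only k examples per previously unseen task family.
from statistics import mean
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (input, target) pairs; placeholder task format


def transfer_score(
    adapt: Callable[[Sequence[Example]], Callable[[str], str]],
    task_families: Sequence[dict],
    k: int = 5,
) -> float:
    """Average accuracy on held-out items after a small, fixed adaptation budget."""
    scores = []
    for family in task_families:
        predictor = adapt(family["support"][:k])        # e.g. fine-tune or prompt
        held_out = family["held_out"]
        correct = sum(predictor(x) == y for x, y in held_out)
        scores.append(correct / len(held_out))
    return mean(scores)


def memorizer(support: Sequence[Example]) -> Callable[[str], str]:
    """A deliberately weak baseline: memorize the support set, guess nothing else."""
    table = dict(support)
    return lambda x: table.get(x, "?")


toy_family = {"support": [("2+2", "4"), ("3+3", "6")],
              "held_out": [("2+2", "4"), ("5+5", "10")]}
print(transfer_score(memorizer, [toy_family]))  # 0.5: memorization alone does not transfer
```

Under this framing, what matters is the gap between such a memorizing baseline and a system that closes most of the distance to competent performance from the same few examples.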

A third school of thought, influenced by Legg and Hutter's formal framework, defines intelligence as an agent's ability to achieve goals across a wide range of environments. Here, generality is a function of environmental diversity — the more varied the contexts in which a system can act competently, the more general its intelligence. This formulation is mathematically elegant but practically intractable: the space of possible environments is infinite, and any finite test suite risks privileging certain kinds of generality over others.
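
Their proposal can be stated compactly. The block below gives the standard form of the Legg-Hutter universal intelligence measure (notation lightly adapted): an agent's intelligence is its expected performance across all computable environments, weighted toward simpler environments. The intractability noted above is visible in the formula itself, since the Kolmogorov complexity term is uncomputable and the sum ranges over an infinite space that any real test suite must truncate.

```latex
% Legg-Hutter universal intelligence measure.
% E         : the set of computable environments
% K(\mu)    : Kolmogorov complexity of environment \mu (uncomputable in general)
% V_\mu^\pi : expected cumulative reward of agent \pi acting in environment \mu
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V_\mu^{\pi}
```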

More recently, proposals from researchers like Chollet have shifted the emphasis toward open-ended capability acquisition — the ability not merely to solve problems but to define new ones, to generate novel abstractions, and to build conceptual scaffolding that wasn't present in the training signal. Chollet's Abstraction and Reasoning Corpus (ARC) was designed explicitly to test this kind of fluid, domain-general reasoning, and current AI systems perform remarkably poorly on it relative to humans. This suggests that whatever generality current systems possess, it may be qualitatively different from human-style generality.
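
The flavor of ARC can be conveyed with a toy example. The sketch below mirrors ARC's structure of paired input/output integer grids split into demonstrations and a held-out test, but the specific grids and the flip_horizontal rule are invented for illustration; real ARC tasks require inducing an unknown transformation from the demonstrations alone, often one the solver has never encountered in any form.

```python
# Toy ARC-style task: a hidden transformation shared by a few demonstration pairs,
# which must also explain a held-out test pair. Grids are small lists of lists of
# integers standing in for colors; the rule and grids here are invented examples.
task = {
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[5, 5, 0]],      "output": [[0, 5, 5]]},
    ],
    "test": [
        {"input": [[7, 0, 4]], "output": [[4, 0, 7]]},
    ],
}


def flip_horizontal(grid):
    """Candidate rule: mirror each row left to right."""
    return [list(reversed(row)) for row in grid]


def solves(rule, pairs):
    """True if the candidate rule reproduces every output grid exactly."""
    return all(rule(p["input"]) == p["output"] for p in pairs)


assert solves(flip_horizontal, task["train"])  # the rule fits the demonstrations
assert solves(flip_horizontal, task["test"])   # and carries over to the test pair
```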

The deeper issue is that these criteria are not merely different metrics for the same underlying property. They reflect fundamentally different theories of what intelligence is. Task breadth treats intelligence as a portfolio. Transfer learning treats it as an optimization capacity. Environmental robustness treats it as adaptive fitness. Open-ended abstraction treats it as a creative generative process. Until the field converges on which of these — or which combination — constitutes the relevant notion of generality, the term AGI will continue to mean different things to different people, with all the confusion that entails.

Takeaway

Generality in intelligence is not a single property but a family of distinct capacities — task breadth, transfer, environmental robustness, and open-ended abstraction — and conflating them obscures more than it reveals.

Benchmark Limitations: The Measurement Problem

If defining AGI is fraught, measuring it is arguably worse. The oldest and most famous proposal — the Turing Test — asks whether a machine can produce linguistic behavior indistinguishable from a human's. As a cultural touchstone it has been enormously influential, but as a scientific criterion it is deeply flawed. Conversational mimicry is neither necessary nor sufficient for general intelligence. A system could pass the Turing Test through sophisticated pattern matching and social manipulation without possessing anything resembling flexible reasoning. Conversely, a genuinely intelligent system might fail it simply by being too precise, too fast, or too alien in its communicative style.

The field has since moved toward benchmark suites — standardized collections of tasks designed to probe different cognitive capacities. MMLU, BIG-bench, HELM, and similar frameworks test knowledge, reasoning, and language understanding across hundreds of domains. These have real value as engineering tools, but they share a critical limitation: they are static. Once a benchmark exists, it becomes a target. Systems are optimized, deliberately or otherwise, to perform well on it, and performance on the benchmark decouples from the underlying capability the benchmark was designed to measure. This is Goodhart's Law applied to intelligence measurement, and it is remarkably persistent.
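
The dynamic is easy to simulate. The toy sketch below uses synthetic numbers rather than measurements of any real system: a benchmark is treated as a fixed sample from an underlying capability, and as more of that fixed sample is tuned against, the benchmark score climbs while performance on fresh items stays put.

```python
# Simulated illustration of Goodhart's Law in benchmarking (synthetic, not real data).
import random

random.seed(0)
TRUE_CAPABILITY = 0.60            # chance of solving a genuinely new item
benchmark = list(range(500))      # a fixed, public set of item ids


def score(items, capability, memorized):
    """Fraction solved: memorized items always pass; the rest pass at the true rate."""
    solved = sum(1 if i in memorized else random.random() < capability for i in items)
    return solved / len(items)


for tuned_fraction in (0.0, 0.3, 0.6, 0.9):
    memorized = set(benchmark[: int(tuned_fraction * len(benchmark))])
    on_benchmark = score(benchmark, TRUE_CAPABILITY, memorized)
    on_fresh = score(range(10_000, 10_500), TRUE_CAPABILITY, memorized)  # unseen items
    print(f"tuned on {tuned_fraction:.0%} of benchmark: "
          f"benchmark={on_benchmark:.2f}  fresh={on_fresh:.2f}")
```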

An alternative approach, gaining traction in some policy circles, defines AGI in economic terms: a system qualifies as AGI if it can perform the majority of economically valuable work currently done by humans. OpenAI's charter adopts this framing explicitly, describing AGI as highly autonomous systems that outperform humans at most economically valuable work. The economic framing has the advantage of being concrete and consequential — economic impact is, after all, what many stakeholders actually care about. But it smuggles in enormous normative assumptions. Economic value is culturally contingent, market-dependent, and shaped by institutional structures. A system that can replace most white-collar labor but cannot navigate a novel physical environment or engage in genuine scientific discovery would satisfy the economic definition while falling short of what many researchers mean by general intelligence.
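
To see how much those normative assumptions matter, consider one plausible way the economic criterion might be operationalized: a wage-weighted ledger of the tasks a system can perform. Every figure in the sketch below is an invented placeholder rather than real labor data; the point is only that the resulting "share of economically valuable work" depends entirely on which occupations are counted and how their tasks are priced.

```python
# Wage-weighted share of work a system could perform (all numbers are placeholders).
occupations = [
    # (label, annual wage bill in $B, fraction of tasks the system can do)
    ("software development", 400, 0.55),
    ("paralegal work",        40, 0.70),
    ("long-haul trucking",   130, 0.05),
    ("nursing",              300, 0.10),
]

total_wages = sum(wages for _, wages, _ in occupations)
automatable = sum(wages * frac for _, wages, frac in occupations)
print(f"wage-weighted share of work performed: {automatable / total_wages:.1%}")
# Change the occupation list or the wage weights and the headline share moves,
# without the system itself changing at all.
```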

More fundamentally, all these measurement approaches share a methodological problem that Stuart Russell has articulated with particular clarity: we tend to measure AI capabilities by comparison to human performance on human-designed tasks. This anthropocentric benchmarking may be systematically misleading. A truly general artificial intelligence might excel in ways our tests cannot capture and fail in ways our tests do not probe. The measurement instruments are shaped by our own cognitive architecture, and there is no guarantee that they map cleanly onto the architecture of a radically different kind of mind.

The upshot is that we lack not just a definition of AGI but a credible operationalization — a way to know, with reasonable confidence, when we have built it. This epistemic gap is not a minor inconvenience. It means that claims about AGI timelines, whether optimistic or pessimistic, are anchored to criteria that remain contested. We are, in a very real sense, debating the distance to a destination whose coordinates we have not agreed upon.

Takeaway

Every proposed AGI benchmark — conversational, task-based, or economic — encodes implicit assumptions about what intelligence is, and optimizing for any fixed metric risks mistaking the map for the territory.

Conceptual Clarity Value: Why Definitions Are Not Pedantry

It might seem that definitional disputes are a philosophical indulgence — interesting in seminar rooms but irrelevant to the engineers training the next generation of models. This view is mistaken, and dangerously so. The definition of AGI directly shapes what safety research prioritizes. If AGI is understood as a system that can match human performance across a broad task suite, then alignment research focuses on constraining a powerful optimizer. If AGI is understood as open-ended capability acquisition, then the relevant safety concern shifts toward systems that recursively improve in unpredictable ways. These are not the same problem, and the resources allocated to each depend on which definition holds sway.

Policy is equally sensitive to definitional choices. The EU AI Act, executive orders from the White House, and proposed international governance frameworks all reference advanced AI systems that approach or exceed human-level capability. But which capabilities? At what level? Across which domains? Regulatory language that is vague about what constitutes AGI risks being either toothless — failing to trigger when a genuinely dangerous capability emerges — or overly broad, stifling beneficial research that falls far short of any reasonable AGI threshold.

There is also an epistemic dimension that deserves attention. The way we define AGI shapes how we understand what we are trying to create, and this in turn influences research direction. If AGI is framed as a scaling problem — more data, more compute, more parameters — then the field gravitates toward brute-force approaches. If it is framed as requiring qualitative breakthroughs in representation, reasoning, or grounding, then research diversifies toward cognitive architectures, embodied AI, and neurosymbolic integration. The definition is not downstream of the science; it is upstream of it.

Russell's work on the control problem makes this point with particular force. He argues that the greatest risk comes not from building a system that is maximally capable but from building one whose objectives are misspecified. If we cannot precisely articulate what general intelligence means, we cannot precisely articulate what it would mean for such a system to be aligned with human values. The alignment problem is, at its root, a problem of specification, and specification requires definitional clarity.

None of this means we need a single, universally accepted definition before proceeding. Science often advances productively with provisional concepts. But the field would benefit enormously from explicit acknowledgment of which definition is operative in any given claim, prediction, or policy proposal. When someone says AGI is five years away, or fifty, or impossible, the most informative response is not agreement or disagreement but rather: which AGI? The question is not pedantic. It is the question.

Takeaway

Definitional precision about AGI is not an academic luxury — it is the precondition for coherent safety research, meaningful policy, and honest communication about what the field is actually pursuing.

The term artificial general intelligence currently functions as a shared signifier without a shared referent. Researchers invoke it to describe systems that are maximally broad, maximally adaptive, maximally autonomous, or some combination thereof — and these are not equivalent aspirations. The lack of convergence is not a failure of effort but a reflection of genuine, unresolved disagreements about the nature of intelligence itself.

This conceptual fragmentation has consequences that cascade through safety research, regulatory design, and public understanding. Every timeline prediction, every governance framework, every existential risk assessment inherits the ambiguity of its foundational terms. Clarity here is not a nicety; it is infrastructure.

Perhaps the most productive stance is one of disciplined pluralism: maintaining multiple working definitions, being explicit about which is operative in any given context, and resisting the temptation to let a single framing foreclose inquiry. The question what would AGI actually be? may have no final answer — but the rigor with which we pursue it will shape the intelligence we ultimately build.