When you ask a language model about the capital of France, something happens in its billions of parameters. But what happens? For years, neural networks have been treated as inscrutable black boxes—systems that work but resist explanation. Mechanistic interpretability aims to change that, opening these systems to reveal the computational structures within.

The stakes extend beyond academic curiosity. Understanding what transformers learn bears directly on questions of AI safety, capability forecasting, and the fundamental nature of machine intelligence. If these systems merely memorize statistical patterns, they represent sophisticated but limited tools. If they construct genuine representations of the world—models that capture causal structure and enable generalization—we face something far more consequential.

Recent research has begun illuminating the internal geometry of these models. Researchers can now identify specific features, trace computational circuits, and evaluate evidence for world models. What emerges is neither the blind statistical associationism some skeptics propose nor the human-like understanding optimists imagine. The reality proves stranger and more instructive than either extreme.

Feature Representation Discovery

The first breakthrough in understanding transformers came from finding interpretable features within their activation spaces. Early work discovered that individual neurons sometimes encode recognizable concepts—a neuron firing for mentions of the Golden Gate Bridge, another responding to legal terminology. But this neuron-level interpretability proved the exception rather than the rule.

The deeper insight emerged from recognizing that concepts are encoded not in individual neurons but in directions within high-dimensional activation space. This is the superposition hypothesis: models compress far more features into their activation space than they have dimensions, encoding them as overlapping, nearly orthogonal directions. A single neuron participates in representing thousands of distinct features, and a single feature is distributed across thousands of neurons.
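A toy NumPy sketch makes the geometry concrete. The sizes are invented and random unit vectors stand in for learned feature directions; the point is only that far more nearly orthogonal directions than dimensions can coexist, and that a sparse set of active features can still be read off with dot products.

```python
import numpy as np

# Toy illustration of superposition: many more feature directions than
# dimensions, with active features read back via dot products.
rng = np.random.default_rng(0)
d_model, n_features = 256, 2048          # illustrative sizes only

# Random unit vectors in high dimensions are nearly orthogonal.
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# An activation vector encoding a sparse set of active features.
active = [3, 41, 200]
x = directions[active].sum(axis=0)

# Projecting onto every direction: active features score close to 1,
# while inactive features show only smaller interference terms caused
# by the directions not being exactly orthogonal.
scores = directions @ x
print(scores[active])
print(np.abs(np.delete(scores, active)).max())
```

The interference grows with the number of simultaneously active features, which is one reason sparsity is central to the decomposition methods described next.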

Sparse autoencoders have become the primary tool for extracting these features. By training auxiliary networks to reconstruct model activations using sparse combinations of learned features, researchers can decompose the model's internal representations into interpretable components. The results reveal features for abstract concepts—deception, sycophancy, code correctness—alongside concrete ones like specific entities or syntactic patterns.
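A minimal sketch of that setup, assuming PyTorch, invented layer sizes, and random tensors standing in for cached model activations; real training runs add details such as careful bias handling, decoder-weight normalization, and activations gathered from the model being studied rather than noise.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over model activations (sizes illustrative)."""
    def __init__(self, d_model=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))   # sparse, non-negative feature activations
        return self.decoder(features), features

sae = SparseAutoencoder()
acts = torch.randn(32, 512)        # stand-in for cached residual-stream activations
recon, features = sae(acts)

# Objective: reconstruct the activations while keeping feature use sparse.
l1_coeff = 1e-3
loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
loss.backward()
```

Each column of the trained decoder weight matrix then serves as a candidate feature direction, and the inputs that most strongly activate it can be inspected by hand.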

What makes these discoveries significant is their consistency. The same features appear across different contexts, suggesting stable internal representations rather than ad-hoc pattern matching. A feature for 'first person plural' activates reliably whether the text discusses philosophy or recipes. This stability hints at something like genuine conceptual structure.

Yet interpretability remains incomplete. We can identify features, but we cannot yet enumerate all features a model uses or prove we've found the 'true' representation rather than one of many equivalent decompositions. The features we find may reflect our analysis methods as much as the model's internal organization. This epistemic humility must accompany every claim about what models 'really' represent.

Takeaway

Models encode concepts as directions in high-dimensional space, not individual neurons—meaning interpretation requires understanding geometry, not just components.

Circuit Analysis Methods

Knowing what features exist is only half the puzzle. The other half is understanding how the model computes with them. Circuit analysis traces the flow of information through transformer layers, identifying the specific attention heads and MLP layers that implement recognizable computations.

The methodology involves careful intervention experiments. Researchers ablate specific components, patch activations between different inputs, and measure how these changes affect model outputs. When removing a particular attention head eliminates a specific capability while leaving others intact, that head likely implements part of the relevant circuit.
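The patching pattern itself is simple. The sketch below uses PyTorch forward hooks on a toy stand-in network; in real experiments the hooked module is a specific attention head or MLP inside a transformer, and the 'clean' and 'corrupted' inputs are carefully matched prompts rather than random vectors.

```python
import torch
import torch.nn as nn

# Stand-in for a real model; in practice this is a transformer and the
# hooked module is a particular attention head or MLP layer.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
layer = model[0]
clean_input, corrupted_input = torch.randn(1, 16), torch.randn(1, 16)

cache = {}

def cache_hook(module, inputs, output):
    cache["act"] = output.detach()       # save the clean activation; output unchanged

def patch_hook(module, inputs, output):
    return cache["act"]                  # replace the output with the cached clean activation

# 1. Run on the clean input and cache the layer's activation.
handle = layer.register_forward_hook(cache_hook)
clean_out = model(clean_input)
handle.remove()

# 2. Run on the corrupted input with the clean activation patched in.
handle = layer.register_forward_hook(patch_hook)
patched_out = model(corrupted_input)
handle.remove()

# Patching this entire layer trivially restores the clean output here; real
# experiments patch single heads or positions and measure partial recovery.
print(torch.allclose(patched_out, clean_out))
```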

This approach has revealed surprisingly interpretable structures. 'Induction heads' copy patterns from earlier in the context, explaining much of in-context learning. 'Name mover' heads in the indirect-object-identification circuit attend back to the correct name and copy it into the output. Circuits for greater-than comparison, modular arithmetic, and multi-step reasoning have been mapped in detail.
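For intuition, here is the behavior an induction head implements, written out as plain Python over an invented token list. The real mechanism is a pair of attention heads matching and copying across positions, not a loop; the sketch only shows the input-output rule.

```python
# Induction-head behavior: when the current token occurred earlier in the
# context, look at whatever followed it and predict that token again.
tokens = ["Mr", "Dursley", "was", "thin", "Mr"]
current = tokens[-1]

prediction = None
for i in range(len(tokens) - 1):
    if tokens[i] == current:
        prediction = tokens[i + 1]       # copy the earlier continuation

print(prediction)                        # "Dursley"
```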

The circuits discovered often implement algorithms recognizable to computer scientists. A model performing modular addition doesn't learn arbitrary associations but constructs something like a Fourier basis for representing cyclic quantities. Models tracking boolean states in code maintain something resembling symbolic variables. The computations are neither human-like cognition nor brute statistical association but a third category—learned algorithms that solve problems efficiently given the constraints of transformer architecture.
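The modular-addition case can be made concrete: represent each residue as an angle on the unit circle, so that adding residues modulo p becomes adding angles. Trained networks spread this trick across several frequencies and trig identities over learned embeddings; the single-frequency sketch below, with an arbitrary modulus, just shows why a Fourier-style encoding fits the cyclic structure.

```python
import numpy as np

# Modular addition via angles: encode residues as points on the unit
# circle, multiply the phases (which adds the angles), then read the
# angle back off. Single frequency only; trained models use several.
p = 113                                   # arbitrary modulus for illustration
a, b = 47, 90
freq = 2 * np.pi / p

za = np.exp(1j * freq * a)
zb = np.exp(1j * freq * b)
zc = za * zb                              # angle of zc encodes (a + b) mod p

recovered = int(round(np.angle(zc) / freq)) % p
print(recovered, (a + b) % p)             # both print 24
```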

Circuit analysis faces scaling challenges. Small models yield to complete circuit mapping; large models remain partially opaque. We can identify circuits for specific behaviors, but the full computational graph of a frontier model exceeds current analysis capabilities. The field progresses through careful case studies rather than comprehensive understanding.

Takeaway

Transformers implement recognizable algorithms through distributed circuits—not random statistical associations, but learned computational structures that solve problems efficiently within architectural constraints.

World Model Hypothesis

The most contentious question in interpretability is whether language models construct world models—internal representations that capture causal structure and enable genuine understanding—or merely learn sophisticated surface statistics that mimic such understanding.

Evidence has accumulated on both sides. Models trained purely on text learn to predict game states in chess and Othello, tracking legal moves and board positions despite never seeing explicit state representations. They develop internal representations of spatial relationships, temporal sequences, and causal dependencies. These findings suggest something more than surface pattern matching.

The strongest evidence comes from probing experiments that find linear representations of world state within model activations. A linear classifier can extract board position from Othello-playing models with high accuracy. Models representing geographic locations organize them spatially within activation space. Color terms cluster according to perceptual similarity. These geometric structures suggest the models construct representations that reflect genuine properties of the domains they model.
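A linear probe itself is nothing exotic: cache activations, fit a linear classifier against labels describing the world state, and check held-out accuracy. The sketch below uses scikit-learn on synthetic 'activations' with a planted linear direction, standing in for real cached activations and board-state labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: activations with one planted direction that
# determines the label, in place of real residual-stream activations
# labeled with board state (e.g. "this square is occupied").
rng = np.random.default_rng(0)
d_model, n_samples = 128, 2000
planted_direction = rng.normal(size=d_model)

acts = rng.normal(size=(n_samples, d_model))
labels = (acts @ planted_direction > 0).astype(int)

# High held-out accuracy for a *linear* model is the evidence of interest:
# the property is readable as a direction in activation space.
probe = LogisticRegression(max_iter=1000).fit(acts[:1500], labels[:1500])
print(probe.score(acts[1500:], labels[1500:]))
```

Probing alone shows only that the information is present, not that the model uses it; interventions that edit activations along the probed direction and change downstream behavior are the stronger test.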

Yet skeptics raise important objections. What looks like a world model might be a sophisticated compression of statistical regularities that happens to mirror world structure because the world generated the training data. The model might represent 'what text about X looks like' rather than 'what X is.' Distinguishing these possibilities proves surprisingly difficult.

The truth likely involves both elements. Models clearly learn more than simple n-gram statistics—their representations exhibit structure that enables systematic generalization. But these representations may differ fundamentally from human conceptual understanding. They capture patterns useful for prediction without necessarily constituting the kind of causal models that support robust reasoning. The question may be less 'do models have world models?' than 'what kind of world models, and for what purposes?'

Takeaway

Language models construct internal representations with geometric structure reflecting domain properties—but whether this constitutes genuine understanding or sophisticated statistical compression remains the central open question.

Mechanistic interpretability has progressed from aspiration to methodology. We can now identify features, trace circuits, and probe for world models with increasing precision. What we find defies simple characterization: neither the blind associationism of skeptics nor the human-like understanding of enthusiasts.

Transformers learn structured representations organized geometrically in activation space. They implement recognizable algorithms through distributed circuits. They construct internal models that reflect genuine properties of the domains they learn from. Whether this constitutes 'understanding' depends on definitions we have yet to agree upon.

The practical implications are significant. Better interpretability enables better safety evaluation, capability assessment, and targeted intervention. The philosophical implications run deeper still—these systems force us to articulate what we mean by understanding, representation, and knowledge itself.