When you stack transformer layers, something remarkable happens that wasn't explicitly programmed. The network spontaneously organizes itself into a hierarchy of increasingly abstract representations. Early layers detect surface patterns. Middle layers capture relationships. Deep layers encode high-level concepts and world knowledge.

This emergent behavior mirrors how biological neural systems process information, from raw sensory data to abstract reasoning. But unlike a brain, a transformer lets us peer inside and measure exactly what each layer learns. The results reveal a surprisingly consistent computational architecture across different models and training objectives.

Understanding this hierarchy isn't just academic curiosity. It directly informs how we design, train, and deploy transformer systems. Knowing which layers encode what information helps us build more efficient models, diagnose failures, and transfer knowledge between tasks. Let's examine the evidence for hierarchical learning and the mechanisms that drive it.

Layer-wise Probing Reveals Specialized Representations

Researchers probe transformer layers by training simple classifiers on their hidden states to predict linguistic properties. This technique reveals a striking pattern: different layers specialize in different types of information. The specialization follows a consistent trajectory from surface form to deep meaning.

In BERT-style models, layers 1-3 primarily encode positional and lexical information—word identity, part of speech, basic morphology. Layers 4-8 capture syntactic structure, including dependency relationships and phrase boundaries. The final layers encode semantic information: coreference, entity types, and factual knowledge about the world.
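To make the technique concrete, here is a minimal probing sketch in Python. It assumes the Hugging Face transformers library, scikit-learn, and the bert-base-uncased checkpoint, and it uses a tiny placeholder dataset; a real probe would train and evaluate on a held-out annotated corpus rather than a handful of hand-written sentences.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy probing examples: (sentence, word index, label). A real probe would use
# an annotated corpus (e.g. POS-tagged sentences) and a held-out test split.
examples = [
    ("The cat sat on the mat", 1, "NOUN"),
    ("The cat sat on the mat", 2, "VERB"),
    ("A dog chased the ball", 1, "NOUN"),
    ("A dog chased the ball", 2, "VERB"),
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def hidden_states_for(sentence):
    """Return the tuple of per-layer hidden states, each of shape (1, seq_len, dim)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.hidden_states  # embedding output + one entry per transformer layer

num_layers = model.config.num_hidden_layers
for layer in range(1, num_layers + 1):
    features, labels = [], []
    for sentence, word_idx, label in examples:
        states = hidden_states_for(sentence)[layer][0]
        # +1 skips [CLS]; this toy assumes each word is a single subword token.
        features.append(states[word_idx + 1].numpy())
        labels.append(label)
    probe = LogisticRegression(max_iter=1000).fit(features, labels)
    acc = accuracy_score(labels, probe.predict(features))
    print(f"layer {layer:2d}: train accuracy {acc:.2f}")
```

Training one simple probe per layer and comparing held-out accuracies is how the layer-wise specialization described above is typically measured: the layer where a probe peaks indicates where that property is most linearly accessible.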

This pattern holds across model sizes and architectures, though the exact layer boundaries shift. GPT models show similar stratification, with early layers handling token prediction mechanics and later layers encoding increasingly abstract patterns. Notably, the best layer for a specific task depends on that task's abstraction level. Syntax probes peak in middle layers. Knowledge extraction works best from upper layers.

What drives this organization? The training objective never explicitly requests hierarchical processing. Instead, the hierarchy emerges because it's computationally efficient. Building complex representations from simpler ones requires fewer parameters than learning everything from scratch at each layer. The architecture creates pressure toward compositional structure.

Takeaway

When fine-tuning or extracting features from transformers, select layers based on your task's abstraction level—early layers for surface patterns, middle for structure, late for meaning and knowledge.

Attention Patterns Evolve from Local to Global

Attention heads provide another window into hierarchical processing. Their behavior changes systematically across depth. Early layers attend locally, focusing on nearby tokens and positional relationships. Heads in these layers often learn fixed patterns: attending to the previous token, the next token, or tokens at consistent relative positions.

As depth increases, attention becomes increasingly content-dependent and long-range. Middle-layer heads track syntactic dependencies that span many tokens—subjects finding their verbs, pronouns locating antecedents. These heads implement what we might call structural attention, following grammatical relationships regardless of distance.

The deepest layers show the most semantic attention patterns. Heads here connect concepts rather than positions or structures. They might link a person's name to their profession mentioned paragraphs earlier, or connect related entities across a document. Some heads in these layers appear to implement basic reasoning—aggregating information from multiple sources to update representations.
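A rough way to quantify this local-to-global progression is to compute a mean attention distance per layer: for each query token, the average number of positions to the tokens it attends to, weighted by attention probability. The sketch below assumes a Hugging Face BERT-style model and is a coarse diagnostic (special tokens like [CLS] and [SEP] are included), not a published metric from any specific paper.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

text = "The scientist who published the landmark paper in 2017 later won a major award."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

seq_len = inputs["input_ids"].shape[1]
positions = torch.arange(seq_len, dtype=torch.float)
# |i - j| distance matrix between every query position i and key position j.
distance = (positions[:, None] - positions[None, :]).abs()

for layer_idx, attn in enumerate(outputs.attentions, start=1):
    # attn: (batch, heads, query, key) attention probabilities for this layer.
    weights = attn[0]                      # drop the batch dimension
    mean_dist = (weights * distance).sum(dim=-1).mean().item()
    print(f"layer {layer_idx:2d}: mean attention distance {mean_dist:5.2f} tokens")
```

If the progression described above holds for the model you inspect, the printed distances should generally grow with depth as attention shifts from local positional patterns toward content-dependent, long-range connections.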

This local-to-global progression isn't coincidental. Each layer can only refine what previous layers provide. Building long-range semantic connections requires first establishing local context, then structural relationships, then meaning. The architecture forces a particular order of operations, and attention patterns reflect this constraint. Models trained on different objectives show similar progressions, suggesting this organization is fundamental to how transformers process sequential information.

Takeaway

When analyzing model behavior, examine attention patterns across multiple layers—looking only at final-layer attention misses the foundational local and structural processing that enables global semantic connections.

The Residual Stream as Iterative Refinement

A powerful interpretive framework views the transformer's residual connections not as skip paths, but as a shared communication channel. Each layer reads from this residual stream, performs computation, and writes its contribution back. The final representation accumulates all layers' contributions rather than replacing earlier work.

This perspective explains hierarchical learning elegantly. Early layers write basic features into the stream. Later layers read these features, combine them, and add higher-level abstractions. Nothing gets lost—the surface-level information from layer 1 remains accessible even at layer 24. Each layer adds a refinement rather than a replacement.
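The read/write picture is easy to see in code. Below is a schematic pre-norm transformer block in PyTorch, written to emphasize that attention and the MLP each add their output to a running residual stream rather than overwrite it; this is an illustrative sketch, not the implementation of any particular model.

```python
import torch
import torch.nn as nn

class ResidualStreamBlock(nn.Module):
    """One pre-norm transformer block, phrased as reads/writes on a residual stream."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln_mlp = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, stream: torch.Tensor) -> torch.Tensor:
        # Read a normalized view of the stream, compute, write the result back.
        read = self.ln_attn(stream)
        attn_out, _ = self.attn(read, read, read, need_weights=False)
        stream = stream + attn_out          # write: attention's contribution

        read = self.ln_mlp(stream)
        stream = stream + self.mlp(read)    # write: MLP's contribution
        return stream                       # earlier layers' writes remain in the sum
```

Stacking such blocks means the final representation is literally the token embedding plus the sum of every block's writes, which is the precise sense in which layer 1's information remains accessible at layer 24.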

This framework is consistent with several empirical observations. Removing middle layers degrades performance less than removing early or late layers, because the residual stream preserves essential information even when some processing steps are skipped. Layer outputs also become increasingly aligned with the final output as depth increases, consistent with iterative refinement toward a target representation.
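One way to check the alignment claim on your own model, sketched below under the assumption of a Hugging Face checkpoint with hidden states enabled, is to measure the cosine similarity between each layer's token representations and the final layer's. (A logit-lens-style analysis, which instead projects intermediate states through the output head, gives a related view.)

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("Transformers refine representations layer by layer.",
                   return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).hidden_states   # embedding output + one tensor per layer

final = hidden[-1]
for layer_idx, states in enumerate(hidden[1:], start=1):
    # Mean cosine similarity between this layer's token states and the final ones.
    sim = F.cosine_similarity(states, final, dim=-1).mean().item()
    print(f"layer {layer_idx:2d}: similarity to final representation {sim:.3f}")
```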

For practitioners, this framework suggests specific optimization strategies. Knowledge distillation can proceed layer by layer because each layer's contribution is somewhat independent. Pruning can target layers whose writes to the residual stream are small in magnitude. And because all layers share a common representational space, techniques like layer-wise learning rates and progressive training can be designed to respect the hierarchical structure.
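As one example, layer-wise learning rates fall out naturally from optimizer parameter groups. The sketch below decays the learning rate for earlier layers of a BERT-style encoder; the base rate, decay factor, and the encoder.layer.N parameter-name pattern are illustrative assumptions and should be adapted to the model at hand.

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
num_layers = model.config.num_hidden_layers
base_lr, decay = 2e-5, 0.9   # illustrative values, tune for your task

param_groups = []
for name, param in model.named_parameters():
    if "encoder.layer." in name:
        # BERT-style parameter names look like "encoder.layer.7.attention....".
        layer_idx = int(name.split("encoder.layer.")[1].split(".")[0])
    else:
        layer_idx = 0  # embeddings and other parameters get the most-decayed rate
    # Deeper layers get rates closer to base_lr; earlier layers are decayed more.
    lr = base_lr * (decay ** (num_layers - layer_idx))
    param_groups.append({"params": [param], "lr": lr})

optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)
```

Earlier layers, which hold the more general surface and structural features, then receive smaller updates during fine-tuning, while later layers adapt more aggressively to the downstream task.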

Takeaway

Think of transformer depth as iterative refinement on a shared representation rather than a pipeline of distinct stages—this mental model explains why techniques like progressive layer training and residual scaling improve training dynamics.

Transformers learn hierarchical representations not because we program them to, but because hierarchy is the efficient solution to the problem they're solving. Stacking layers creates computational pressure toward compositionality—building complex patterns from simpler ones.

The evidence converges from multiple directions: probing classifiers, attention visualization, and residual stream analysis all reveal the same structure. Surface features give way to syntax, then semantics, then world knowledge.

This understanding has immediate practical applications. Choose extraction layers based on task abstraction. Diagnose model failures by checking where the hierarchy breaks down. Design architectures that respect the natural progression from local to global processing. The hierarchy isn't a curiosity—it's a fundamental property to leverage.