Why Feed-Forward Networks in Transformers Store Knowledge

Feed-forward layers in transformers contain most of the model's parameters but were long treated as generic computation.

Recent research reveals they function as key-value memories, where the first projection matches patterns and the second injects associated knowledge.

Specific facts can be traced to small clusters of neurons in particular layers, especially mid-network FFNs.

This localization enables direct weight editing techniques like ROME and MEMIT that update facts without retraining.

Understanding FFNs as memory stores reframes scaling, interpretability, and the future of model maintenance.

When we think about what makes a transformer model intelligent, attention usually steals the spotlight. The self-attention mechanism is elegant, well-studied, and visually intuitive. But it accounts for only about a third of the weights in a standard transformer block.

The majority of weights live inside the feed-forward networks (FFNs), the dense MLP layers that follow each attention sublayer. With the common choice of an inner width four times the hidden size, they hold roughly two-thirds of each block's weights. For years, these layers were treated as generic function approximators, useful but unremarkable. Recent research tells a different story.
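A back-of-the-envelope count makes the split concrete. This is a rough sketch, assuming a standard block with four square attention projections and an FFN whose inner width is `ffn_mult` times the hidden size; biases, layer norms, and embeddings are ignored, and the function name and sizes are illustrative only.

```python
def block_param_counts(d_model: int, ffn_mult: int = 4) -> dict:
    """Weight counts for one transformer block, ignoring biases and layer norms."""
    attention = 4 * d_model * d_model          # W_q, W_k, W_v, W_o
    ffn = 2 * d_model * (ffn_mult * d_model)   # up-projection + down-projection
    return {
        "attention": attention,
        "ffn": ffn,
        "ffn_share": ffn / (attention + ffn),
    }

counts = block_param_counts(d_model=1024)
print(counts)
```

Under these assumptions the FFN share is 8/12, or two-thirds, whatever the hidden size.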

Evidence now suggests that FFN layers function as associative memory stores, encoding factual knowledge in their parameters using a key-value structure. This reframes how we think about model capacity, interpretability, and the possibility of editing facts without costly retraining.

Key-Value Memory Interpretation

The seminal work by Geva et al. (2021) reframed transformer FFN layers as neural memory networks. The two linear projections that make up an FFN—an up-projection followed by a down-projection separated by a nonlinearity—mirror the mathematical structure of key-value memory retrieval.

Consider the mechanics. The first weight matrix acts as a set of keys: each row computes a dot product with the input hidden state, producing an activation pattern. After the nonlinearity, this pattern weights the rows of the second matrix, which serve as values. The output is effectively a weighted sum of stored value vectors, retrieved by similarity to learned key patterns.
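The mechanics above can be written out in a few lines. This is a minimal NumPy sketch, assuming a ReLU nonlinearity, no biases, and random toy weights; the names `K`, `V`, and `ffn` are chosen for illustration and do not come from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32          # toy sizes, chosen only for illustration

K = rng.normal(size=(d_ff, d_model))   # keys: one row per memory slot (up-projection)
V = rng.normal(size=(d_ff, d_model))   # values: one row per memory slot (down-projection)

def ffn(x: np.ndarray) -> np.ndarray:
    """FFN(x) = relu(x K^T) V, with biases omitted."""
    coeffs = np.maximum(x @ K.T, 0.0)   # how strongly each key matches x
    return coeffs @ V                   # weighted sum of value rows

x = rng.normal(size=d_model)

# The same output, written explicitly as memory retrieval:
# one coefficient per key, then a weighted sum over the value rows.
coeffs = np.maximum(K @ x, 0.0)
retrieved = sum(c * v for c, v in zip(coeffs, V))

print(np.allclose(ffn(x), retrieved))
```

The two computations agree because the matrix product in `ffn` is exactly the weighted sum over value rows; keys with a negative dot product are zeroed by the ReLU and contribute nothing.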

Empirically, individual keys correspond to interpretable input patterns. Lower-layer keys tend to fire on shallow lexical patterns—specific n-grams or syntactic templates. Upper-layer keys activate on semantic abstractions: sentences about geography, expressions of negation, descriptions of professions.

The corresponding values bias the output distribution toward tokens consistent with the matched pattern, most clearly in the upper layers. In an idealized example, a key that detects "the capital of France is" pairs with a value that boosts the probability of "Paris." This is more than a metaphor: the effect can be measured by projecting value vectors onto the vocabulary, and similar analyses have been reported for models like GPT-2 and LLaMA.
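To see how a value vector can bias the output distribution, here is a toy sketch. Everything in it is an assumption made for illustration: a five-word vocabulary, a random unit-norm unembedding matrix `E`, and a hand-built value vector that points at the "Paris" row. A trained model learns such vectors; nothing here is extracted from one.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Paris", "London", "Berlin", "banana", "the"]   # toy vocabulary
d_model = 16
E = rng.normal(size=(len(vocab), d_model))               # hypothetical unembedding
E /= np.linalg.norm(E, axis=1, keepdims=True)            # unit-norm rows

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

paris = vocab.index("Paris")
h = rng.normal(size=d_model)   # residual stream before the FFN writes to it
value = E[paris].copy()        # hand-built value vector aligned with "Paris"
coeff = 3.0                    # activation of the matching key

# Projecting the value onto the vocabulary shows which token it promotes.
promoted = vocab[int(np.argmax(E @ value))]

before = softmax(E @ h)[paris]
after = softmax(E @ (h + coeff * value))[paris]
print(promoted, round(before, 3), round(after, 3))
```

Because the value is added to the residual stream, its effect on the logits is simply `E @ value` scaled by the key's activation, which is why projecting value vectors onto the vocabulary is a useful diagnostic.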

Takeaway
Feed-forward layers are not opaque transformations. They are content-addressable memories where keys recognize patterns and values inject corresponding knowledge into the residual stream.

Knowledge Neuron Discovery

If FFNs store facts, can we locate where a specific fact lives? The Knowledge Neurons framework introduced by Dai et al. (2022) provides a method. Using integrated gradients, researchers attribute a model's prediction for a factual prompt to individual neurons in the FFN layers, ranking each by its contribution.
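The attribution step can be sketched on a toy model. The code below is an illustration under stated assumptions, not the authors' implementation: the weights are random, the "model" is a single FFN value matrix followed by a hypothetical unembedding `E`, and gradients are taken by finite differences where a real implementation would use automatic differentiation. Each neuron's activation is scaled from zero to its recorded value while the gradient of the answer probability is accumulated.

```python
import numpy as np

rng = np.random.default_rng(0)
d_ff, d_model, vocab_size = 6, 4, 5
V = rng.normal(size=(d_ff, d_model))        # FFN value rows (down-projection)
E = rng.normal(size=(vocab_size, d_model))  # hypothetical unembedding
acts = np.abs(rng.normal(size=d_ff))        # recorded activations for one prompt
target = 2                                  # index of the correct answer token

def prob_target(a):
    """Probability of the answer token given FFN activations a."""
    logits = E @ (a @ V)
    p = np.exp(logits - logits.max())
    return (p / p.sum())[target]

def grad_wrt_neuron(a, i, eps=1e-5):
    """Central finite difference of prob_target with respect to activation i."""
    up, down = a.copy(), a.copy()
    up[i] += eps
    down[i] -= eps
    return (prob_target(up) - prob_target(down)) / (2 * eps)

def integrated_gradients(a, i, steps=256):
    """Midpoint Riemann sum: scale neuron i from 0 up to its recorded value."""
    total = 0.0
    for k in range(steps):
        alpha = (k + 0.5) / steps
        scaled = a.copy()
        scaled[i] = alpha * a[i]
        total += grad_wrt_neuron(scaled, i)
    return a[i] * total / steps

attributions = np.array([integrated_gradients(acts, i) for i in range(d_ff)])
ranking = np.argsort(-np.abs(attributions))   # most influential neurons first
print(ranking, attributions.round(4))
```

A useful sanity check is that each attribution approximately equals the drop in answer probability when that neuron alone is zeroed, since the integral runs along a single-neuron path.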

The results are striking. For a given fact—say, "Dante was born in Florence"—a small set of neurons, often fewer than a dozen, accounts for the bulk of the prediction. Suppress their activations and the model loses the fact. Amplify them and the model becomes more confident, even with paraphrased prompts.
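The suppress-and-amplify experiment can be mimicked on a toy model. In this sketch the "knowledge neurons" are planted by hand (indices 2 and 7, chosen arbitrarily) so that their value rows point at the answer token; in a real study they would be found by attribution. The unembedding `E`, the sizes, and the helper `intervene` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d_ff, d_model, vocab_size, target = 10, 8, 6, 3

E = rng.normal(size=(vocab_size, d_model))
E /= np.linalg.norm(E, axis=1, keepdims=True)   # unit-norm unembedding rows

V = 0.1 * rng.normal(size=(d_ff, d_model))      # mostly uninformative value rows
knowledge_neurons = [2, 7]                      # planted by hand for this toy
V[knowledge_neurons] = 2.0 * E[target]          # their values point at the answer
acts = np.abs(rng.normal(size=d_ff)) + 0.5      # positive activations for one prompt

def prob_target(a):
    """Probability of the answer token given FFN activations a."""
    logits = E @ (a @ V)
    p = np.exp(logits - logits.max())
    return (p / p.sum())[target]

def intervene(a, neurons, scale):
    """Return a copy of the activations with the chosen neurons rescaled."""
    edited = a.copy()
    edited[neurons] *= scale
    return edited

baseline = prob_target(acts)
suppressed = prob_target(intervene(acts, knowledge_neurons, 0.0))
amplified = prob_target(intervene(acts, knowledge_neurons, 2.0))
print(f"baseline={baseline:.3f} suppressed={suppressed:.3f} amplified={amplified:.3f}")
```

Zeroing the planted neurons lowers the answer probability and doubling them raises it, which is the qualitative pattern the knowledge-neuron experiments look for.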

Subsequent work on ROME (Rank-One Model Editing) by Meng et al. (2022) localized factual associations even more precisely. Using causal tracing, which corrupts the subject tokens and then restores individual hidden states to see which ones recover the correct answer, it identified mid-layer FFNs in autoregressive transformers as a primary site of subject-attribute knowledge. The model retrieves these associations at the last subject token, where the FFN injects the relevant value vector.

Importantly, knowledge is distributed but localizable. A fact isn't stored in a single neuron, nor is it smeared uniformly across the network. It lives in a concentrated subspace within specific layers, which is precisely what makes targeted intervention feasible.

Takeaway
Facts in neural networks have addresses. They aren't perfectly point-like, but they are concentrated enough that causal interventions can isolate and manipulate them.

Editing and Updating Implications

Once you can locate a fact, you can change it. Techniques like ROME and MEMIT exploit the key-value structure directly. To update "the President of the United States is X" to "the President of the United States is Y," ROME computes a rank-one update to the down-projection matrix of a single mid-layer FFN. MEMIT extends the idea to thousands of facts at once by spreading the update across several consecutive layers.

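The math is elegant. Treat the down-projection as a linear associative memory that maps keys to values. To insert a new association, first find the value vector that makes the model predict the new object; ROME obtains it with a short gradient-based optimization over that single vector, not over the weights. Then derive the smallest weight perturbation, in a least-squares sense over previously seen keys, that maps the subject's key vector to the desired value. That second step has a closed-form solution, so the weights themselves are never trained by gradient descent.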
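The weight update itself fits in a few lines. This is a sketch of a ROME-style rank-one edit on random toy data, not the reference implementation: `C` stands in for the key covariance that would be estimated from real text, and `k_star` and `v_star` are random placeholders for the subject's key and the target value.

```python
import numpy as np

rng = np.random.default_rng(0)
d_ff, d_model, n_keys = 32, 8, 500          # toy sizes

W = rng.normal(size=(d_model, d_ff))        # down-projection: value = W @ key
K = rng.normal(size=(d_ff, n_keys))         # sample of keys the layer has seen
C = K @ K.T / n_keys                        # uncentered key covariance estimate

k_star = rng.normal(size=d_ff)              # key for the subject being edited
v_star = rng.normal(size=d_model)           # desired value (random placeholder)

def rank_one_edit(W, C, k_star, v_star):
    """Return W + lam (C^-1 k*)^T, chosen so the edited matrix maps k* to v*."""
    c_inv_k = np.linalg.solve(C, k_star)    # C^-1 k*, without forming the inverse
    residual = v_star - W @ k_star          # what the current memory gets wrong
    lam = residual / (c_inv_k @ k_star)
    return W + np.outer(lam, c_inv_k)

W_new = rank_one_edit(W, C, k_star, v_star)

print(np.allclose(W_new @ k_star, v_star))  # the new association is stored
print(np.linalg.matrix_rank(W_new - W))     # rank of the change to W
```

Weighting the update direction by the inverse covariance is what keeps the change small on keys the layer has already seen; with `C` set to the identity, the formula reduces to the minimum-norm update along `k_star`.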

The practical implications are significant. Full fine-tuning on a single fact is wasteful and risks catastrophic forgetting. A ROME-style edit changes one weight matrix, and the change itself is fully described by two vectors, so it amounts to kilobytes of new information, not gigabytes of retrained weights. It completes in seconds and, when done carefully, generalizes to paraphrases while leaving unrelated knowledge largely intact.

Limitations remain. Edits can fail to propagate through multi-hop reasoning, and large-scale editing can degrade model coherence. But the trajectory is clear: model maintenance is moving from costly retraining toward surgical knowledge operations, made possible by understanding what the FFN actually is.

Takeaway
Mechanistic understanding unlocks mechanistic control. When we know how knowledge is encoded, we can edit it as deliberately as patching a database record.

The shift from viewing FFN layers as generic computation to seeing them as structured memory stores is more than an interpretability curiosity. It changes how we reason about model capacity, scaling, and lifecycle management.

Parameter count matters in part because parameters are storage. Scaling looks different if a substantial share of a model's growth goes to factual recall and not only to reasoning machinery, although how capacity actually divides between the two remains an open question.

Treat the transformer not as a monolithic function, but as an architecture with distinct components serving distinct purposes: attention for routing, FFNs for retrieval. Designing better systems begins with that distinction.