For nearly a decade, recurrent neural networks were the default architecture for anything involving sequences—language, time series, speech. They processed tokens one at a time, maintaining a hidden state that theoretically carried forward everything the model had seen. In practice, that theory broke down badly.

The progression from vanilla RNNs to LSTMs to GRUs to transformers wasn't a smooth upgrade path. It was a series of increasingly creative workarounds for a fundamental architectural flaw: the assumption that sequential data must be processed sequentially. Each innovation peeled back another layer of that assumption until the transformer discarded it entirely.

Understanding this evolution isn't just history. The design pressures that drove each transition—parallelism, long-range dependencies, gradient stability—are the same pressures that shape architectural decisions today. Knowing why recurrence failed tells you something important about what makes attention work.

RNN Limitations: The Sequential Bottleneck

The vanilla RNN has an elegant premise: compress the entire history of a sequence into a fixed-size hidden state vector, update it at each timestep, and use it for prediction. The update rule is simple—apply a weight matrix to the previous hidden state, combine it with the current input, and pass through a nonlinearity. This recurrence is what gives the architecture its name and its fatal weakness.
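That update rule can be sketched in a few lines. This is a minimal NumPy illustration, not a trainable implementation — the dimensions and weight scales are arbitrary choices for the demo:

```python
import numpy as np

def rnn_step(h_prev, x, W_h, W_x, b):
    """One vanilla RNN update: transform the previous hidden state,
    combine it with the current input, pass through a nonlinearity."""
    return np.tanh(W_h @ h_prev + W_x @ x + b)

rng = np.random.default_rng(0)
hidden, inp = 8, 4                     # toy sizes, chosen for the demo
W_h = rng.normal(scale=0.5, size=(hidden, hidden))
W_x = rng.normal(scale=0.5, size=(hidden, inp))
b = np.zeros(hidden)

h = np.zeros(hidden)
for x in rng.normal(size=(10, inp)):   # 10 timesteps, strictly one after another
    h = rnn_step(h, x, W_h, W_x, b)
# the entire 10-step history is now compressed into this one vector
```

Note that the loop cannot be parallelized: each `rnn_step` consumes the output of the previous one, which is exactly the recurrence the rest of this section is about.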

The first bottleneck is computational. Because each timestep depends on the output of the previous one, you cannot parallelize across sequence positions. Processing a 512-token sentence requires 512 sequential matrix multiplications. On modern GPUs designed for massive parallelism, this is an architectural mismatch that no amount of hardware can fully compensate for.

The second bottleneck is informational. That fixed-size hidden state vector must encode everything relevant from the entire preceding sequence. For short sequences, this works. For a paragraph or a document, you're asking a 256- or 512-dimensional vector to carry hundreds of distinct facts, syntactic relationships, and contextual cues. Information gets overwritten. The network forgets.

The third bottleneck is gradient-based. Training RNNs requires backpropagation through time—unrolling the recurrence and computing gradients across every timestep. Gradients must flow backward through the same weight matrix applied repeatedly. If the largest singular value of that matrix exceeds one, gradients explode. If it's below one, they vanish. In practice, vanilla RNNs struggle to learn dependencies beyond roughly 10–20 timesteps. This isn't a tuning problem. It's a structural property of repeated matrix multiplication.

Takeaway

When your architecture forces all information through a single fixed-size bottleneck at every step, information loss isn't a bug—it's an inevitability. The capacity of the channel, not the capacity of the model, becomes the binding constraint.

LSTM and GRU Patches: Gating the Decay

The Long Short-Term Memory network, introduced by Hochreiter and Schmidhuber in 1997, attacked the vanishing gradient problem directly. Its key insight was architectural: add a cell state—a separate memory channel that runs parallel to the hidden state, governed by learned gates. The forget gate decides what to erase. The input gate decides what to write. The output gate decides what to expose. Crucially, the cell state passes through the network via addition, not repeated multiplication, which gives gradients a more stable path backward through time.
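The gate structure can be made concrete with a minimal sketch of a single LSTM step. This follows the standard formulation (one stacked weight matrix producing all four pre-activations); sizes and initialization are illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x, W, b):
    """One LSTM step. W maps [h_prev; x] to four stacked pre-activations."""
    z = W @ np.concatenate([h_prev, x]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget, input, output gates
    c = f * c_prev + i * np.tanh(g)               # additive cell-state update
    h = o * np.tanh(c)                            # gated exposure of the cell
    return h, c

rng = np.random.default_rng(0)
hidden, inp = 8, 4
W = rng.normal(scale=0.3, size=(4 * hidden, hidden + inp))
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(10, inp)):
    h, c = lstm_step(h, c, x, W, b)
```

The line to notice is the cell update: `c = f * c_prev + i * np.tanh(g)`. The previous cell state enters through elementwise multiplication by a gate and addition — not through the weight matrix — which is the stable gradient path the text describes.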

This worked remarkably well. LSTMs could learn dependencies across 100 or even 1,000 timesteps in controlled experiments. They became the backbone of machine translation, speech recognition, and language modeling for years. The Gated Recurrent Unit simplified the design by merging the cell state and hidden state and reducing three gates to two, achieving comparable performance with fewer parameters.
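For comparison, a GRU step in the same sketch style shows the simplification — two gates, no separate cell state, and the new hidden state formed by interpolating between old and candidate values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x, W_zr, W_h, b_zr, b_h):
    """One GRU step: update and reset gates, no separate cell state."""
    zr = sigmoid(W_zr @ np.concatenate([h_prev, x]) + b_zr)
    z, r = np.split(zr, 2)                           # update and reset gates
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x]) + b_h)
    return (1 - z) * h_prev + z * h_cand             # interpolate old and new

rng = np.random.default_rng(0)
hidden, inp = 8, 4
W_zr = rng.normal(scale=0.3, size=(2 * hidden, hidden + inp))
W_h = rng.normal(scale=0.3, size=(hidden, hidden + inp))

h = np.zeros(hidden)
for x in rng.normal(size=(10, inp)):
    h = gru_step(h, x, W_zr, W_h, np.zeros(2 * hidden), np.zeros(hidden))
```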

But gating mechanisms are patches, not solutions. They mitigate gradient vanishing—they don't eliminate the sequential computation bottleneck. An LSTM processing a 1,000-token document still requires 1,000 serial steps. They also don't fundamentally solve the information bottleneck. The cell state is larger and better managed, but it's still a fixed-size vector that must carry everything. For very long sequences, LSTMs still forget—they just forget more gracefully.

There's a subtler limitation. Because LSTMs process left to right, a token's representation is shaped primarily by what came before it, not what comes after. Bidirectional LSTMs address this by running two passes, but they double computation and still don't allow arbitrary pairwise interactions between positions. Every token sees the sequence through the lens of its neighbors, not the whole document. The architecture enforces a locality bias that often mismatches the structure of language.

Takeaway

Gating mechanisms taught us that controlling information flow matters more than increasing capacity. But adding better valves to a pipeline doesn't change the fact that it's still a pipeline—sequential, unidirectional, and fundamentally constrained by its topology.

Attention Liberation: Removing Recurrence Entirely

The 2017 paper Attention Is All You Need didn't introduce attention—Bahdanau's additive attention had been augmenting LSTMs since 2014. What it did was far more radical: it removed recurrence entirely and made attention the sole mechanism for relating positions in a sequence. Every token attends to every other token directly, in parallel, through learned query-key-value projections. There is no hidden state passed from one position to the next. There is no sequential dependency at all.
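The query-key-value mechanism fits in a few lines. This is a bare single-head sketch of scaled dot-product self-attention — no masking, batching, or learned positional information, and the dimensions are arbitrary:

```python
import numpy as np

def attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a (seq, d) input matrix."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (seq, seq) pairwise scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)              # softmax over key positions
    return w @ V                                    # no recurrence anywhere

rng = np.random.default_rng(0)
seq, d = 6, 8
X = rng.normal(size=(seq, d))
W_q, W_k, W_v = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(3))
out = attention(X, W_q, W_k, W_v)
```

Everything here is a dense matrix product over the whole sequence at once: there is no loop over positions, which is the entire point.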

This single decision resolved all three RNN bottlenecks simultaneously. The computation bottleneck disappears because all positions are processed in parallel—the number of sequential steps is constant regardless of sequence length, so a 512-token sequence needs the same few large matrix multiplications as a short one, limited only by memory. The information bottleneck disappears because no fixed-size vector must carry the entire history; instead, each token constructs its own representation by attending to whatever other tokens are relevant. The gradient bottleneck disappears because the path from any token to any other is a single attention step, not a chain of recurrent multiplications.

Multi-head attention adds another critical dimension. Rather than computing a single attention pattern, the transformer runs multiple attention heads in parallel, each with its own learned projection. Different heads can specialize—one tracking syntactic structure, another tracking coreference, another tracking semantic similarity. This is architecturally equivalent to giving the model multiple independent communication channels between positions, each tuned to different types of relationships.
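The multi-head variant splits the model dimension across heads, attends independently in each, then concatenates and projects. A minimal sketch, again with illustrative sizes:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Multi-head self-attention: each head attends with its own slice."""
    seq, d = X.shape
    d_head = d // n_heads

    def split(M):  # (seq, d) -> (n_heads, seq, d_head)
        return M.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # per-head scores
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                         # per-head softmax
    heads = w @ V                                         # (n_heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq, d)     # rejoin the heads
    return concat @ W_o                                   # mix heads together

rng = np.random.default_rng(0)
seq, d, n_heads = 6, 8, 2
X = rng.normal(size=(seq, d))
W_q, W_k, W_v, W_o = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
```

Each head computes its own (seq, seq) attention pattern, which is the "multiple independent communication channels" picture made literal.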

The cost is real: self-attention scales quadratically with sequence length in both time and memory, compared to the linear scaling of RNNs. For very long sequences—tens of thousands of tokens—this becomes prohibitive without modifications like sparse attention or linear attention approximations. But for the sequence lengths that matter most in practice, the parallelism advantage overwhelms the quadratic cost. A transformer processes a 512-token sequence on a modern GPU in a fraction of the time an LSTM would take, because it converts a 512-step serial computation into a handful of massively parallel matrix multiplications.
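The quadratic memory cost is easy to see with back-of-envelope arithmetic — each head materializes an n × n score matrix, at 4 bytes per float32 entry, ignoring all other activations:

```python
# Memory for a single attention score matrix at various sequence lengths.
# 512 tokens -> 1 MiB; 32,768 tokens -> 4 GiB, per head, per layer.
for n in (512, 4_096, 32_768):
    bytes_per_matrix = n * n * 4            # n^2 float32 scores
    print(f"{n:>6} tokens: {bytes_per_matrix / 2**20:8.1f} MiB per head")
```

Multiply by heads and layers and the 32k-token case is clearly untenable without sparse or approximate attention — while the 512-token case is trivial, which is why the trade-off favored transformers at the sequence lengths that dominated practice.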

Takeaway

The transformer's real innovation wasn't attention—it was the architectural courage to remove recurrence entirely and let every position communicate with every other position directly. Sometimes the biggest breakthroughs come not from adding new components, but from removing assumptions you didn't realize were optional.

The path from RNN to transformer is a case study in how architectural constraints shape—and limit—what a model can learn. Recurrence imposed seriality, bottlenecked information, and destabilized gradients. Gating mechanisms managed the symptoms. Attention eliminated the cause.

The lesson extends beyond sequence modeling. When a system underperforms, the instinct is to add complexity—more parameters, more layers, more tricks. Sometimes the right move is to remove the structural assumption that's forcing the problem in the first place.

Every architecture encodes assumptions about its data. The transformer's assumption—that any token might need to attend to any other—turned out to be a far better match for language than sequential recurrence. Choosing the right inductive bias remains the most consequential design decision in deep learning.