Before a neural network sees a single word, something else happens first. Text gets chopped into pieces—tokens—that the model actually processes. This seemingly mundane preprocessing step shapes everything that follows.
Most practitioners treat tokenization as a solved problem. Pick BPE, set vocabulary size to 32,000, move on. But this casual approach hides consequential trade-offs. The way you slice text determines how long sequences become, how many parameters your embeddings consume, and whether your model treats all languages fairly.
Tokenization choices made years ago still echo through today's largest models. GPT-4, Claude, and Gemini all inherit architectural constraints from their tokenizers. Understanding these constraints reveals why some tasks remain surprisingly difficult and why multilingual AI often fails speakers of non-European languages.
Subword Segmentation Trade-offs
Three algorithms dominate modern tokenization: Byte Pair Encoding (BPE), WordPiece, and Unigram. They share a goal—split text into subword units—but pursue it through fundamentally different optimization objectives.
BPE works bottom-up. Start with characters, repeatedly merge the most frequent adjacent pairs. This greedy approach optimizes for compression on the training corpus. The result: common words stay whole while rare words fragment into pieces. "tokenization" might become one token; "defenestration" becomes four.
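The greedy merge loop can be sketched in a few lines. This is a toy illustration of the algorithm on a tiny corpus, not a production tokenizer; the corpus and merge count are made up for demonstration.

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the weighted vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(bpe_merges(["low", "low", "lower", "newest", "newest", "widest"], 3))
```

Each learned merge becomes a vocabulary entry; frequent substrings fuse early, so common words end up whole while rare ones stay fragmented.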
WordPiece, developed for BERT, takes a likelihood-based approach. Instead of raw frequency, it merges pairs that maximize the probability of the training data under a language model. This subtle shift means WordPiece sometimes makes different choices than BPE on the same corpus—preferring merges that create more predictive units rather than just more common ones.
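The difference shows up in the merge score. A common formulation of WordPiece's criterion normalizes pair frequency by the frequencies of its parts; the sketch below uses that rule on an invented weighted vocabulary and is not Google's implementation.

```python
from collections import Counter

def wordpiece_scores(vocab):
    """Score candidate merges: pair frequency / (freq of left * freq of right).

    `vocab` maps tuples of symbols to word frequencies, as in the BPE sketch.
    """
    unit_freq, pair_freq = Counter(), Counter()
    for symbols, freq in vocab.items():
        for s in symbols:
            unit_freq[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq
    return {pair: f / (unit_freq[pair[0]] * unit_freq[pair[1]])
            for pair, f in pair_freq.items()}

# ('a','a') occurs more often, but ('a','b') scores higher because 'b' is rare:
scores = wordpiece_scores({('a', 'a'): 4, ('a', 'b'): 2})
print(scores)
```

Where raw-frequency BPE would merge the most common pair, this scoring favors pairs whose parts rarely appear apart, i.e. units that are predictive of each other.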
Unigram flips the script entirely. It starts with a large vocabulary and removes tokens, keeping those that minimize the loss of representing the training data. This top-down approach naturally handles ambiguous segmentations. When "going" could be "go" + "ing" or "goi" + "ng", Unigram assigns probabilities to each path. The result is often more linguistically coherent boundaries, which matters for morphologically rich languages like Turkish or Finnish.
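Choosing among competing segmentations is a shortest-path problem. The sketch below runs a Viterbi search over a unigram token model; the token log-probabilities are made-up illustrative values, not learned ones.

```python
import math

def best_segmentation(word, logprobs):
    """Viterbi search for the highest-probability segmentation of `word`
    under a unigram model, where segment score = sum of token log-probs."""
    n = len(word)
    # best[i] = (best log-prob of word[:i], start index of the last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logprobs:
                score = best[start][0] + logprobs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrack from the end to recover the winning pieces.
    pieces, pos = [], n
    while pos > 0:
        _, start = best[pos]
        pieces.append(word[start:pos])
        pos = start
    return pieces[::-1]

# Illustrative log-probs: "go" + "ing" (-4.5) beats "goi" + "ng" (-9.0).
logprobs = {"go": -2.0, "ing": -2.5, "goi": -6.0, "ng": -3.0, "going": -9.0}
print(best_segmentation("going", logprobs))
```

Because every candidate path gets a probability, training can also marginalize over segmentations rather than committing to one, which is where Unigram's flexibility comes from.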
Takeaway: Your tokenization algorithm isn't just a preprocessing choice—it's an implicit prior about what linguistic units matter. BPE optimizes for compression, WordPiece for prediction, Unigram for probabilistic coverage.
Vocabulary Size Impact
Vocabulary size creates a three-way trade-off that every model architect must navigate. Larger vocabularies mean shorter sequences, smaller vocabularies mean more parameter-efficient embeddings, and both extremes sacrifice coverage of rare words.
Consider the arithmetic. A 50,000-token vocabulary with 1024-dimensional embeddings requires 51 million parameters just for the input layer. Double the vocabulary, double those parameters. For large language models, embedding layers can consume 10-20% of total parameters. That's compute and memory you're not spending on actual reasoning.
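The arithmetic is worth making concrete; the figures below use the vocabulary and embedding sizes from the example above.

```python
vocab_size = 50_000
d_model = 1024

# Input embedding table: one d_model-dimensional vector per vocabulary entry.
embed_params = vocab_size * d_model
print(f"{embed_params / 1e6:.1f}M embedding parameters")  # 51.2M

# Doubling the vocabulary doubles the table.
print(f"{2 * vocab_size * d_model / 1e6:.1f}M at 100k tokens")  # 102.4M
```

And if the output projection is untied from the input embeddings, the cost is paid twice.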
But small vocabularies exact a different cost. The word "anthropomorphization" might become eight tokens instead of two. Your 2048-token context window now holds fewer actual words. And because attention scales quadratically with sequence length, longer tokenized sequences cost more than proportionally: a 2x increase in tokens means roughly 4x more attention compute.
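That quadratic penalty is easy to verify with a back-of-the-envelope FLOP count. The cost model below is a deliberate simplification that counts only the attention score matrix, ignoring projections and feed-forward layers.

```python
def attention_score_flops(seq_len, d_model):
    """Rough cost of computing QK^T: seq_len^2 dot products of length d_model."""
    return seq_len ** 2 * d_model

# Same text, two tokenizers: one yields 1024 tokens, the other 2048.
ratio = attention_score_flops(2048, 1024) / attention_score_flops(1024, 1024)
print(ratio)  # 4.0
```

So a tokenizer that doubles sequence length quadruples this term, even though the underlying text is unchanged.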
The sweet spot depends on your task. Code models often use larger vocabularies because identifier names vary wildly and fragmenting them destroys semantic coherence. Multilingual models face pressure toward larger vocabularies to cover diverse scripts. GPT-4's ~100,000 token vocabulary reflects these pressures, while smaller models often stick to 32,000-50,000 tokens to manage parameter counts.
Takeaway: Vocabulary size isn't a hyperparameter to tune—it's an architectural commitment that trades embedding parameters against sequence length against coverage. Choose based on your deployment constraints, not convention.
Multilingual Tokenization Failures
English speakers rarely notice tokenization. Common words become single tokens; the system feels natural. But train a BPE tokenizer primarily on English text, then apply it to Korean or Thai, and the disparity becomes stark.
A simple greeting in Korean might consume five to ten times more tokens than its English equivalent. This isn't just inefficient—it's functionally inequitable. The same context window holds dramatically less content. The same API pricing charges dramatically more. The same model effectively has less capacity to reason in these languages.
The root cause is training corpus composition. BPE and WordPiece learn merges from data. If 80% of your training text is English, English patterns dominate the vocabulary. Languages with different scripts or morphological structures get tokenized at near-character level, never developing efficient representations.
Byte-level tokenizers emerged partly to address this. Models like GPT-2 and its successors operate on UTF-8 bytes, guaranteeing they can represent any text. But byte-level approaches push the problem rather than solving it—those Korean characters still require more bytes than Latin ones. True equity requires intentional multilingual vocabulary construction, like the approach used in mT5 and XLM-RoBERTa, where vocabulary slots are explicitly allocated across language families.
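The byte-level disparity is visible with nothing more than UTF-8 encoding. The snippet compares an English greeting with its Korean equivalent; a pure byte-level tokenizer starts from these byte counts before any merges are learned.

```python
english = "Hello"
korean = "안녕하세요"  # the equivalent Korean greeting, 5 characters

# Latin letters encode to 1 byte each; Hangul syllables encode to 3.
print(len(english.encode("utf-8")))  # 5 bytes for 5 characters
print(len(korean.encode("utf-8")))   # 15 bytes for 5 characters
```

A 3x gap in raw bytes before training even begins, and corpus-driven merges then widen it rather than close it unless the training mix is deliberately balanced.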
Takeaway: Tokenization creates invisible linguistic privilege. A model's effective capacity varies by language based on decisions made before training even began. Byte-level approaches guarantee coverage but not equity.
Tokenization sits at the uncomfortable intersection of engineering expedience and downstream capability. The choices feel trivial—algorithm selection, vocabulary size, training corpus—but their effects compound through every layer of the model.
The next time a model struggles with morphologically complex languages, fails on code completion for unusual identifiers, or burns context on simple text, look at the tokenizer first. The architecture you can see often gets blamed for problems the tokenizer created.
Building better AI systems requires treating tokenization as a first-class architectural decision, not an afterthought. The vocabulary is the first bottleneck. Everything else flows through it.