Tokenization: The Atomic Unit of LLMs
"If you don't understand tokenization, you don't understand why LLMs fail at simple tasks."
Tokenization is the process of converting raw text into a sequence of integer IDs that a model can process. It is the very first step in the pipeline, and it is often the real source of failures that get blamed on "hallucinations" in math, spelling, and coding.
Why Do We Need Tokenization?
Computers understand numbers, not strings. We need a way to map text to numbers.
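The crudest such mapping already exists in every programming language: interpret the UTF-8 bytes of the string as integers between 0 and 255. This is only a sketch of the idea, but it is also the starting point that byte-level tokenizers build on.

```python
text = "LLMs ❤️ tokens"

# The most naive text -> integer mapping: raw UTF-8 bytes (values 0-255)
ids = list(text.encode("utf-8"))
print(ids)                                  # [76, 76, 77, 115, 32, 226, ...]
print(bytes(ids).decode("utf-8") == text)   # True: the mapping is lossless
```

The catch is sequence length: every character costs one to four IDs, which is exactly the trade-off the next section examines.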
The Spectrum of Granularity
We could tokenize at different levels:
| Method | Vocabulary Size | Sequence Length | Pros | Cons |
|---|---|---|---|---|
| Character | Small (~100-256) | Very long | No out-of-vocabulary (OOV) issues | Context window fills up fast; individual characters carry little meaning. |
| Word | Massive (1M+) | Short | Semantically rich units | "Rare word" (OOV) problem; enormous embedding matrix. |
| Subword (BPE) | Moderate (~32k-200k) | Medium | Balances efficiency and flexibility | More complex to train and implement. |
Modern LLMs universally use Subword Tokenization (specifically BPE or variants).
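To see the trade-off concretely, here is a minimal comparison of how many units the same sentence becomes at each granularity. It uses OpenAI's tiktoken library with the cl100k_base encoding purely as an example; any subword tokenizer would show the same pattern.

```python
# pip install tiktoken
import tiktoken

text = "Tokenization is the hidden bottleneck of LLMs."

# Character-level: one ID per character -> long sequences
char_units = list(text)

# Word-level: naive whitespace split -> short sequences, but a huge vocabulary in practice
word_units = text.split()

# Subword-level: BPE via tiktoken's cl100k_base encoding
enc = tiktoken.get_encoding("cl100k_base")
subword_ids = enc.encode(text)

print(f"characters: {len(char_units)} units")   # 46 units
print(f"words:      {len(word_units)} units")   # 7 units
print(f"subwords:   {len(subword_ids)} units")  # roughly 10; exact count depends on the encoding
```

Subword tokenization lands in the middle: sequences a few times longer than word-level, with a vocabulary small enough to keep the embedding matrix manageable.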
2025 State of Tokenization
Key Developments:
- Byte-level BPE is now standard (GPT-4o, Llama 3/4) - handles all Unicode without OOV errors
- tiktoken dominance: OpenAI's tokenizer is 3-6x faster than alternatives and has become the de facto standard
- Multilingual optimization: SentencePiece with Unigram outperforms BPE for morphologically rich languages
- Efficiency improvements: BlockBPE and parallel tokenization for faster inference
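The practical payoff of byte-level BPE is the first point above: because it operates on UTF-8 bytes, any string, including emoji or scripts rare in the training data, encodes without an unknown token, and decoding reproduces the input exactly. A quick check (assuming a recent tiktoken release that ships the o200k_base encoding used by GPT-4o):

```python
import tiktoken

# o200k_base is the byte-level BPE encoding published for GPT-4o
enc = tiktoken.get_encoding("o200k_base")

samples = ["hello world", "नमस्ते दुनिया", "🤖🔥", "print('héllo')"]

for s in samples:
    ids = enc.encode(s)
    assert enc.decode(ids) == s            # byte-level BPE round-trips any UTF-8 text
    print(f"{s!r:25} -> {len(ids)} tokens")
```

Note that "no OOV" does not mean "equally efficient": text in underrepresented scripts still tends to fragment into more tokens per word, which is exactly what the multilingual work above targets.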
Byte Pair Encoding (BPE)
How It Works
BPE is an iterative algorithm that starts with individual characters and repeatedly merges the most frequent adjacent pair of tokens; a toy implementation follows the steps below.
- Initialize: Vocabulary = all individual characters (or bytes for byte-level BPE).
- Count: Find the most frequent pair of adjacent tokens in the corpus (e.g., "e" and "r" → "er").
- Merge: Create a new token for that pair.
- Repeat: Continue until a target vocabulary size (e.g., 32k) is reached.
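The four steps map directly onto a few dozen lines of code. The sketch below is a deliberately naive character-level BPE trainer on a toy word list: no byte-level handling, no regex pre-splitting, none of the optimizations real tokenizers use. Function names like train_bpe and merge_pair are mine for illustration, not from any particular library.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent token pairs across all words, weighted by word frequency."""
    counts = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged = pair[0] + pair[1]
    new_corpus = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_corpus[tuple(out)] = freq
    return new_corpus

def train_bpe(words, num_merges):
    # 1. Initialize: each word starts as a sequence of single characters
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # 2. Count: find the most frequent adjacent pair in the corpus
        counts = get_pair_counts(corpus)
        if not counts:
            break
        best = max(counts, key=counts.get)
        # 3. Merge: create a new token for that pair
        corpus = merge_pair(corpus, best)
        merges.append(best)
    # 4. Repeat until the merge budget (i.e. the target vocabulary size) is exhausted
    return merges

merges = train_bpe(["lower", "lowest", "newer", "newest", "wider"], num_merges=10)
print(merges)  # first merge is ('w', 'e'), the most frequent pair in this toy corpus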