Tokenization: The Atomic Unit of LLMs

"If you don't understand tokenization, you don't understand why LLMs fail at simple tasks."

Tokenization is the process of converting raw text into a sequence of integers (IDs) that a model can process. It is the very first step in the pipeline and often the source of many "hallucinations" related to math, spelling, and coding.


Why Do We Need Tokenization?

Computers understand numbers, not strings. We need a way to map text to numbers.

The Spectrum of Granularity

We could tokenize at different levels:

| Method | Vocabulary Size | Sequence Length | Pros | Cons |
|---|---|---|---|---|
| Character | Small (~100-256) | Very long | No OOV (out-of-vocabulary) issues | Context window fills up fast; individual characters carry little meaning. |
| Word | Massive (1M+) | Short | Semantically rich | "Rare word" problem; huge embedding matrix. |
| Subword (BPE) | Optimal (~32k-100k) | Medium | Balances efficiency and flexibility. | More complex to implement. |

Modern LLMs universally use Subword Tokenization (specifically BPE or variants).

2025 State of Tokenization

Key Developments:

  • Byte-level BPE is now standard (GPT-4o, Llama 3/4) - handles all Unicode without OOV errors
  • tiktoken dominance: OpenAI's tokenizer is 3-6x faster than alternatives, becoming de facto standard
  • Multilingual optimization: SentencePiece with Unigram outperforms BPE for morphologically rich languages
  • Efficiency improvements: BlockBPE and parallel tokenization for faster inference

Byte Pair Encoding (BPE)

How It Works

BPE is an iterative algorithm that starts with characters and keeps merging the most frequent adjacent pair of tokens.

  1. Initialize: Vocabulary = all individual characters (or bytes for byte-level BPE).
  2. Count: Find the most frequent pair of adjacent tokens in the corpus (e.g., "e" and "r" → "er").
  3. Merge: Create a new token for that pair.
  4. Repeat: Continue until a target vocabulary size (e.g., 32k) is reached.

Interactive Example

Consider the corpus: ["hug", "pug", "pun", "bun"]

  1. Start: h, u, g, p, n, b
  2. Most frequent pair: u + g → ug
  3. New state: hug → [h, ug], pug → [p, ug]; the vocabulary gains the ug token
  4. Next most frequent pair: u + n → un
  5. Final vocabulary: h, u, g, p, n, b, ug, un (the base characters stay alongside the merges)

Now, hug is encoded as [h, ug].

Byte-Level BPE (2025 Standard)

Problem: Character-level BPE must either start from a huge base vocabulary (on the order of 150K possible Unicode characters) or restrict itself to characters seen in training, in which case unseen characters at inference time are still OOV.

Solution: Byte-level BPE operates on UTF-8 bytes directly:

  • Base vocabulary: 256 bytes (covers ALL Unicode without OOV)
  • Example: "é" is the two UTF-8 bytes 195, 169 (0xC3 0xA9), unless that byte pair has been learned as a single merged token
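
You can confirm the byte values directly in Python:

print(list("é".encode("utf-8")))  # [195, 169] -- the two UTF-8 bytes behind 'é'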

Why it matters:

import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")

# Handles any Unicode without errors
print(enc.encode("你好世界")) # Chinese: [32409, 30255, 9892, 162]
print(enc.encode("こんにちは")) # Japanese: [32864, 25669, 32465, 27414, 28821]
print(enc.encode("مرحبا")) # Arabic: [2174, 1945, 10982, 2686]

Adoption:

  • GPT-2/3/4: Byte-level BPE
  • Llama 3/4: byte-level BPE built on tiktoken with a 128k-token vocabulary (related to, but not identical to, the GPT-4 vocabulary)
  • Claude: Custom byte-level BPE variant

The Python Implementation (simplified)

import collections
import re

def get_stats(vocab):
    # Count how often each adjacent symbol pair occurs, weighted by word frequency.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    # Replace every occurrence of the chosen pair with the merged symbol.
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out
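
A minimal driver loop tying these helpers to the toy corpus above (the vocabulary keys are space-separated symbols, as the functions expect):

vocab = {"h u g": 1, "p u g": 1, "p u n": 1, "b u n": 1}  # toy corpus, word -> frequency

num_merges = 2
for step in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    print(f"merge {step + 1}: {best} -> {''.join(best)}")

print(vocab)  # {'h ug': 1, 'p ug': 1, 'p un': 1, 'b un': 1}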

Algorithm Showdown: BPE vs WordPiece vs Unigram

For interviews, know the difference between these three.

| Feature | BPE (GPT-2/3/4, Llama) | WordPiece (BERT) | Unigram (T5, ALBERT) |
|---|---|---|---|
| Merge Strategy | Deterministic: merge the most frequent pair. | Probabilistic: merge the pair that most boosts the likelihood of the data (PMI). | Probabilistic: start with a massive vocabulary, prune the least useful tokens. |
| Philosophy | Bottom-up (chars → subwords). | Bottom-up. | Top-down (all substrings → keep the best). |
| Regularization | No (deterministic). | No. | Subword regularization: can sample different splits during training (adds noise/robustness). |
| Vocabulary Init | Small (chars/bytes) → grow. | Small → grow. | Large (all substrings) → shrink. |
| Token Selection | Frequency-based. | PMI-based (pointwise mutual information). | Probability-based (unigram language model). |
| Fertility (avg tokens/word) | Medium (~2.5-3.0). | High (~3.0-3.5). | Low (~2.0), best compression. |
| Morphology | Less interpretable. | Moderate. | Best: more morphologically interpretable tokens. |
| Library | tiktoken, HuggingFace. | HuggingFace. | SentencePiece (default). |

2025 Research Insights

Unigram outperforms BPE on morphology preservation:

  • Bostrom & Durrett (2020): Unigram produces more morphologically interpretable tokens
  • Example: destabilizing → Unigram: de + stabilizing, BPE: dest + abil + iz + ing
  • Downstream impact: Models trained on Unigram tokens show better fine-tuning performance

When to use each:

  • BPE: Default choice, efficient, widely adopted (GPT, Llama)
  • WordPiece: BERT-style models, when you need PMI-based merging
  • Unigram: Multilingual models, morphologically rich languages (Arabic, Turkish, Finnish), when compression matters

Note: Most generative models (GPT family, Llama) use BPE because it's standard and efficient. T5 uses SentencePiece (Unigram) which handles multilingual text slightly better.
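
To feel the difference in practice, the HuggingFace tokenizers library can train both algorithms on the same data. This is a rough sketch: the toy corpus, vocabulary size, and resulting splits are illustrative, and meaningful differences only emerge on large corpora.

from tokenizers import Tokenizer, models, trainers, pre_tokenizers

corpus = ["destabilizing the system", "stabilizing forces", "the system is stable"] * 100

def train(model, trainer):
    # Build a tokenizer with a simple whitespace pre-tokenizer and train it on the corpus.
    tok = Tokenizer(model)
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    tok.train_from_iterator(corpus, trainer)
    return tok

bpe = train(models.BPE(), trainers.BpeTrainer(vocab_size=60, special_tokens=["[UNK]"]))
uni = train(models.Unigram(), trainers.UnigramTrainer(vocab_size=60, special_tokens=["[UNK]"]))

# Compare how the two vocabularies segment the same word.
print("BPE:    ", bpe.encode("destabilizing").tokens)
print("Unigram:", uni.encode("destabilizing").tokens)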


The "Strawberry" Problem

Why does GPT-4 fail to count the 'r's in "Strawberry"?

Answer: Because it never sees the word "Strawberry" as a sequence of letters. It sees token IDs.

import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
print(enc.encode("Strawberry"))
# Output: two token IDs, corresponding to something like ["Straw", "berry"] (exact IDs depend on the encoding)

The model receives [ID_1, ID_2].

  • ID_1 ("Straw") vector: contains semantic concept of "dried stalk", "drinking tube".
  • ID_2 ("berry") vector: contains semantic concept of "small fruit".

Unless the model has memorized the spelling of every token ID during training (which it tries to do, but imperfectly), it cannot "count" letters.

Implication for Interviews:

  • Don't ask LLMs to perform character-level manipulation (reversing strings, ciphers) without tools.
  • This is a fundamental architectural limitation, not just "bad training".
  • Workaround: Use tools/code for character-level tasks, not raw LLM inference (see the sketch below).
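
A minimal illustration of the gap, assuming tiktoken is installed: the decoded token pieces show what the model "sees", while one line of ordinary code counts the letters exactly.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

word = "Strawberry"
token_ids = enc.encode(word)
pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]

print(pieces)           # the subword chunks the model actually receives
print(word.count("r"))  # 3 -- trivial for code, unreliable for raw LLM inference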

2025 Update: Strawberry Benchmark

Different tokenizers handle this differently:

# GPT-4o (o200k_base)
# "Strawberry" -> two tokens, roughly ["Straw", "berry"]
# No token corresponds to an individual letter, so counting 'r's relies on memorized spellings

# Llama 3 (tiktoken-style BPE, 128k vocab)
# "strawberry" -> a handful of subword tokens, e.g. ["str", "aw", "berry"]
# Still split, just with different boundaries

# Claude (custom byte-level BPE)
# "strawberry" -> a similar subword split
# Same limitation

No modern tokenizer solves this - it's inherent to subword tokenization.


Technical Deep Dive

1. Pre-tokenization

Before BPE runs, text is normalized.

Unicode Normalization:

  • NFC (Canonical Composition): é as single character (U+00E9)
  • NFD (Canonical Decomposition): e + ´ (U+0065 U+0301)
  • Impact: Affects tokenization boundaries and vocabulary size
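
A quick way to see the effect, using Python's standard unicodedata module together with tiktoken:

import unicodedata
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

nfc = unicodedata.normalize("NFC", "café")  # 'é' as one code point (U+00E9)
nfd = unicodedata.normalize("NFD", "café")  # 'e' + combining acute (U+0065 U+0301)

print(len(nfc), len(nfd))                           # 4 vs 5 code points
print(len(enc.encode(nfc)), len(enc.encode(nfd)))   # token counts usually differ too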

Splitting Rules:

  • GPT-style pre-tokenizers apply a regex that chunks text into runs of letters, digits, whitespace, and punctuation, with special patterns for common contractions ('s, 't, 're, ...)
  • Ensures punctuation is handled consistently
  • Example: "don't" → ["don", "'t"] under GPT-2-style rules; other tokenizers may produce ["do", "n't"]

2. Space Handling

Approaches differ by tokenizer:

| Tokenizer | Space Representation | Example |
|---|---|---|
| SentencePiece (Llama 1/2, T5) | Treats the space as a character: the meta-symbol ▁ (U+2581, often rendered "_") or a byte fallback like <0x20> | " Hello" → ▁Hello |
| tiktoken (GPT) | The leading space is part of the token itself | " Hello" → " Hello" |
| WordPiece (BERT) | Uses ## for continuations; no leading-space token | " Hello" → Hello |

Implication: " hello" and "hello" have different IDs. This is why prompts are sensitive to trailing spaces.
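
A two-line check with tiktoken:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

print(enc.encode("hello"))   # one token sequence...
print(enc.encode(" hello"))  # ...and a different one once the leading space is attached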

2025 Update:

  • Most modern tokenizers use byte-level BPE where space is just byte 0x20
  • Avoids special handling, more consistent across languages

3. Vocabulary Size Trade-offs

Why not use 1 million tokens?

Embedding Matrix Size: V × d_model (vocabulary size times embedding dimension)

  • A 100k vocab with 4096 dimensions ≈ 410M parameters just for embeddings!
  • A 32k vocab with 4096 dimensions = 131M parameters
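
The arithmetic, as a quick sanity check (input embeddings only; untied output embeddings double the count):

d_model = 4096

for vocab_size in (32_000, 100_000, 128_000):
    params = vocab_size * d_model  # one d_model-sized row per vocabulary entry
    print(f"{vocab_size:>7} x {d_model} = {params / 1e6:.0f}M embedding parameters")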

Diminishing Returns:

  • Rare tokens are seen so infrequently the model doesn't learn good embeddings
  • Optimal range: 32k-100k for most models
  • Llama 2: 32k vocab
  • GPT-2: 50k vocab
  • GPT-4 / GPT-3.5-turbo: ~100k vocab (cl100k_base)
  • GPT-4o: ~200k vocab (o200k_base)
  • Llama 3: 128k vocab

2025 Research:

  • Ali et al. (2024): 33k and 50k vocabularies performed better on English tasks than larger sizes
  • Multilingual trade-off: Larger vocabs (100k+) needed for multilingual models
  • Domain-specific: Code models benefit from larger vocabs (150k+ for programming tokens)

4. Token Efficiency by Language

Not all languages tokenize equally:

| Language | Tokens per Word (approx.) | Efficiency |
|---|---|---|
| English | ~1.0-1.3 tokens/word (≈0.75 words/token) | ★★★★★ (most efficient) |
| Spanish/French/German | 1.2-1.5 tokens/word | ★★★★☆ |
| Chinese/Japanese/Korean | 2.0-3.0 tokens/word | ★★★☆☆ |
| Arabic/Hebrew | 2.5-3.5 tokens/word | ★★☆☆☆ |
| Thai/Lao/Khmer | 3.0-4.0 tokens/word | ★★☆☆☆ |
| Code (programming) | 0.5-1.5 BPE tokens per source-code token | ★★★★☆ (depends on language) |

Implication:

  • API usage is more expensive for non-English languages
  • Same prompt in Chinese can cost 3x more than in English
  • Workaround: Use language-specific tokenizers or compression

Special Tokens Map

Knowing these is crucial for debugging raw model inputs.

  • BOS (start of sequence) — GPT-4o: –; Llama 3/4: <|begin_of_text|>
  • EOS (end of sequence) — GPT-4o: <|endoftext|>; Llama 3/4: <|end_of_text|>
  • PAD — GPT-4o: –; Llama 3/4: –; padding tokens are used for batching (making all sequences the same length)
  • Role start — GPT-4o: –; Llama 3/4: <|start_header_id|> ... <|end_header_id|> around the role name
  • Role/turn end — GPT-4o: –; Llama 3/4: <|eot_id|>
  • Image — multimodal variants insert an image placeholder token (e.g. <image> or <|image|>) where image embeddings are injected

2025 Update:

  • Modern models mark conversation structure with short sequences of special tokens rather than a single token
  • Example: Llama 3 uses <|start_header_id|>user<|end_header_id|> for role marking
  • Purpose: Enables fine-grained control over conversation structure
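
A quick way to see these role tokens in context is the HuggingFace chat-template API. This sketch assumes access to the gated Meta-Llama-3 Instruct repo; any chat model with a template behaves similarly.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Hi!"},
]

# Render the conversation as text (tokenize=False) to inspect where
# <|begin_of_text|>, <|start_header_id|>, <|end_header_id|>, and <|eot_id|> are placed.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))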

Security: Tokenization Attacks

Prompt Injection via Token Splitting: Adversaries can bypass safety filters by splitting forbidden words into unusual tokens that the safety filter (often a simpler classifier) doesn't recognize, but the LLM reconstructs.

Example: If "bomb" is banned:

  • User Input: "b" + "omb"
  • Tokenizer: [ID_b, ID_omb]
  • Safety Filter: "I don't see 'bomb'".
  • LLM: Concatenates embeddings → "bomb".

2025 Attack Vectors

Unicode Homoglyphs:

  • Uses visually similar characters from different scripts
  • Example: "аdmin" (Cyrillic 'а') vs "admin" (Latin 'a')
  • Tokenizers handle these differently, potentially bypassing filters

Token Smuggling:

  • Break malicious content across token boundaries
  • Example: "D<|ROT|>ROP" where <|ROT|> is a special token
  • After tokenization, reconstructs to "DROP"

Defense Strategies:

  1. Normalization: Normalize Unicode before tokenization (e.g. NFC or NFKC); a minimal sketch follows this list
  2. Token-level filtering: Apply safety at token level, not string level
  3. Adversarial training: Train on token-split attacks during alignment
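
A minimal sketch of the normalization defense (item 1 above), assuming a simple substring blocklist; production filters are far more involved.

import unicodedata

BANNED = {"bomb"}  # illustrative blocklist

def normalize(text: str) -> str:
    # NFKC folds compatibility forms (fullwidth letters, ligatures) and casefold removes
    # case tricks; cross-script homoglyphs (e.g. Cyrillic 'а') still need a separate
    # mixed-script check on top of this.
    return unicodedata.normalize("NFKC", text).casefold()

def violates_policy(full_prompt: str) -> bool:
    # Filter the normalized, fully assembled prompt -- not individual fragments --
    # so "b" + "omb"-style splits are seen the way the model will see them.
    cleaned = normalize(full_prompt).replace("\u200b", "")  # drop zero-width spaces
    return any(term in cleaned for term in BANNED)

print(violates_policy("b" + "omb"))  # True: the concatenated text is checked
print(violates_policy("ＢＯＭＢ"))    # True: fullwidth letters are folded by NFKC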

2025: Performance Optimizations

BlockBPE (Parallel BPE Tokenization)

Problem: BPE is inherently sequential - must apply merge rules in order.

Solution: BlockBPE processes tokenization in parallel blocks.

  • Speedup: 3-5x faster for long texts
  • Trade-off: Minor quality loss in math/code tasks
  • Status: Research stage (arXiv:2507.11941)

GPU Tokenization

Problem: CPU tokenization becomes bottleneck at high throughput.

Solution: Move tokenization to GPU.

  • Libraries: TensorRT-LLM, vLLM exploring GPU tokenizers
  • Challenge: Requires major architecture changes
  • 2025 Status: Early research, not production-ready

Token Caching

Technique: Cache tokenization results for common prompts.

  • System prompts: Cache system prompt tokenization
  • Templates: Cache prompt templates with variables
  • Savings: 10-30% latency reduction for chat applications
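
A minimal sketch of count-caching with Python's functools.lru_cache (the system prompt string is illustrative):

from functools import lru_cache
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o family encoding

@lru_cache(maxsize=4096)
def cached_token_count(text: str) -> int:
    # Tokenization is deterministic for a given encoding, so counts are safe to cache.
    return len(enc.encode(text))

SYSTEM_PROMPT = "You are a helpful assistant."  # hypothetical reusable prompt

# The first call pays the tokenization cost; repeats are dictionary lookups.
for _ in range(3):
    cached_token_count(SYSTEM_PROMPT)
print(cached_token_count.cache_info())  # hits=2, misses=1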

Libraries and Tools

tiktoken (OpenAI)

Why use it:

  • 3-6x faster than HuggingFace tokenizers
  • Implemented in Rust with Python bindings (community ports such as tiktoken-rs cover other languages)
  • Standard for GPT-2/3/4 models
import tiktoken

# Load tokenizer
enc = tiktoken.encoding_for_model("gpt-4o")

# Encode text
tokens = enc.encode("Hello, world!")
print(tokens)  # a short list of token IDs (exact values depend on the encoding)

# Count tokens
count = len(tokens)
print(f"Token count: {count}")

# Decode back to text
text = enc.decode(tokens)
print(text) # "Hello, world!"

2025 Update: Now available in R, Go, JavaScript, Rust via community bindings.

HuggingFace Tokenizers

Why use it:

  • Most comprehensive: Supports BPE, WordPiece, Unigram
  • Production-ready: Written in Rust, Python bindings
  • Integration: Works seamlessly with Transformers library
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Encode
tokens = tokenizer.encode("Hello, world!")
print(tokens)  # token IDs with the model's BOS token prepended (exact IDs depend on the tokenizer)

# Fast tokenization
# Uses Rust backend, very fast
inputs = tokenizer(["Hello", "world"], padding=True, return_tensors="pt")

SentencePiece (Google)

Why use it:

  • Language-agnostic: Treats text as raw byte stream
  • Multilingual: Excellent for non-space languages (Chinese, Japanese, Thai)
  • Unigram + BPE: Implements both algorithms
import sentencepiece as spm

# Train tokenizer
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    vocab_size=32000,
    model_type='unigram',  # or 'bpe', 'char', 'word'
    user_defined_symbols=['<user>', '<assistant>']
)

# Load and use
sp = spm.SentencePieceProcessor()
sp.load('m.model')
tokens = sp.encode("Hello, world!")
print(tokens) # [1532, 12, 2359, 37]

Spring AI Tokenization API

Spring AI provides tokenization utilities for estimating costs and managing context windows in production applications.

Token Counting Service

// Token counting with Spring AI
// Note: `Tokenizer` is an assumed application-level abstraction here (e.g. a thin
// wrapper around Spring AI's token count estimation); adapt the calls to your actual API.
@Service
public class TokenizationService {

    private final Tokenizer tokenizer;
    private final PricingService pricingService; // assumed pricing helper

    public int countTokens(String text) {
        return tokenizer.count(text);
    }

    // Demonstration of the "Strawberry problem"
    public void demonstrateTokenizationIssue() {
        String text = "Strawberry";
        int count = tokenizer.count(text); // may return 2, not 10 (characters)
        // Tokens: ["Straw", "berry"] - the model doesn't see individual letters,
        // which is why LLMs struggle with character-level tasks
    }

    // Cost estimation before the API call
    public CostEstimate estimateCost(String prompt, String model) {
        int promptTokens = tokenizer.count(prompt);
        int estimatedOutput = promptTokens / 2; // rough heuristic
        int totalTokens = promptTokens + estimatedOutput;

        return new CostEstimate(
            model,
            totalTokens,
            pricingService.calculate(model, totalTokens)
        );
    }
}

Cost Optimization Strategies

// Service for optimizing token usage
@Service
public class CostOptimizationService {

    private final Tokenizer tokenizer;
    private final ChatClient chatClient;

    // Truncate a prompt so it fits within the context window
    public String fitInContext(String longPrompt, int maxTokens) {
        int currentTokens = tokenizer.count(longPrompt);

        if (currentTokens <= maxTokens) {
            return longPrompt;
        }

        // Estimate how much to truncate based on the character/token ratio
        double ratio = (double) maxTokens / currentTokens;
        int targetLength = (int) (longPrompt.length() * ratio);

        // Truncate, then shrink further until the token count actually fits
        String truncated = longPrompt.substring(0, targetLength);
        while (tokenizer.count(truncated) > maxTokens && targetLength > 0) {
            targetLength -= 100;
            truncated = longPrompt.substring(0, Math.max(0, targetLength));
        }

        return truncated;
    }

    // Batch processing with a per-request token budget
    public List<String> processBatch(List<String> inputs, int maxTokensPerRequest) {
        List<String> results = new ArrayList<>();

        for (String input : inputs) {
            int tokens = tokenizer.count(input);
            if (tokens > maxTokensPerRequest) {
                // Truncate (leaving headroom) and flag it; helper not shown here
                String truncated = fitInContext(input, maxTokensPerRequest - 100);
                results.add(processWithTruncationWarning(truncated));
            } else {
                results.add(chatClient.prompt().user(input).call().content());
            }
        }

        return results;
    }
}

Handling Multilingual Input in Production

// Multilingual token counting and cost estimation
@Service
public class MultilingualTokenService {

    private final Tokenizer tokenizer;

    // Estimate tokens for different languages
    public LanguageEstimate estimateByLanguage(String text, String language) {
        int tokens = tokenizer.count(text);
        int words = text.split("\\s+").length;

        // Rough, illustrative tokens-per-word heuristics; real ratios vary by tokenizer
        double tokensPerWord = switch (language.toLowerCase()) {
            case "english" -> 0.75;
            case "spanish", "french", "german" -> 1.3;
            case "chinese", "japanese", "korean" -> 2.5;
            case "arabic", "hebrew" -> 3.0;
            default -> 1.5;
        };

        double expectedTokens = words * tokensPerWord;
        double efficiency = expectedTokens / tokens; // higher is better

        return new LanguageEstimate(
            language,
            tokens,
            words,
            tokensPerWord,
            efficiency
        );
    }

    // Warn users about multilingual costs
    public String getCostWarning(String text, String language) {
        LanguageEstimate estimate = estimateByLanguage(text, language);

        if (estimate.efficiency() < 0.5) {
            return String.format(
                "Warning: %s is less token-efficient than English. " +
                "This text uses %.2f tokens/word (vs 0.75 for English). " +
                "Estimated cost: %.1fx higher.",
                language,
                estimate.tokensPerWord(),
                1.0 / estimate.efficiency()
            );
        }
        return "Token usage is within expected range.";
    }
}

Token Budget Management

// Managing token budgets across requests
@Component
public class TokenBudgetManager {

    private static final int DEFAULT_BUDGET = 10_000;

    private final Tokenizer tokenizer;
    private final Map<String, Integer> userBudgets = new ConcurrentHashMap<>();

    // Check if the user has budget left for this request
    public boolean hasBudget(String userId, String prompt) {
        int tokens = tokenizer.count(prompt);
        return getRemainingBudget(userId) >= tokens;
    }

    // Deduct tokens from the user's budget (initializing it on first use)
    public void deductTokens(String userId, String prompt, String response) {
        int totalTokens = tokenizer.count(prompt) + tokenizer.count(response);
        userBudgets.compute(userId, (id, remaining) ->
            (remaining == null ? DEFAULT_BUDGET : remaining) - totalTokens);
    }

    // Get the remaining budget
    public int getRemainingBudget(String userId) {
        return userBudgets.getOrDefault(userId, DEFAULT_BUDGET);
    }
}

Summary for Interviews

  1. LLMs don't read text, they read integer IDs produced by BPE (or Unigram/WordPiece).
  2. BPE balances vocabulary size vs sequence length, but Unigram produces more morphologically interpretable tokens.
  3. Tokenization artifacts cause failures in math, spelling, and reversing strings (the "Strawberry" problem).
  4. Vocab size is a trade-off: larger vocab = shorter sequences (faster inference) but more parameters (VRAM usage). Optimal range: 32k-100k.
  5. Multi-lingual: English ~0.75 words/token. Other languages are less efficient (more tokens/word), making API usage more expensive.
  6. Byte-level BPE (2025 standard): Base vocabulary of 256 bytes, handles all Unicode without OOV errors.
  7. tiktoken is 3-6x faster than alternatives, becoming de facto standard.
  8. Security: Token splitting enables prompt injection attacks - defend with normalization and token-level filtering.
  9. Performance: BlockBPE and GPU tokenization are emerging optimizations for 2025+.

Practice

Use tiktoken in Python to inspect how different strings are broken down. It builds intuition for why prompts fail.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

# Test different languages
texts = [
    "Hello world",       # English
    "Bonjour le monde",  # French
    "你好世界",            # Chinese
    "مرحبا بالعالم",       # Arabic
]

for text in texts:
    tokens = enc.encode(text)
    print(f"{text:20} -> {len(tokens)} tokens: {tokens}")

Also explore the interactive tiktoken app to see tokenization in real-time.