When you send text to a language model, it doesn't read characters or words — it reads tokens. The algorithm that converts raw text into tokens is called a tokenizer, and different models use different tokenization strategies. This is why the same sentence can produce different token counts depending on which model you're using.
Byte Pair Encoding (BPE)
BPE is the most widely used tokenization algorithm in modern LLMs. OpenAI's GPT family, Meta's LLaMA, and Anthropic's Claude all use variants of BPE.
How BPE Learns Its Vocabulary
BPE starts with individual bytes (or characters) and iteratively merges the most frequent adjacent pair into a new token. This process repeats until the vocabulary reaches a target size.
For example, given the training text "aabaabaab":
- Step 1: The most frequent pair is `a` + `a`; merge it into the new token `aa`
- Step 2: The most frequent pair is now `aa` + `b`; merge it into `aab`
- Result: `"aab aab aab"` → 3 tokens instead of 9 characters
In practice, GPT-4's tokenizer (cl100k_base) has a vocabulary of about 100,000 tokens. GPT-4o's tokenizer (o200k_base) expanded this to 200,000, which improved efficiency especially for non-English languages and code.
WordPiece
WordPiece is used by Google's BERT and related models. It's similar to BPE but differs in how it selects which pairs to merge.
The Key Difference
Instead of merging the most frequent pair, WordPiece merges the pair that most increases the likelihood of the training data under the model. In practice this means it favors pairs whose parts occur together far more often than they occur independently, rather than pairs that are merely frequent in absolute terms.
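The commonly cited form of this criterion scores a candidate pair as `count(ab) / (count(a) * count(b))`; Google's exact training code is not public, so treat this as the standard approximation. A sketch contrasting the two selection rules on hypothetical counts:

```python
def best_pair(pair_counts: dict, token_counts: dict, mode: str) -> tuple:
    """Pick the next merge under a BPE-style or WordPiece-style criterion.

    pair_counts:  {(a, b): count of the adjacent pair in the corpus}
    token_counts: {token: count of the individual token}
    """
    if mode == "bpe":
        # BPE: raw pair frequency.
        score = lambda p: pair_counts[p]
    else:
        # WordPiece-style: pair frequency relative to the parts' frequencies.
        score = lambda p: pair_counts[p] / (token_counts[p[0]] * token_counts[p[1]])
    return max(pair_counts, key=score)

# Hypothetical counts: ("th", "e") is very frequent, but "th" and "e" are each
# common on their own; ("q", "u") is rarer but almost always occurs together.
pairs = {("th", "e"): 900, ("q", "u"): 50}
tokens = {"th": 1000, "e": 5000, "q": 55, "u": 600}
```

With these counts, the BPE rule picks `("th", "e")` while the WordPiece rule picks `("q", "u")`, because nearly every occurrence of `q` is followed by `u`.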
WordPiece also uses a special prefix ## to indicate that a token is a continuation of a previous token rather than the start of a new word. For example, "playing" might tokenize as ["play", "##ing"].
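Applying a trained WordPiece vocabulary is typically done with greedy longest-match-first lookup, which is how BERT's tokenizer works at inference time. A simplified sketch with a tiny hypothetical vocabulary:

```python
def wordpiece_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match-first WordPiece tokenization of a single word."""
    tokens, start = [], 0
    while start < len(word):
        prefix = "##" if start > 0 else ""  # continuation pieces carry "##"
        for end in range(len(word), start, -1):  # try the longest piece first
            piece = prefix + word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
    return tokens

# Hypothetical toy vocabulary for illustration.
vocab = {"play", "##ing", "##ed", "p", "##lay"}
```

Here `wordpiece_tokenize("playing", vocab)` returns `["play", "##ing"]`, matching the example above.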
SentencePiece
SentencePiece, developed by Google, takes a different approach entirely. Instead of requiring pre-tokenized (whitespace-split) input, it treats the input as a raw stream of characters, including spaces.
Why This Matters
Languages like Japanese, Chinese, and Thai don't use spaces between words. Traditional tokenizers that split on whitespace first would fail on these languages. SentencePiece handles them natively because it never assumes spaces are word boundaries.
SentencePiece can use either BPE or a unigram language model internally. Google's T5, Gemini, and many multilingual models use SentencePiece with a unigram model.
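The key preprocessing step can be illustrated directly: SentencePiece encodes spaces as a visible meta symbol (▁, U+2581) and then segments the resulting raw character stream, so whitespace is just another symbol rather than a hard word boundary. A simplified sketch of that normalization (the segmentation model itself, BPE or unigram, then runs on this stream; the real library also adds a leading ▁ by default):

```python
def to_raw_stream(text: str) -> list[str]:
    """SentencePiece-style normalization: treat the input as one character
    stream, encoding spaces as the meta symbol '▁' (U+2581)."""
    return list(text.replace(" ", "\u2581"))

# Whitespace survives as an ordinary symbol, so "Hello world" and a
# space-free Japanese sentence flow through exactly the same pipeline.
```

Because no whitespace splitting ever happens, the same code path handles English, Japanese, and Thai without language-specific pre-tokenization.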
Which Models Use Which?
- BPE: GPT-3.5, GPT-4, GPT-4o, Claude, LLaMA, Mistral, Codex
- WordPiece: BERT, DistilBERT, ELECTRA
- SentencePiece (Unigram): T5, Gemini, PaLM, ALBERT, XLNet
- SentencePiece (BPE): LLaMA (uses SentencePiece with BPE mode)
Why the Same Text Gives Different Counts
Each tokenizer has its own learned vocabulary. The sentence "The quick brown fox" might be 4 tokens in one model and 5 in another, depending on whether "quick" is a single token or split into `qu` + `ick`.
Vocabulary size also matters. GPT-4o's 200K vocabulary tokenizes many words as single tokens that GPT-4's 100K vocabulary would split. This means GPT-4o is often more token-efficient for the same text.
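The effect of vocabulary size can be simulated with greedy longest-match tokenization over two hypothetical vocabularies, one a superset of the other:

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenization against a fixed vocabulary.
    Assumes every single character is in the vocabulary."""
    tokens, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):  # longest substring first
            if text[start:end] in vocab:
                tokens.append(text[start:end])
                start = end
                break
    return tokens

# Toy stand-ins for a smaller and a larger learned vocabulary.
small_vocab = set("quick") | {"qu", "ick"}
large_vocab = small_vocab | {"quick"}  # the bigger vocabulary has the whole word
```

The same word costs two tokens under `small_vocab` (`qu` + `ick`) but only one under `large_vocab`, which is the mechanism behind GPT-4o's efficiency gain over GPT-4 on many inputs.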
Always count tokens using the specific tokenizer for your target model. A generic word count or character count will not give you accurate results.
Practical Impact
The tokenizer difference means you can't simply compare "128K context" across models. 128K tokens in GPT-4 holds a different amount of text than 128K tokens in Gemini. When evaluating models for long-document tasks, convert your actual documents to each model's token count to make a fair comparison.