If you've ever counted tokens for Japanese, Chinese, Korean, Arabic, or Hindi text, you've probably noticed something frustrating: the same meaning expressed in these languages uses 2 to 4 times more tokens than it does in English. This isn't a bug — it's a direct consequence of how tokenizers are trained.
The Root Cause: Training Data Bias
Tokenizers learn their vocabulary from training data. Since most LLM training corpora are heavily weighted toward English (often 50–80% of the data), the tokenizer learns to represent English text very efficiently. Common English words like "the," "and," and "information" each become single tokens.
For languages with different scripts — Chinese characters, Japanese kanji and kana, Arabic script, Devanagari — the tokenizer has seen far fewer examples. It hasn't learned to merge character sequences into efficient tokens for these languages, so it falls back to encoding them as individual characters or even individual bytes.
Real Examples of the Token Gap
The sentence "Artificial intelligence is transforming the world" in different languages (using GPT-4o's tokenizer):
- English: "Artificial intelligence is transforming the world" → ~7 tokens
- Spanish: "La inteligencia artificial está transformando el mundo" → ~10 tokens
- Japanese: "人工知能は世界を変革しています" → ~14 tokens
- Chinese: "人工智能正在改变世界" → ~10 tokens
- Arabic: "الذكاء الاصطناعي يغير العالم" → ~16 tokens
- Hindi: "कृत्रिम बुद्धिमत्ता दुनिया को बदल रही है" → ~25 tokens
Hindi uses roughly 3.5x more tokens than English for the same meaning. This directly translates to 3.5x higher API costs.
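You can reproduce rough numbers like these yourself with OpenAI's open-source tiktoken library, which ships the o200k_base encoding that GPT-4o uses. A minimal sketch (exact counts may differ slightly from the approximations above):

```python
# pip install tiktoken
import tiktoken

# o200k_base is the encoding used by GPT-4o
enc = tiktoken.get_encoding("o200k_base")

samples = {
    "English": "Artificial intelligence is transforming the world",
    "Spanish": "La inteligencia artificial está transformando el mundo",
    "Japanese": "人工知能は世界を変革しています",
    "Chinese": "人工智能正在改变世界",
    "Arabic": "الذكاء الاصطناعي يغير العالم",
    "Hindi": "कृत्रिम बुद्धिमत्ता दुनिया को बदल रही है",
}

baseline = len(enc.encode(samples["English"]))
for language, text in samples.items():
    n = len(enc.encode(text))
    print(f"{language:<9} {n:>3} tokens  ({n / baseline:.1f}x English)")
```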
Why Some Languages Are Worse Than Others
Character Set Size
English uses 26 letters. Chinese has thousands of characters. A tokenizer with a 100K vocabulary can dedicate many entries to common English words and subwords, but it can't possibly have entries for all Chinese character combinations. Each Chinese character often becomes its own token or gets split into bytes.
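You can observe the fallback directly: tiktoken exposes the raw bytes behind each token ID, so a character that isn't in the vocabulary decomposes into several byte-level tokens while a common one maps to a single entry. The characters below are just illustrative; which ones split depends on the vocabulary.

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

# A frequent character is usually a single token; a rarer one may fall
# back to multiple byte-level tokens. Which is which depends on the vocab.
for ch in ["的", "好", "龘"]:
    ids = enc.encode(ch)
    parts = [enc.decode_single_token_bytes(t) for t in ids]
    print(ch, "->", len(ids), "token(s):", parts)
```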
Script Complexity
Languages using Latin script (Spanish, French, German) fare better because they share many subword patterns with English. The tokenizer has learned common Latin-script sequences. Languages with unique scripts (Thai, Tamil, Georgian) get the worst efficiency because the tokenizer has minimal exposure to their character patterns.
Morphological Complexity
Agglutinative languages like Turkish, Finnish, and Korean build long words by combining many morphemes. A single Turkish word like "evlerinizden" ("from your houses") expresses one concept but gets split into many tokens because the tokenizer hasn't learned these specific morpheme combinations.
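To see how an agglutinative word fragments, you can decode each token piece individually. The exact split depends on the vocabulary, so treat this as an illustration rather than a fixed count:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

# ev (house) + ler (plural) + iniz (your) + den (from)
word = "evlerinizden"
ids = enc.encode(word)
pieces = [enc.decode([t]) for t in ids]
print(f"{word!r} -> {len(ids)} tokens: {pieces}")
```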
The GPT-4o Improvement
OpenAI's GPT-4o tokenizer (o200k_base) made significant improvements for non-English languages compared to GPT-4's cl100k_base. By doubling the vocabulary size to 200,000 tokens and training on more multilingual data, GPT-4o reduced token counts for many languages by 20–40%.
For example, Hindi text that used 4x more tokens than English with GPT-4 now uses about 2.5x more with GPT-4o. It's better, but the gap still exists.
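You can measure the improvement on your own text by encoding it with both vocabularies, since tiktoken ships cl100k_base and o200k_base side by side. A quick comparison sketch:

```python
import tiktoken

old_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-3.5
new_enc = tiktoken.get_encoding("o200k_base")   # GPT-4o

texts = {
    "English": "Artificial intelligence is transforming the world",
    "Hindi": "कृत्रिम बुद्धिमत्ता दुनिया को बदल रही है",
}

for language, text in texts.items():
    old_n = len(old_enc.encode(text))
    new_n = len(new_enc.encode(text))
    change = 100 * (new_n - old_n) / old_n
    print(f"{language:<8} cl100k_base: {old_n:>3}  o200k_base: {new_n:>3}  ({change:+.0f}%)")
```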
Mitigation Strategies
- Choose models with better multilingual tokenizers: GPT-4o and Gemini models generally handle non-English text more efficiently than older models.
- Keep system prompts in English: Even if the user interaction is in another language, writing your system prompt in English saves tokens since it's sent with every request.
- Use prompt caching: Caching is especially valuable for non-English applications because the per-token savings are amplified by the higher token counts.
- Consider translation pipelines: For some workflows, translating to English for processing and back to the target language can actually be cheaper than processing in the original language — though this adds latency and potential translation errors.
- Budget accordingly: If your application serves non-English users, multiply your English-based cost estimates by 2–3x to get realistic projections (see the sketch after this list).
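One way to turn that multiplier into a concrete projection is to measure the token ratio of representative user text in each target language against an English baseline, then weight by your traffic mix. A minimal sketch with placeholder numbers — the prices, ratios, and traffic shares below are hypothetical and should be replaced with your own measurements:

```python
# All values below are placeholders: substitute your model's actual
# per-token pricing and ratios measured on your own traffic.
PRICE_PER_MILLION_INPUT_TOKENS = 2.50   # USD, placeholder
ENGLISH_TOKENS_PER_REQUEST = 800        # measured on an English sample

# Token-count ratios vs. English and traffic share per language (illustrative).
language_ratio = {"en": 1.0, "es": 1.3, "ja": 1.8, "hi": 2.5}
traffic_share = {"en": 0.4, "es": 0.2, "ja": 0.2, "hi": 0.2}

blended_ratio = sum(language_ratio[l] * traffic_share[l] for l in traffic_share)
tokens_per_request = ENGLISH_TOKENS_PER_REQUEST * blended_ratio
cost_per_request = tokens_per_request / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

print(f"Blended ratio vs. English: {blended_ratio:.2f}x")
print(f"Estimated input cost per request: ${cost_per_request:.5f}")
```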
The multilingual token gap is narrowing with each new model generation, but it's still significant enough to impact architecture and cost decisions for global applications.