If you're building a RAG (Retrieval-Augmented Generation) system, you're paying for two different types of tokens: embedding tokens for indexing and searching your documents, and LLM tokens for generating answers. They use different tokenizers, have different limits, and cost wildly different amounts. Confusing them leads to budget surprises and broken chunking strategies.
Different Tokenizers, Different Counts
Embedding models and LLMs don't always use the same tokenizer. The same text can produce different token counts depending on which model processes it:
- OpenAI text-embedding-3-small/large: Uses the cl100k_base tokenizer (same as GPT-3.5/GPT-4)
- GPT-4o: Uses the newer o200k_base tokenizer with a 200K vocabulary
- Cohere embed-v3: Uses its own BPE tokenizer
- Voyage AI: Uses a custom tokenizer optimized for code and text
Because o200k_base has a larger vocabulary than cl100k_base, the same text often produces fewer tokens with GPT-4o than with OpenAI's embedding models. A 500-token chunk measured with the embedding tokenizer might only be 420 tokens when sent to GPT-4o.
import tiktoken
text = "Retrieval-augmented generation combines search with LLMs"
# Embedding tokenizer (cl100k_base)
emb_enc = tiktoken.get_encoding("cl100k_base")
print(len(emb_enc.encode(text)))  # token count under the embedding tokenizer
# GPT-4o tokenizer (o200k_base)
llm_enc = tiktoken.get_encoding("o200k_base")
print(len(llm_enc.encode(text)))  # token count under the GPT-4o tokenizer
The difference is usually small for English text but can be significant for code, URLs, or non-Latin scripts.
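To see this yourself, compare the two encodings on prose, a URL, and a bit of code. The sample strings below are made up for illustration, and exact counts depend on your tiktoken version, so print them rather than assuming fixed numbers:

import tiktoken

cl100k = tiktoken.get_encoding("cl100k_base")  # embedding-side tokenizer
o200k = tiktoken.get_encoding("o200k_base")    # GPT-4o tokenizer

samples = {
    "prose": "Retrieval quality depends heavily on how you chunk your documents.",
    "url": "https://api.example.com/v1/embeddings?model=text-embedding-3-small",
    "code": "def chunk(t, n=512): return [t[i:i+n] for i in range(0, len(t), n)]",
}

for name, sample in samples.items():
    # The gap between the two counts tends to be larger for URLs and code
    # than for plain English prose.
    print(f"{name}: cl100k={len(cl100k.encode(sample))}, o200k={len(o200k.encode(sample))}")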
Pricing: Orders of Magnitude Apart
Embedding tokens are dramatically cheaper than LLM tokens:
- text-embedding-3-small: $0.02 per million tokens
- text-embedding-3-large: $0.13 per million tokens
- GPT-4o input: $2.50 per million tokens (125x more than embedding-small)
- GPT-4o output: $10.00 per million tokens (500x more than embedding-small)
This means embedding a million-document corpus is cheap — the expensive part is the LLM calls that use the retrieved chunks. A RAG query that retrieves 5 chunks of 500 tokens each adds 2,500 input tokens to your LLM call. At GPT-4o rates, that's $0.00625 per query just for the context.
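A quick back-of-the-envelope script makes the asymmetry concrete. The rates are the per-million-token prices listed above; the query, chunk, and answer sizes are assumptions you should replace with your own measurements:

# Per-million-token rates from the list above; update if your pricing differs
EMBED_PRICE = 0.02 / 1_000_000        # text-embedding-3-small
LLM_INPUT_PRICE = 2.50 / 1_000_000    # GPT-4o input
LLM_OUTPUT_PRICE = 10.00 / 1_000_000  # GPT-4o output

query_tokens = 50         # the user question, embedded once for retrieval
context_tokens = 5 * 500  # five retrieved chunks of ~500 tokens each
answer_tokens = 500       # a typical generated answer

embedding_cost = query_tokens * EMBED_PRICE
llm_cost = context_tokens * LLM_INPUT_PRICE + answer_tokens * LLM_OUTPUT_PRICE

print(f"embedding: ${embedding_cost:.6f} per query")  # $0.000001
print(f"LLM:       ${llm_cost:.5f} per query")        # $0.01125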
Token Limits Are Different Too
Embedding models have their own input limits, separate from LLM context windows:
- text-embedding-3-small/large: 8,191 tokens max input
- Cohere embed-v3: 512 tokens max input
- Voyage AI voyage-3: 32,000 tokens max input
If your chunks exceed the embedding model's limit, the API will either error or silently truncate. This is why chunk size must be calibrated to the embedding model's limit, not the LLM's context window.
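A simple guard before calling the embedding API avoids silent truncation. This is a minimal sketch assuming text-embedding-3 and its 8,191-token limit; truncating by decoding the first N tokens is one fallback, re-splitting the chunk is the gentler one:

import tiktoken

MAX_EMBED_TOKENS = 8191  # text-embedding-3-small/large input limit
enc = tiktoken.get_encoding("cl100k_base")

def fit_for_embedding(text: str) -> str:
    # Count with the embedding tokenizer and truncate if the chunk would
    # exceed the model's input limit.
    tokens = enc.encode(text)
    if len(tokens) <= MAX_EMBED_TOKENS:
        return text
    return enc.decode(tokens[:MAX_EMBED_TOKENS])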
Practical Guide for RAG Developers
Chunk Sizing
Size your chunks based on the embedding model's token limit, with headroom:
# For text-embedding-3-small (8,191-token limit)
import tiktoken

MAX_CHUNK_TOKENS = 512  # sweet spot for retrieval quality, well under the limit
OVERLAP_TOKENS = 50     # overlap between consecutive chunks

# Count with the EMBEDDING tokenizer, not the LLM tokenizer
emb_enc = tiktoken.get_encoding("cl100k_base")

def chunk_text(text, max_tokens=MAX_CHUNK_TOKENS):
    tokens = emb_enc.encode(text)
    chunks = []
    # Step by (max_tokens - OVERLAP_TOKENS) so consecutive chunks share context
    for i in range(0, len(tokens), max_tokens - OVERLAP_TOKENS):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append(emb_enc.decode(chunk_tokens))
    return chunks
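Running the function over a long document and re-counting with the same tokenizer confirms every chunk stays within budget (the repeated sentence is just stand-in text):

doc = "Tokenization is the hidden plumbing of every RAG pipeline. " * 200
chunks = chunk_text(doc)
print(len(chunks), "chunks")
print(max(len(emb_enc.encode(c)) for c in chunks), "tokens in the largest chunk")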
Budget Planning
For a typical RAG application processing 10,000 queries per day with 5 retrieved chunks of 500 tokens each:
- Embedding cost (queries only): 10,000 × 50 tokens × $0.02/M = $0.01/day
- LLM input cost (chunks + prompt): 10,000 × 3,000 tokens × $2.50/M = $75/day
- LLM output cost: 10,000 × 500 tokens × $10/M = $50/day
Embedding costs are negligible; the LLM costs dominate by more than 10,000x ($125/day versus $0.01/day). Optimize your chunk retrieval to minimize the number and size of chunks passed to the LLM, because that's where the money goes.
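To see how the knobs interact, here is a small sketch of the daily LLM bill as a function of retrieval settings. The default prices and token counts mirror the assumptions above; swap in your own model's rates:

def daily_llm_cost(queries_per_day, n_chunks, chunk_tokens,
                   prompt_tokens=500, answer_tokens=500,
                   input_price=2.50e-6, output_price=10.00e-6):
    # GPT-4o rates expressed per token; swap in your own model's pricing.
    input_cost = queries_per_day * (n_chunks * chunk_tokens + prompt_tokens) * input_price
    output_cost = queries_per_day * answer_tokens * output_price
    return input_cost + output_cost

print(daily_llm_cost(10_000, n_chunks=5, chunk_tokens=500))  # 125.0, matching the estimate above
print(daily_llm_cost(10_000, n_chunks=3, chunk_tokens=400))  # fewer, smaller chunks cut the bill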
Re-ranking Saves LLM Tokens
Retrieve more chunks than you need (e.g., top 20), then use a re-ranker to select the best 3–5. Re-ranking costs a fraction of LLM tokens and ensures you only pass the most relevant context to the expensive generation step.
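A minimal sketch of that pattern, assuming a cross-encoder re-ranker from sentence-transformers; the model name and the vector_store retriever are placeholders, not part of any specific stack:

from sentence_transformers import CrossEncoder

# One common cross-encoder re-ranker; the model name is illustrative, not required.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidate_chunks, top_k=5):
    # Score every (query, chunk) pair and keep only the top_k chunks
    # for the expensive LLM call.
    scores = reranker.predict([(query, chunk) for chunk in candidate_chunks])
    ranked = sorted(zip(candidate_chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# candidates = vector_store.search(query, top_k=20)  # hypothetical retriever
# context = rerank(query, candidates, top_k=5)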
In RAG, embedding tokens are cheap and LLM tokens are expensive. Optimize the retrieval pipeline to minimize what gets sent to the LLM.