If you're building a RAG (Retrieval-Augmented Generation) system, you're paying for two different types of tokens: embedding tokens for indexing and searching your documents, and LLM tokens for generating answers. They use different tokenizers, have different limits, and cost wildly different amounts. Confusing them leads to budget surprises and broken chunking strategies.
Different Tokenizers, Different Counts
Embedding models and LLMs don't always use the same tokenizer. The same text can produce different token counts depending on which model processes it:
- OpenAI text-embedding-3-small/large: Uses the cl100k_base tokenizer (same as GPT-3.5/GPT-4)
- GPT-4o: Uses the newer o200k_base tokenizer with a 200K vocabulary
- Cohere embed-v3: Uses its own BPE tokenizer
- Voyage AI: Uses a custom tokenizer optimized for code and text
Because o200k_base has a larger vocabulary than cl100k_base, the same text often produces fewer tokens with GPT-4o than with OpenAI's embedding models. A 500-token chunk measured with the embedding tokenizer might only be 420 tokens when sent to GPT-4o.
import tiktoken
text = "Retrieval-augmented generation combines search with LLMs"
# Embedding tokenizer (cl100k_base)
emb_enc = tiktoken.get_encoding("cl100k_base")
print(len(emb_enc.encode(text)))  # token count under the embedding tokenizer
# GPT-4o tokenizer (o200k_base)
llm_enc = tiktoken.get_encoding("o200k_base")
print(len(llm_enc.encode(text)))  # token count under the GPT-4o tokenizer
The difference is usually small for English text but can be significant for code, URLs, or non-Latin scripts.
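To see this yourself, compare the two encodings on prose, a URL, and a bit of code. The sample strings below are made up for illustration, and exact counts depend on your tiktoken version, so print them rather than assuming fixed numbers:

import tiktoken

cl100k = tiktoken.get_encoding("cl100k_base")  # embedding-side tokenizer
o200k = tiktoken.get_encoding("o200k_base")    # GPT-4o tokenizer

samples = {
    "prose": "Retrieval quality depends heavily on how you chunk your documents.",
    "url": "https://api.example.com/v1/embeddings?model=text-embedding-3-small",
    "code": "def chunk(t, n=512): return [t[i:i+n] for i in range(0, len(t), n)]",
}

for name, sample in samples.items():
    # The gap between the two counts tends to be larger for URLs and code
    # than for plain English prose.
    print(f"{name}: cl100k={len(cl100k.encode(sample))}, o200k={len(o200k.encode(sample))}")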
Pricing: Orders of Magnitude Apart
Embedding tokens are dramatically cheaper than LLM tokens:
- text-embedding-3-small: $0.02 per million tokens
- text-embedding-3-large: $0.13 per million tokens
- GPT-4o input: $2.50 per million tokens (125x more than embedding-small)
- GPT-4o output: $10.00 per million tokens (500x more than embedding-small)
This means embedding a million-document corpus is cheap — the expensive part is the LLM calls that use the retrieved chunks. A RAG query that retrieves 5 chunks of 500 tokens each adds 2,500 input tokens to your LLM call. At GPT-4o rates, that's $0.00625 per query just for the context.
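A quick back-of-the-envelope script makes the asymmetry concrete. The rates are the per-million-token prices listed above; the query, chunk, and answer sizes are assumptions you should replace with your own measurements:

# Per-million-token rates from the list above; update if your pricing differs
EMBED_PRICE = 0.02 / 1_000_000        # text-embedding-3-small
LLM_INPUT_PRICE = 2.50 / 1_000_000    # GPT-4o input
LLM_OUTPUT_PRICE = 10.00 / 1_000_000  # GPT-4o output

query_tokens = 50         # the user question, embedded once for retrieval
context_tokens = 5 * 500  # five retrieved chunks of ~500 tokens each
answer_tokens = 500       # a typical generated answer

embedding_cost = query_tokens * EMBED_PRICE
llm_cost = context_tokens * LLM_INPUT_PRICE + answer_tokens * LLM_OUTPUT_PRICE

print(f"embedding: ${embedding_cost:.6f} per query")  # $0.000001
print(f"LLM:       ${llm_cost:.5f} per query")        # $0.01125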
Token Limits Are Different Too
Embedding models have their own input limits, separate from LLM context windows:
- text-embedding-3-small/large: 8,191 tokens max input
- Cohere embed-v3: 512 tokens max input
- Voyage AI voyage-3: 32,000 tokens max input
If your chunks exceed the embedding model's limit, the API will either error or silently truncate. This is why chunk size must be calibrated to the embedding model's limit, not the LLM's context window.
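A simple guard before calling the embedding API avoids silent truncation. This is a minimal sketch assuming text-embedding-3 and its 8,191-token limit; truncating by decoding the first N tokens is one fallback, re-splitting the chunk is the gentler one:

import tiktoken

MAX_EMBED_TOKENS = 8191  # text-embedding-3-small/large input limit
enc = tiktoken.get_encoding("cl100k_base")

def fit_for_embedding(text: str) -> str:
    # Count with the embedding tokenizer and truncate if the chunk would
    # exceed the model's input limit.
    tokens = enc.encode(text)
    if len(tokens) <= MAX_EMBED_TOKENS:
        return text
    return enc.decode(tokens[:MAX_EMBED_TOKENS])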
Practical Guide for RAG Developers
Chunk Sizing
Size your chunks based on the embedding model's token limit, with headroom:
# For text-embedding-3-small (8,191-token limit)
import tiktoken

MAX_CHUNK_TOKENS = 512  # sweet spot for retrieval quality, well under the limit
OVERLAP_TOKENS = 50     # overlap between consecutive chunks

# Count with the EMBEDDING tokenizer, not the LLM tokenizer
emb_enc = tiktoken.get_encoding("cl100k_base")

def chunk_text(text, max_tokens=MAX_CHUNK_TOKENS):
    tokens = emb_enc.encode(text)
    chunks = []
    # Step by (max_tokens - OVERLAP_TOKENS) so consecutive chunks share context
    for i in range(0, len(tokens), max_tokens - OVERLAP_TOKENS):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append(emb_enc.decode(chunk_tokens))
    return chunks
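Running the function over a long document and re-counting with the same tokenizer confirms every chunk stays within budget (the repeated sentence is just stand-in text):

doc = "Tokenization is the hidden plumbing of every RAG pipeline. " * 200
chunks = chunk_text(doc)
print(len(chunks), "chunks")
print(max(len(emb_enc.encode(c)) for c in chunks), "tokens in the largest chunk")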
Budget Planning
For a typical RAG application processing 10,000 queries per day with 5 retrieved chunks of 500 tokens each:
- Embedding cost (queries only): 10,000 × 50 tokens × $0.02/M = $0.01/day
- LLM input cost (chunks + prompt): 10,000 × 3,000 tokens × $2.50/M = $75/day
- LLM output cost: 10,000 × 500 tokens × $10/M = $50/day
Embedding costs are negligible; the LLM costs dominate by more than 10,000x ($125/day versus $0.01/day). Optimize your chunk retrieval to minimize the number and size of chunks passed to the LLM, because that's where the money goes.
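To see how the knobs interact, here is a small sketch of the daily LLM bill as a function of retrieval settings. The default prices and token counts mirror the assumptions above; swap in your own model's rates:

def daily_llm_cost(queries_per_day, n_chunks, chunk_tokens,
                   prompt_tokens=500, answer_tokens=500,
                   input_price=2.50e-6, output_price=10.00e-6):
    # GPT-4o rates expressed per token; swap in your own model's pricing.
    input_cost = queries_per_day * (n_chunks * chunk_tokens + prompt_tokens) * input_price
    output_cost = queries_per_day * answer_tokens * output_price
    return input_cost + output_cost

print(daily_llm_cost(10_000, n_chunks=5, chunk_tokens=500))  # 125.0, matching the estimate above
print(daily_llm_cost(10_000, n_chunks=3, chunk_tokens=400))  # fewer, smaller chunks cut the bill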
Re-ranking Saves LLM Tokens
Retrieve more chunks than you need (e.g., top 20), then use a re-ranker to select the best 3–5. Re-ranking costs a fraction of LLM tokens and ensures you only pass the most relevant context to the expensive generation step.
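A minimal sketch of that pattern, assuming a cross-encoder re-ranker from sentence-transformers; the model name and the vector_store retriever are placeholders, not part of any specific stack:

from sentence_transformers import CrossEncoder

# One common cross-encoder re-ranker; the model name is illustrative, not required.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidate_chunks, top_k=5):
    # Score every (query, chunk) pair and keep only the top_k chunks
    # for the expensive LLM call.
    scores = reranker.predict([(query, chunk) for chunk in candidate_chunks])
    ranked = sorted(zip(candidate_chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# candidates = vector_store.search(query, top_k=20)  # hypothetical retriever
# context = rerank(query, candidates, top_k=5)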
In RAG, embedding tokens are cheap and LLM tokens are expensive. Optimize the retrieval pipeline to minimize what gets sent to the LLM.