Prompt caching is one of the most impactful cost-saving features available from major AI providers. When your requests share a common prefix — like a system prompt or reference document — the provider can skip re-processing those tokens and charge you a fraction of the normal price. The savings can reach 50–90% on input tokens.

How Prompt Caching Works

The core idea is simple: if the beginning of your prompt is identical to a recent request, the provider has already computed the internal representations for those tokens. Instead of recomputing them, it reuses the cached result and only processes the new tokens at the end.

Think of it like a compiled template. The first request "compiles" the shared prefix. Subsequent requests reuse that compilation and only process the variable suffix.

Provider Implementations

OpenAI: Automatic Prompt Caching

OpenAI's caching is automatic — no code changes required. When recent requests share a prefix of at least 1,024 tokens, cached tokens are charged at 50% off the normal input price.

  • Minimum cacheable prefix: 1,024 tokens
  • Cache granularity: 128-token blocks
  • Discount: 50% on cached input tokens
  • Cache lifetime: 5–10 minutes of inactivity
  • No API changes needed — it just works

The response includes cached_tokens in the usage object so you can verify caching is working:

"usage": {
  "prompt_tokens": 2048,
  "prompt_tokens_details": {
    "cached_tokens": 1920
  },
  "completion_tokens": 150
}
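
If you call the API through the official Python SDK, the same field is exposed on the response object. A minimal sketch, assuming long_system_prompt and user_query are defined elsewhere and using gpt-4o purely as an illustrative model:

# Verifying OpenAI cache hits via the Python SDK
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": long_system_prompt},  # shared prefix, 1,024+ tokens
        {"role": "user", "content": user_query},             # variable suffix
    ],
)

# A non-zero value confirms the shared prefix was served from cache
print(response.usage.prompt_tokens_details.cached_tokens)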

Anthropic: Explicit Cache Control

Anthropic gives you explicit control over what gets cached using cache_control breakpoints. Cached tokens are charged at 90% off the normal input price, but there's a 25% surcharge on the first request that writes to the cache.

  • Minimum cacheable content: 1,024 tokens (Claude Sonnet and Opus models), 2,048 tokens (Claude Haiku models)
  • Discount: 90% on cached reads, 25% surcharge on cache writes
  • Cache lifetime: 5 minutes (refreshed on each hit)
  • Up to 4 cache breakpoints per request

# Anthropic cache control example
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_system_prompt,             # static content, 1,024+ tokens
        "cache_control": {"type": "ephemeral"}  # marks the end of the cached prefix
    }],
    messages=[{"role": "user", "content": user_query}]
)
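
The response's usage block reports cache activity separately, so you can confirm the breakpoint is doing what you expect:

# cache_creation_input_tokens: billed with the 25% write surcharge (first request)
# cache_read_input_tokens: billed at the 90% discount (subsequent requests)
print(response.usage.cache_creation_input_tokens)
print(response.usage.cache_read_input_tokens)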

Google: Context Caching

Google's Gemini API offers explicit context caching where you create a named cache object and reference it in subsequent requests. Cached tokens are charged at 75% off.

  • Minimum cacheable content: 32,768 tokens
  • Discount: 75% on cached tokens
  • Cache lifetime: configurable (default 1 hour)
  • Additional storage cost per hour

# Google context caching example
import datetime

import google.generativeai as genai
from google.generativeai import caching

# Context caching requires a pinned model version (note the -001 suffix)
cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",
    contents=[large_document],
    ttl=datetime.timedelta(hours=1)
)

model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Summarize this document")
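
You can check that the cache is actually being read by inspecting the response's usage metadata:

# Should roughly match the token count of large_document
print(response.usage_metadata.cached_content_token_count)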

Structuring Requests for Cache Hits

The key to maximizing cache hits is keeping the static content at the beginning and the variable content at the end of your prompt.

Optimal Message Order

1. System prompt (static — cached)
2. Reference documents (static — cached)
3. Few-shot examples (static — cached)
4. Conversation history (semi-static)
5. Current user message (variable — not cached)

If you put the user's message before the reference documents, the cache breaks at the user message and nothing after it gets cached.
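
As a concrete sketch, here is one way to assemble an Anthropic request in that order. The variables holding the static and dynamic pieces (system_instructions, reference_documents, few_shot_examples, conversation_history, current_user_message) are illustrative placeholders, not part of any SDK:

# Static-first prompt assembly (placeholder variables, Anthropic client from above)
static_prefix = [{
    "type": "text",
    "text": system_instructions + reference_documents + few_shot_examples,
    "cache_control": {"type": "ephemeral"}  # everything up to here is cacheable
}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=static_prefix,
    messages=conversation_history + [
        {"role": "user", "content": current_user_message}  # variable suffix, not cached
    ]
)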

Common Mistakes

  • Timestamps in system prompts: Including the current date/time in your system prompt changes it on every request, breaking the cache entirely. If the model needs the date, move it into the user message (see the sketch after this list).
  • Randomized few-shot examples: If you randomly select examples each request, the prefix changes and nothing gets cached.
  • User context before static content: Always put static content first.
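
A minimal sketch of that timestamp fix: the system prompt stays byte-identical across requests, and the date travels in the uncached user turn instead:

# Keep the system prompt static; inject dynamic values into the user message
from datetime import date

user_content = f"Today's date is {date.today().isoformat()}.\n\n{user_query}"
messages = [{"role": "user", "content": user_content}]  # variable suffix only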

When Caching Pays Off

Caching is most valuable when:

  • Your system prompt is long (1,000+ tokens)
  • You include reference documents in every request
  • You make many requests with the same prefix in a short time window
  • Your application has a high request volume

For a typical RAG application with a 5,000-token system prompt and a 10,000-token reference context, prompt caching with Anthropic cuts the input cost of that 15,000-token prefix from roughly $0.045 to roughly $0.0045 per request at Claude Sonnet's $3-per-million-token input rate, a savings of about 90% once the cache is warm.
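
The arithmetic behind those numbers, using Anthropic's published Claude Sonnet rates ($3/MTok input, $3.75/MTok cache writes, $0.30/MTok cache reads) as a back-of-the-envelope sketch:

# Back-of-the-envelope prompt caching economics (Claude Sonnet list prices)
PREFIX_TOKENS = 15_000           # 5,000-token system prompt + 10,000-token reference context

INPUT_PRICE = 3.00 / 1_000_000   # $ per uncached input token
CACHE_WRITE = 3.75 / 1_000_000   # $ per token on the first request (25% surcharge)
CACHE_READ = 0.30 / 1_000_000    # $ per token on subsequent requests (90% discount)

uncached = PREFIX_TOKENS * INPUT_PRICE   # ~$0.045 per request
first_hit = PREFIX_TOKENS * CACHE_WRITE  # ~$0.056 to warm the cache
warm_hit = PREFIX_TOKENS * CACHE_READ    # ~$0.0045 per request once warm

print(f"uncached ${uncached:.4f}  write ${first_hit:.4f}  read ${warm_hit:.4f}")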