Few-shot prompting — including examples in your prompt to guide the model — is one of the most effective techniques in prompt engineering. It's also one of the most expensive. Each example adds 50–200 tokens to every request. With 5 examples at 100,000 requests per day, you're spending 25–100 million tokens daily just on examples. Here's how to get the same quality with fewer tokens.

How Many Examples Do You Actually Need?

The common assumption is "more examples = better results." Research and practice tell a different story:

  • 0 examples (zero-shot): Works well for tasks the model already understands — summarization, translation, simple classification. Try this first.
  • 1–2 examples: Usually sufficient to establish a format or pattern. The model picks up structure quickly.
  • 3–5 examples: Needed for nuanced tasks where the model needs to understand subtle distinctions (e.g., your specific tone, edge cases in classification).
  • 6+ examples: Rarely improves quality. If you need this many, consider fine-tuning instead; it moves the examples into the model weights, so they add zero tokens to each request.

Always benchmark. Run your task with 1, 2, 3, and 5 examples and measure accuracy. You'll often find that 2 examples perform within 1–2% of 5.
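Here's a minimal benchmark sketch using the OpenAI chat API. Everything task-specific is a placeholder: example_bank, eval_set, the prompt format, and the model name all stand in for your own data and choices.

from openai import OpenAI

client = OpenAI()

# Placeholder data; substitute your own curated examples and held-out labels
example_bank = [{"input": "...", "output": "..."}]
eval_set = [{"input": "...", "label": "..."}]

def classify(text, shots):
    """Few-shot classification via a chat completion (prompt format is illustrative)."""
    demos = "\n\n".join(f"In: {ex['input']}\nOut: {ex['output']}" for ex in shots)
    prompt = f"{demos}\n\nIn: {text}\nOut:"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use your model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def accuracy_at(k):
    """Accuracy on the eval set using the first k examples as shots."""
    shots = example_bank[:k]
    hits = sum(classify(item["input"], shots) == item["label"] for item in eval_set)
    return hits / len(eval_set)

for k in (0, 1, 2, 3, 5):
    print(f"{k} examples: {accuracy_at(k):.1%}")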

Compress Your Examples

Most few-shot examples are longer than they need to be. Compress them without losing the pattern:

Before (85 tokens per example):

User: I ordered a laptop on Monday and it still hasn't arrived.
I've been waiting for over a week now and I'm very frustrated.
Can someone please help me track my order?

Category: shipping_issue
Sentiment: negative
Priority: high

After (32 tokens per example):

In: Laptop ordered Monday, hasn't arrived, waiting over a week, frustrated
Out: shipping_issue | negative | high

The compressed version teaches the same pattern — the model learns the mapping from complaint text to structured output. You've cut 62% of the tokens per example.
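You can check the counts yourself; exact numbers depend on the tokenizer, so treat the 85 and 32 above as approximate. A quick sketch with tiktoken, using the before/after strings from the example:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4/3.5-era tokenizer

before = (
    "User: I ordered a laptop on Monday and it still hasn't arrived. "
    "I've been waiting for over a week now and I'm very frustrated. "
    "Can someone please help me track my order?\n"
    "Category: shipping_issue\nSentiment: negative\nPriority: high"
)
after = (
    "In: Laptop ordered Monday, hasn't arrived, waiting over a week, frustrated\n"
    "Out: shipping_issue | negative | high"
)

for label, text in (("before", before), ("after", after)):
    print(f"{label}: {len(enc.encode(text))} tokens")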

Dynamic Few-Shot Selection

Instead of including the same static examples in every request, select examples dynamically based on the input. This gives the model the most relevant examples while keeping the count low.

import numpy as np
from openai import OpenAI

client = OpenAI()

# Pre-compute embeddings for your example bank
example_bank = [
    {"input": "...", "output": "...", "embedding": [...]},
    # ... hundreds of examples
]

def select_examples(user_input, k=2):
    """Pick the k most similar examples."""
    input_emb = client.embeddings.create(
        model="text-embedding-3-small",
        input=user_input
    ).data[0].embedding

    # OpenAI embeddings are unit-length, so a dot product is cosine similarity
    similarities = [
        np.dot(input_emb, ex["embedding"])
        for ex in example_bank
    ]
    # argsort ascending: the most similar example lands last, nearest the query
    top_k = np.argsort(similarities)[-k:]
    return [example_bank[i] for i in top_k]

# Build the prompt with only the relevant examples
# (user_query and format_prompt stand in for your own input and prompt builder)
examples = select_examples(user_query, k=2)
prompt = format_prompt(examples, user_query)

This approach gives you the quality of a large example bank with the token cost of just 2 examples per request. The embedding lookup costs fractions of a cent.
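To fill the [...] placeholders, embed the bank once, offline. The embeddings endpoint accepts a batch of inputs, so this is cheap. A sketch that reuses the client above, where raw_examples is assumed to be your own list of input/output pairs:

def build_example_bank(raw_examples, batch_size=1000):
    """Attach an embedding to each example, batching requests."""
    bank = []
    for i in range(0, len(raw_examples), batch_size):
        batch = raw_examples[i:i + batch_size]
        resp = client.embeddings.create(
            model="text-embedding-3-small",
            input=[ex["input"] for ex in batch],
        )
        bank.extend(
            {**ex, "embedding": item.embedding}
            for ex, item in zip(batch, resp.data)
        )
    return bank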

Use the System Message for Patterns

Move the pattern description into the system message and keep examples minimal:

# System message (sent once, cached with prompt caching)
Classify support tickets. Output format: category | sentiment | priority.
Categories: shipping_issue, billing, technical, account, other.
Sentiment: positive, neutral, negative.
Priority: low, medium, high.

Example:
In: Package damaged on arrival, want refund
Out: shipping_issue | negative | high

With prompt caching (automatic on OpenAI once a prompt passes the length threshold; opt-in via cache_control on Anthropic), the system message tokens are cached after the first request. You pay full price once (plus a small cache-write premium on Anthropic), then roughly 50% less on OpenAI and 90% less on Anthropic for the cached portion of subsequent requests. This makes a longer system message with one good example extremely cost-effective.
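On Anthropic you opt in by marking the system block as cacheable. A minimal sketch (the model name is a placeholder, and caching only engages once the prompt passes a minimum length, around 1,024 tokens on most models, so this toy system message would need your full instructions before it caches):

import anthropic

client = anthropic.Anthropic()

SYSTEM = """Classify support tickets. Output format: category | sentiment | priority.
Categories: shipping_issue, billing, technical, account, other.
Sentiment: positive, neutral, negative.
Priority: low, medium, high.

Example:
In: Package damaged on arrival, want refund
Out: shipping_issue | negative | high"""

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model name
    max_tokens=50,
    system=[{
        "type": "text",
        "text": SYSTEM,
        "cache_control": {"type": "ephemeral"},  # cache this block for reuse
    }],
    messages=[{"role": "user", "content": "In: Charged twice this month\nOut:"}],
)
print(response.content[0].text)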

When to Switch to Fine-Tuning

If you're using 5+ examples and making more than 10,000 requests per day, fine-tuning almost certainly saves money. A fine-tuned model has the examples baked into its weights, so your prompt drops to zero-shot. The math:

  • Few-shot (5 examples × 80 tokens): 400 extra tokens × 10,000 requests = 4M tokens/day
  • Fine-tuned (zero-shot): 0 extra tokens, one-time training cost of ~$5–20

At typical input prices that works out to a few dollars a day in savings, so training pays for itself within days at that volume (sooner on pricier models). One caveat: many providers charge a per-token premium for serving fine-tuned models, so run the numbers for your model before switching.
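A back-of-envelope break-even check; the prices below are assumptions, so plug in your provider's current rates:

EXTRA_TOKENS_PER_REQUEST = 5 * 80   # five examples at ~80 tokens each
REQUESTS_PER_DAY = 10_000
INPUT_PRICE_PER_M = 0.60            # $ per 1M input tokens (assumed)
TRAINING_COST = 20.0                # one-time fine-tuning cost (assumed)

daily_saving = EXTRA_TOKENS_PER_REQUEST * REQUESTS_PER_DAY / 1e6 * INPUT_PRICE_PER_M
print(f"Savings: ${daily_saving:.2f}/day")
print(f"Break-even after {TRAINING_COST / daily_saving:.1f} days")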

Start with zero-shot. Add one example. Measure. Only add more examples if accuracy measurably improves — and compress every example you add.