In development, hitting a token limit means a failed test. In production, it means a broken user experience, a lost customer request, or silent data loss. Building robust token limit handling is essential for any application that relies on LLM APIs. Here are the patterns that work.

The Three Failure Modes

Token limits can bite you in three ways:

  • Input too long: Your prompt exceeds the model's context window. The API returns a 400 error.
  • Output truncated: The model's response hits max_tokens and gets cut off mid-sentence. You get partial, unusable output.
  • Combined overflow: Input + output together exceed the context window. The model starts generating but runs out of space, producing a short or degraded response.
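
The first and third modes can be distinguished with simple arithmetic before the call ever goes out. A minimal sketch (the limit values you plug in are per-model; check your provider's documentation):

```python
def classify_budget(input_tokens, max_output, context_limit):
    """Predict which failure mode a request risks, before sending it."""
    if input_tokens >= context_limit:
        return "input_too_long"      # the API will reject with a 400
    if input_tokens + max_output > context_limit:
        return "combined_overflow"   # generation may run out of space
    return "ok"

# Mode 2 (output truncated at max_tokens) is only visible after the
# call, via finish_reason == "length" — see Pattern 4.
```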

Pattern 1: Pre-flight Token Check

Count tokens before sending the request. If the input is too large, truncate or chunk it before the API call — not after.

import tiktoken
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

class TokenBudgetExceeded(Exception):
    """Raised when the input leaves no room for a useful response."""

def safe_completion(messages, model="gpt-4o", max_output=4096):
    enc = tiktoken.encoding_for_model(model)
    context_limit = 128_000  # GPT-4o context window

    # Count input tokens; the per-message overhead is an approximation
    # (the exact value varies slightly by model family)
    input_tokens = sum(
        len(enc.encode(m["content"])) + 4  # message overhead
        for m in messages
    ) + 3  # reply priming

    available = context_limit - input_tokens
    if available < max_output:
        if available < 100:
            raise TokenBudgetExceeded(
                f"Input uses {input_tokens} tokens, "
                f"only {available} left for output"
            )
        max_output = available  # shrink the output budget to fit

    return client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_output
    )

Pattern 2: Smart Truncation

When input is too long, don't just chop off the end. Different content types need different truncation strategies:

def truncate_to_budget(text, max_tokens, model="gpt-4o",
                       strategy="tail"):
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)

    if len(tokens) <= max_tokens:
        return text  # already within budget

    if strategy == "tail":
        # Keep the end (good for conversations)
        tokens = tokens[-max_tokens:]
    elif strategy == "head":
        # Keep the beginning (good for documents)
        tokens = tokens[:max_tokens]
    elif strategy == "middle_out":
        # Keep start and end, drop the middle
        half = max_tokens // 2
        tokens = tokens[:half] + tokens[-half:]
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")

    return enc.decode(tokens)

For conversations, keep the most recent messages (tail). For documents, keep the beginning which usually has the most important context (head). For code, keep the function signature and the end where the logic concludes (middle_out).
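
That content-type-to-strategy mapping can live in a small dispatch table rather than scattered conditionals. A sketch, where the `content_type` keys are illustrative names for your own categories:

```python
# Hypothetical mapping from content type to truncation strategy
STRATEGY_FOR = {
    "conversation": "tail",        # recent turns matter most
    "document": "head",            # key context is usually up front
    "code": "middle_out",          # keep signature and conclusion
}

def truncate_for(content_type, text, max_tokens):
    strategy = STRATEGY_FOR.get(content_type, "tail")  # default: recency
    return truncate_to_budget(text, max_tokens, strategy=strategy)
```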

Pattern 3: Cascading Model Fallback

When a request is too large for one model (or the call fails on a token limit), fall back through a chain of models, starting cheap and escalating toward larger context windows:

MODEL_CASCADE = [
    {"model": "gpt-4o-mini",    "limit": 128_000,   "cost": "low"},
    {"model": "gpt-4o",         "limit": 128_000,   "cost": "medium"},
    {"model": "gemini-1.5-pro", "limit": 2_000_000, "cost": "medium"},
]

async def completion_with_fallback(messages, input_tokens):
    # call_model, TokenLimitError, and AllModelsFailed are
    # application-defined wrappers around your provider SDKs
    for config in MODEL_CASCADE:
        # Reserve 4,096 tokens of headroom for the output
        if input_tokens < config["limit"] - 4096:
            try:
                return await call_model(
                    config["model"], messages
                )
            except TokenLimitError:
                continue
    raise AllModelsFailed("Input too large for all models")
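
The selection step is pure logic and worth testing in isolation, separate from the network call. A sketch of the same filtering under the same 4,096-token output reserve:

```python
MODEL_CASCADE = [
    {"model": "gpt-4o-mini",    "limit": 128_000},
    {"model": "gpt-4o",         "limit": 128_000},
    {"model": "gemini-1.5-pro", "limit": 2_000_000},
]

def candidate_models(input_tokens, reserve=4096):
    """Models, in cascade order, that can fit the input plus headroom."""
    return [c["model"] for c in MODEL_CASCADE
            if input_tokens < c["limit"] - reserve]
```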

Pattern 4: Detect Truncated Output

Always check the finish_reason in the API response. If it's "length" instead of "stop", the output was cut off:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=500
)

choice = response.choices[0]
if choice.finish_reason == "length":
    # Output was truncated — handle it
    logger.warning("Response truncated",
        extra={"usage": response.usage})

    # Option A: Retry with higher max_tokens
    # Option B: Ask model to continue
    # Option C: Return partial result with warning
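
Option B deserves a closer look, since it salvages the tokens you already paid for. One way to build the continuation request, with illustrative wording for the instruction:

```python
def build_continuation(messages, partial_text):
    """Ask the model to resume where the truncated reply ended.

    Appends the cut-off output as an assistant turn plus a short
    user instruction; the instruction text is an illustrative choice.
    """
    return messages + [
        {"role": "assistant", "content": partial_text},
        {"role": "user",
         "content": "Continue exactly where you left off, "
                    "without repeating anything."},
    ]
```

Note that stitched continuations can still duplicate or drop a few words at the seam, so validate the joined output before returning it.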

Pattern 5: Token Budget Middleware

In production systems, wrap your LLM calls in middleware that enforces budgets and logs usage:

from datetime import date

class DailyBudgetExceeded(Exception):
    pass

class TokenBudgetMiddleware:
    def __init__(self, client, daily_limit=10_000_000):
        self.client = client          # your async LLM client wrapper
        self.daily_limit = daily_limit
        self.used_today = 0
        self.window_start = date.today()

    async def call(self, messages, **kwargs):
        # Reset the counter when the day rolls over
        if date.today() != self.window_start:
            self.window_start = date.today()
            self.used_today = 0

        input_tokens = self.count_tokens(messages)  # see Pattern 1

        if self.used_today + input_tokens > self.daily_limit:
            raise DailyBudgetExceeded(
                f"Used {self.used_today:,} of "
                f"{self.daily_limit:,} daily tokens"
            )

        response = await self.client.create(
            messages=messages, **kwargs
        )
        self.used_today += response.usage.total_tokens
        return response
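
For the logging half, even a small in-memory ledger gives you per-caller visibility to start with; the field names here are illustrative, and a real system would persist to your metrics store:

```python
from collections import defaultdict

class UsageLedger:
    """Accumulate token usage per caller for monitoring and alerting."""

    def __init__(self):
        self.totals = defaultdict(int)

    def record(self, caller, prompt_tokens, completion_tokens):
        self.totals[caller] += prompt_tokens + completion_tokens

    def top_callers(self, n=5):
        return sorted(self.totals.items(),
                      key=lambda kv: kv[1], reverse=True)[:n]
```

Feeding `response.usage` into a ledger like this after every call makes it easy to spot which feature is burning the daily budget.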

Key Takeaways

  • Always count tokens before the API call, not after
  • Check finish_reason on every response to catch truncation
  • Use strategy-appropriate truncation, not blind character slicing
  • Build model fallback chains for handling oversized inputs
  • Log token usage per request for monitoring and alerting

The best production systems never let a token limit error reach the user. They anticipate, adapt, and degrade gracefully.