Once your AI application moves past prototyping, you need a system to track token usage, enforce spending limits, and route requests to the most cost-effective model. Without it, a single runaway feature or a spike in traffic can blow through your monthly budget in hours. Here's how to architect a token budget system from the ground up.

Core Architecture

A token budget system has four components:

  • Usage Tracker: Records every token consumed, broken down by user, feature, and model
  • Budget Enforcer: Checks limits before each request and rejects calls that would exceed the budget
  • Model Router: Selects the cheapest model that can handle the request
  • Dashboard: Provides visibility into spending patterns and alerts

1. Usage Tracker

Every LLM call should pass through a wrapper that logs token usage:

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class TokenUsageRecord:
    timestamp: datetime
    user_id: str
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

class UsageTracker:
    # Pricing per million tokens
    PRICING = {
        "gpt-4o":      {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "claude-sonnet":{"input": 3.00, "output": 15.00},
    }

    def __init__(self, storage):
        self.storage = storage  # DB, Redis, etc.

    def record(self, user_id, feature, model,
               input_tokens, output_tokens):
        pricing = self.PRICING[model]
        cost = (
            input_tokens * pricing["input"] / 1_000_000 +
            output_tokens * pricing["output"] / 1_000_000
        )
        record = TokenUsageRecord(
            timestamp=datetime.now(timezone.utc),
            user_id=user_id,
            feature=feature,
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost_usd=cost,
        )
        self.storage.insert(record)
        return record
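
The tracker only needs a narrow storage interface: insert a record, and sum the current day's spend filtered by user or feature. Here's a minimal in-memory sketch of that interface (InMemoryUsageStore is an illustrative stand-in; in production this would be a database table, Redis keys, or a time-series store):

from datetime import datetime, timezone

class InMemoryUsageStore:
    """Illustrative backend; a real system would persist to a database."""

    def __init__(self):
        self.records = []

    def insert(self, record):
        self.records.append(record)

    def get_daily_spend(self, user_id=None, feature=None):
        """Sum today's cost_usd, optionally filtered by user or feature."""
        today = datetime.now(timezone.utc).date()
        return sum(
            r.cost_usd for r in self.records
            if r.timestamp.date() == today
            and (user_id is None or r.user_id == user_id)
            and (feature is None or r.feature == feature)
        )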

2. Budget Enforcer

Check budgets at multiple levels — per-user, per-feature, and global — before allowing a request:

class BudgetExceeded(Exception):
    """Raised when a request would push spending past a configured limit."""


class BudgetEnforcer:
    def __init__(self, storage):
        self.storage = storage
        self.limits = {
            "user_daily_usd": 5.00,
            "feature_daily_usd": 100.00,
            "global_daily_usd": 500.00,
        }

    def check(self, user_id, feature, estimated_cost):
        """Raise if any budget would be exceeded."""
        # Per-user daily limit
        user_spent = self.storage.get_daily_spend(
            user_id=user_id
        )
        if user_spent + estimated_cost > self.limits["user_daily_usd"]:
            raise BudgetExceeded(
                f"User {user_id} daily limit reached: "
                f"${user_spent:.2f} / "
                f"${self.limits['user_daily_usd']:.2f}"
            )

        # Per-feature daily limit
        feature_spent = self.storage.get_daily_spend(
            feature=feature
        )
        if feature_spent + estimated_cost > self.limits["feature_daily_usd"]:
            raise BudgetExceeded(
                f"Feature '{feature}' daily limit reached"
            )

        # Global daily limit
        global_spent = self.storage.get_daily_spend()
        if global_spent + estimated_cost > self.limits["global_daily_usd"]:
            raise BudgetExceeded("Global daily budget exceeded")

3. Model Router

Route each request to the cheapest model that meets the quality requirements:

class NoSuitableModel(Exception):
    """Raised when no available model can handle the requested task."""


class ModelRouter:
    # cost_per_1k is the combined input + output price per 1K tokens,
    # used only as a rough figure for ranking models by cost.
    MODELS = [
        {
            "name": "gpt-4o-mini",
            "cost_per_1k": 0.00015 + 0.0006,
            "max_context": 128_000,
            "capabilities": ["classification", "extraction",
                           "summarization", "simple_qa"],
        },
        {
            "name": "gpt-4o",
            "cost_per_1k": 0.0025 + 0.01,
            "max_context": 128_000,
            "capabilities": ["classification", "extraction",
                           "summarization", "simple_qa",
                           "complex_reasoning", "code_gen"],
        },
        {
            "name": "claude-sonnet",
            "cost_per_1k": 0.003 + 0.015,
            "max_context": 200_000,
            "capabilities": ["classification", "extraction",
                           "summarization", "simple_qa",
                           "complex_reasoning", "code_gen",
                           "long_context"],
        },
    ]

    def select(self, task_type, input_tokens):
        """Pick the cheapest model that can handle this task."""
        candidates = [
            m for m in self.MODELS
            if task_type in m["capabilities"]
            # Leave ~4K tokens of headroom for the model's response
            and input_tokens < m["max_context"] - 4096
        ]
        if not candidates:
            raise NoSuitableModel(
                f"No model supports '{task_type}' "
                f"with {input_tokens} tokens"
            )
        # Sort by cost, pick cheapest
        return min(candidates, key=lambda m: m["cost_per_1k"])
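
For example, a short classification request lands on the cheapest model, while a long-context task is constrained by the available context windows:

router = ModelRouter()

# All three models can classify, so the cheapest one wins.
router.select("classification", 2_000)["name"]    # -> "gpt-4o-mini"

# Only claude-sonnet advertises long_context and a 200K window.
router.select("long_context", 150_000)["name"]    # -> "claude-sonnet"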

4. Putting It Together

Wrap everything into a single gateway that your application calls instead of the raw API:

class AIGateway:
    def __init__(self, storage):
        self.tracker = UsageTracker(storage)
        self.enforcer = BudgetEnforcer(storage)
        self.router = ModelRouter()

    async def complete(self, user_id, feature, task_type,
                       messages, **kwargs):
        # 1. Count input tokens
        input_tokens = count_tokens(messages)

        # 2. Route to cheapest suitable model
        model_config = self.router.select(
            task_type, input_tokens
        )

        # 3. Estimate cost and check budget
        estimated_cost = estimate_cost(
            model_config, input_tokens, max_output=4096
        )
        self.enforcer.check(user_id, feature, estimated_cost)

        # 4. Make the API call
        response = await call_llm(
            model_config["name"], messages, **kwargs
        )

        # 5. Record actual usage
        self.tracker.record(
            user_id, feature, model_config["name"],
            response.usage.prompt_tokens,
            response.usage.completion_tokens,
        )

        return response
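
The gateway relies on two helpers that aren't shown above: count_tokens and estimate_cost. A rough sketch of what they could look like, assuming the tiktoken library for tokenization and reusing the router's combined per-1K rate for a deliberately conservative estimate (both are illustrative, not a fixed API):

import tiktoken

def count_tokens(messages, encoding_name="cl100k_base"):
    """Approximate the prompt size; the exact encoding depends on the model."""
    encoding = tiktoken.get_encoding(encoding_name)
    return sum(len(encoding.encode(m["content"])) for m in messages)

def estimate_cost(model_config, input_tokens, max_output):
    """Worst-case estimate: applies the combined per-1K rate to all tokens,
    so it slightly overestimates, which is the safe direction for budgeting."""
    return (input_tokens + max_output) / 1_000 * model_config["cost_per_1k"]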

Alerting and Monitoring

Set up alerts at key thresholds to catch problems early:

  • 50% of daily budget: Informational alert — check if usage is on track
  • 80% of daily budget: Warning — investigate if this is expected
  • Single request > $1: Immediate alert — likely a bug or abuse
  • Per-user spike: Alert when a user's hourly usage exceeds 10x their average

Store usage data in a time-series database or append-only log. You'll want to query by user, feature, model, and time range for cost attribution and optimization.
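
A simple background job (or a post-call hook in the gateway) can evaluate those thresholds against the recorded spend. A sketch, assuming the storage interface from the tracker section and a hypothetical send_alert helper that fans out to email, Slack, or PagerDuty:

def check_daily_alerts(storage, daily_budget_usd=500.00):
    """Fire the 50% / 80% alerts described above based on today's spend."""
    spent = storage.get_daily_spend()
    if spent >= 0.8 * daily_budget_usd:
        send_alert("warning", f"80% of daily budget used: ${spent:.2f}")
    elif spent >= 0.5 * daily_budget_usd:
        send_alert("info", f"50% of daily budget used: ${spent:.2f}")

def check_request_alert(record, single_request_limit_usd=1.00):
    """Flag any single call expensive enough to suggest a bug or abuse."""
    if record.cost_usd > single_request_limit_usd:
        send_alert("critical",
                   f"Request cost ${record.cost_usd:.2f} for user "
                   f"{record.user_id} ({record.feature})")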

Key Design Decisions

  • Pre-check vs post-check: Always check budgets before the API call. Post-check only catches overages after you've already spent the money.
  • Estimated vs actual cost: Use estimated cost for budget checks (fast), record actual cost from the API response (accurate).
  • Graceful degradation: When a budget is exceeded, don't just return an error. Downgrade to a cheaper model, reduce max_tokens, or queue the request for later, as sketched below.
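
One way to implement that degradation around the gateway, sketched under the same assumptions as the rest of this post (the retry queue and the fallback order are illustrative):

import asyncio

request_queue = asyncio.Queue()  # hypothetical queue for deferred requests

async def complete_with_fallback(gateway, user_id, feature,
                                 task_type, messages):
    try:
        return await gateway.complete(user_id, feature, task_type, messages)
    except BudgetExceeded:
        try:
            # Fallback 1: route as a simple task so the cheapest model is
            # selected, and cap the output to limit actual spend.
            return await gateway.complete(user_id, feature, "simple_qa",
                                          messages, max_tokens=512)
        except BudgetExceeded:
            # Fallback 2: defer the request to the next budget window.
            await request_queue.put((user_id, feature, task_type, messages))
            return None
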
A token budget system isn't optional at scale — it's the difference between a predictable $500/month bill and a surprise $5,000 invoice.