When you send an image to GPT-4o, Claude, or Gemini, the model doesn't "see" pixels directly. The image gets converted into tokens — and the number of tokens depends on the image resolution, the model's tiling strategy, and the detail level you choose. A single high-resolution image can cost as many tokens as several pages of text.
GPT-4o: Tile-Based Tokenization
OpenAI uses a tile system for image tokenization. The image is divided into 512×512 pixel tiles, and each tile costs a fixed number of tokens:
- Low detail: The image is resized to fit 512×512. Fixed cost of 85 tokens regardless of original size.
- High detail: The image is first scaled to fit within a 2048×2048 square (if larger), then scaled so its shortest side is 768px (downscaling only — images are never upscaled). It's then divided into 512×512 tiles. Each tile costs 170 tokens, plus a base of 85 tokens.
The formula for high detail: 85 + 170 × (number of tiles)
Examples of high-detail token costs:
- 512×512 (1 tile): 85 + 170 = 255 tokens
- 1024×1024 (4 tiles): 85 + 680 = 765 tokens
- 2048×1024 (scaled to 1536×768, 6 tiles): 85 + 1,020 = 1,105 tokens
- 2048×2048 (scaled down to 768×768, 4 tiles): 85 + 680 = 765 tokens
Note the last case: because the shortest side is scaled down to 768px, a large square image costs no more than a 1024×1024 one — it just loses detail in the downscale.
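The scaling and tiling rules above can be sketched as a small estimator. This is an approximation based on the rules as stated; OpenAI's exact internal rounding may differ:

```python
import math

def gpt4o_image_tokens(width, height, detail="high"):
    """Estimate GPT-4o image token cost from the published tiling rules."""
    if detail == "low":
        return 85
    # Step 1: scale to fit within a 2048x2048 square (downscale only).
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Step 2: scale so the shortest side is 768px (downscale only).
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # Step 3: count 512x512 tiles at 170 tokens each, plus the 85-token base.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles
```

For example, `gpt4o_image_tokens(1024, 1024)` returns 765, matching the table above.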
You control the detail level in the API call:
```python
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What's in this image?"},
        {
            "type": "image_url",
            "image_url": {
                "url": "https://example.com/photo.jpg",
                "detail": "low"  # or "high" or "auto"
            }
        }
    ]
}
```
Claude: Resolution-Based Calculation
Anthropic calculates image tokens based on the image dimensions after resizing. Claude resizes images so neither dimension exceeds 1568px, then calculates tokens:
tokens = (width × height) / 750
Examples:
- 200×200: ~53 tokens
- 1000×1000: ~1,334 tokens
- 1568×1568: ~3,279 tokens (maximum)
Claude doesn't have a "low detail" mode — every image is processed at its full (resized) resolution. The maximum cost per image is about 3,300 tokens.
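The same estimate can be coded up. The 1568px cap and the divide-by-750 formula come from the section above; the exact rounding Anthropic applies is an assumption here, so treat results as approximate:

```python
import math

def claude_image_tokens(width, height, max_side=1568):
    """Approximate Claude image tokens: downscale so the longest side
    is at most 1568px, then apply (width * height) / 750."""
    scale = min(1.0, max_side / max(width, height))
    # Rounding mode is an assumption; Anthropic documents only the ratio.
    return round(width * scale * height * scale / 750)
```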
Gemini: Token-Efficient Vision
Google's Gemini models use a fixed token cost per image that's independent of resolution:
- Gemini 1.5 Flash / Pro: 258 tokens per image (fixed)
This makes Gemini the most predictable for budgeting — every image costs the same regardless of size. However, very high-resolution images may lose detail since the model processes them at a fixed internal resolution.
Cost Comparison: One Image
For a 1024×1024 image:
- GPT-4o (high detail): 765 tokens × $2.50/M = $0.0019
- GPT-4o (low detail): 85 tokens × $2.50/M = $0.0002
- Claude 3.5 Sonnet: ~1,334 tokens × $3.00/M = $0.004
- Gemini 1.5 Pro: 258 tokens × $1.25/M = $0.0003
At scale — say 100,000 images per day — the difference between low and high detail on GPT-4o is roughly $21/day vs $191/day.
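The arithmetic behind those daily figures is just tokens × price, scaled by volume. A minimal sketch (prices per million input tokens as quoted above):

```python
def image_cost_usd(tokens, usd_per_million_tokens):
    """Dollar cost of a single image given its token count and input price."""
    return tokens * usd_per_million_tokens / 1_000_000

# Daily cost for 100,000 images on GPT-4o at $2.50/M input tokens:
low_daily = 100_000 * image_cost_usd(85, 2.50)    # low detail
high_daily = 100_000 * image_cost_usd(765, 2.50)  # high detail, 1024x1024
```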
Optimization Tips
1. Use Low Detail When Possible
For tasks like image classification, detecting objects, or reading large text in images, low detail (85 tokens) is often sufficient. Reserve high detail for tasks requiring fine-grained analysis like reading small text, identifying subtle defects, or analyzing detailed charts.
2. Resize Before Sending
The API resizes images internally, but you're still uploading the full file. Resize client-side to save bandwidth and ensure you know the exact token cost:
```python
from PIL import Image

def optimize_for_vision(image_path, max_size=1024):
    img = Image.open(image_path)
    # thumbnail() resizes in place, preserving aspect ratio.
    img.thumbnail((max_size, max_size))
    # Convert to RGB so PNGs with alpha channels can be saved as JPEG.
    img.convert("RGB").save("optimized.jpg", quality=85)
    return "optimized.jpg"
```
3. Crop to the Region of Interest
If you only need the model to analyze part of an image, crop it first. At high detail, a 2048×2048 screenshot is downscaled to 768×768 before tiling (765 tokens), blurring fine detail in the process; cropping to the relevant 512×512 region keeps that region at full resolution and drops the cost to 255 tokens — a roughly two-thirds reduction.
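A crop helper with Pillow, in the same style as the resize snippet above (the function name and output path are illustrative):

```python
from PIL import Image

def crop_for_vision(image_path, box, out_path="cropped.jpg"):
    """Crop to a (left, upper, right, lower) box so only the relevant
    region is tokenized, at its original resolution."""
    img = Image.open(image_path).crop(box)
    img.convert("RGB").save(out_path, quality=85)
    return out_path
```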
4. Batch Multiple Small Images
If you're analyzing many small images (thumbnails, icons), consider compositing them into a single larger image with labels. One 1024×1024 composite of 16 thumbnails costs 765 tokens. Sending them individually at low detail would cost 16 × 85 = 1,360 tokens.
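A Pillow sketch of the compositing idea — the cell size and grid layout are assumptions, and you'd likely want to draw index labels on each cell so the model can refer to individual thumbnails:

```python
import math
from PIL import Image

def composite_grid(paths, cell=256, cols=4):
    """Paste many small images into one grid so they're billed as a
    single image instead of one per-image charge per thumbnail."""
    rows = math.ceil(len(paths) / cols)
    canvas = Image.new("RGB", (cols * cell, rows * cell), "white")
    for i, path in enumerate(paths):
        thumb = Image.open(path).convert("RGB")
        thumb.thumbnail((cell, cell))  # fit each thumbnail into its cell
        canvas.paste(thumb, ((i % cols) * cell, (i // cols) * cell))
    return canvas
```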
Image tokens add up fast. Always use the lowest detail level that works for your task, resize before uploading, and crop to the region that matters.