When you send an image to GPT-4o, Claude, or Gemini, the model doesn't "see" pixels directly. The image gets converted into tokens — and the number of tokens depends on the image resolution, the model's tiling strategy, and the detail level you choose. A single high-resolution image can cost as many tokens as several pages of text.
GPT-4o: Tile-Based Tokenization
OpenAI uses a tile system for image tokenization. The image is divided into 512×512 pixel tiles, and each tile costs a fixed number of tokens:
- Low detail: The image is resized to fit 512×512. Fixed cost of 85 tokens regardless of original size.
- High detail: The image is first scaled to fit within a 2048×2048 square (if larger), then scaled so its shortest side is 768px (downscaling only — images are never upscaled). It's then divided into 512×512 tiles. Each tile costs 170 tokens, plus a base of 85 tokens.
The formula for high detail: 85 + 170 × (number of tiles)
Examples of high-detail token costs:
- 512×512 (1 tile): 85 + 170 = 255 tokens
- 1024×1024 (4 tiles): 85 + 680 = 765 tokens
- 2048×1024 (scaled to 1536×768, 6 tiles): 85 + 1,020 = 1,105 tokens
- 2048×2048 (scaled down to 768×768, 4 tiles): 85 + 680 = 765 tokens
Note the last case: because the shortest side is scaled down to 768px, a large square image costs no more than a 1024×1024 one — it just loses detail in the downscale.
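The scaling and tiling rules above can be sketched as a small estimator. This is an approximation based on the rules as stated; OpenAI's exact internal rounding may differ:

```python
import math

def gpt4o_image_tokens(width, height, detail="high"):
    """Estimate GPT-4o image token cost from the published tiling rules."""
    if detail == "low":
        return 85
    # Step 1: scale to fit within a 2048x2048 square (downscale only).
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Step 2: scale so the shortest side is 768px (downscale only).
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # Step 3: count 512x512 tiles at 170 tokens each, plus the 85-token base.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles
```

For example, `gpt4o_image_tokens(1024, 1024)` returns 765, matching the table above.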
You control the detail level in the API call:
```python
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What's in this image?"},
        {
            "type": "image_url",
            "image_url": {
                "url": "https://example.com/photo.jpg",
                "detail": "low"  # or "high" or "auto"
            }
        }
    ]
}
```
Claude: Resolution-Based Calculation
Anthropic calculates image tokens based on the image dimensions after resizing. Claude resizes images so neither dimension exceeds 1568px, then calculates tokens:
tokens = (width × height) / 750
Examples:
- 200×200: ~53 tokens
- 1000×1000: ~1,334 tokens
- 1568×1568: ~3,279 tokens (maximum)
Claude doesn't have a "low detail" mode — every image is processed at its full (resized) resolution. The maximum cost per image is about 3,300 tokens.
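The same estimate can be coded up. The 1568px cap and the divide-by-750 formula come from the section above; the exact rounding Anthropic applies is an assumption here, so treat results as approximate:

```python
import math

def claude_image_tokens(width, height, max_side=1568):
    """Approximate Claude image tokens: downscale so the longest side
    is at most 1568px, then apply (width * height) / 750."""
    scale = min(1.0, max_side / max(width, height))
    # Rounding mode is an assumption; Anthropic documents only the ratio.
    return round(width * scale * height * scale / 750)
```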
Gemini: Token-Efficient Vision
Google's Gemini models use a fixed token cost per image that's independent of resolution:
- Gemini 1.5 Flash / Pro: 258 tokens per image (fixed)
This makes Gemini the most predictable for budgeting — every image costs the same regardless of size. However, very high-resolution images may lose detail since the model processes them at a fixed internal resolution.
Cost Comparison: One Image
For a 1024×1024 image:
- GPT-4o (high detail): 765 tokens × $2.50/M = $0.0019
- GPT-4o (low detail): 85 tokens × $2.50/M = $0.0002
- Claude 3.5 Sonnet: ~1,334 tokens × $3.00/M = $0.004
- Gemini 1.5 Pro: 258 tokens × $1.25/M = $0.0003
At scale — say 100,000 images per day — the difference between low and high detail on GPT-4o is roughly $21/day vs $191/day.
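The arithmetic behind those daily figures is just tokens × price, scaled by volume. A minimal sketch (prices per million input tokens as quoted above):

```python
def image_cost_usd(tokens, usd_per_million_tokens):
    """Dollar cost of a single image given its token count and input price."""
    return tokens * usd_per_million_tokens / 1_000_000

# Daily cost for 100,000 images on GPT-4o at $2.50/M input tokens:
low_daily = 100_000 * image_cost_usd(85, 2.50)    # low detail
high_daily = 100_000 * image_cost_usd(765, 2.50)  # high detail, 1024x1024
```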
Optimization Tips
1. Use Low Detail When Possible
For tasks like image classification, detecting objects, or reading large text in images, low detail (85 tokens) is often sufficient. Reserve high detail for tasks requiring fine-grained analysis like reading small text, identifying subtle defects, or analyzing detailed charts.
2. Resize Before Sending
The API resizes images internally, but you're still uploading the full file. Resize client-side to save bandwidth and ensure you know the exact token cost:
```python
from PIL import Image

def optimize_for_vision(image_path, max_size=1024):
    img = Image.open(image_path)
    # thumbnail() resizes in place, preserving aspect ratio.
    img.thumbnail((max_size, max_size))
    # Convert to RGB so PNGs with alpha channels can be saved as JPEG.
    img.convert("RGB").save("optimized.jpg", quality=85)
    return "optimized.jpg"
```
3. Crop to the Region of Interest
If you only need the model to analyze part of an image, crop it first. At high detail, a 2048×2048 screenshot is downscaled to 768×768 before tiling (765 tokens), blurring fine detail in the process; cropping to the relevant 512×512 region keeps that region at full resolution and drops the cost to 255 tokens — a roughly two-thirds reduction.
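A crop helper with Pillow, in the same style as the resize snippet above (the function name and output path are illustrative):

```python
from PIL import Image

def crop_for_vision(image_path, box, out_path="cropped.jpg"):
    """Crop to a (left, upper, right, lower) box so only the relevant
    region is tokenized, at its original resolution."""
    img = Image.open(image_path).crop(box)
    img.convert("RGB").save(out_path, quality=85)
    return out_path
```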
4. Batch Multiple Small Images
If you're analyzing many small images (thumbnails, icons), consider compositing them into a single larger image with labels. One 1024×1024 composite of 16 thumbnails costs 765 tokens. Sending them individually at low detail would cost 16 × 85 = 1,360 tokens.
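A Pillow sketch of the compositing idea — the cell size and grid layout are assumptions, and you'd likely want to draw index labels on each cell so the model can refer to individual thumbnails:

```python
import math
from PIL import Image

def composite_grid(paths, cell=256, cols=4):
    """Paste many small images into one grid so they're billed as a
    single image instead of one per-image charge per thumbnail."""
    rows = math.ceil(len(paths) / cols)
    canvas = Image.new("RGB", (cols * cell, rows * cell), "white")
    for i, path in enumerate(paths):
        thumb = Image.open(path).convert("RGB")
        thumb.thumbnail((cell, cell))  # fit each thumbnail into its cell
        canvas.paste(thumb, ((i % cols) * cell, (i // cols) * cell))
    return canvas
```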
Image tokens add up fast. Always use the lowest detail level that works for your task, resize before uploading, and crop to the region that matters.