JSON vs YAML vs XML: Which Format Uses Fewer Tokens?

When you pass structured data inside an LLM prompt, the format you choose directly affects your token count. JSON, YAML, and XML can all represent the same information, but they tokenize very differently. In the example in this article, switching from JSON to YAML cuts token usage by about 40% with no loss of information; the exact saving depends on the data and the tokenizer.

A Real Comparison

Let's encode the same data (two users, each with a name, a role, and an active flag) in all three formats and count the tokens using GPT-4o's tokenizer (o200k_base).

JSON (38 tokens)

{
  "users": [
    {"name": "Alice", "role": "admin", "active": true},
    {"name": "Bob", "role": "editor", "active": false}
  ]
}

YAML (23 tokens)

users:
  - name: Alice
    role: admin
    active: true
  - name: Bob
    role: editor
    active: false

XML (53 tokens)

<users>
  <user>
    <name>Alice</name>
    <role>admin</role>
    <active>true</active>
  </user>
  <user>
    <name>Bob</name>
    <role>editor</role>
    <active>false</active>
  </user>
</users>

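The results are clear: YAML uses ~40% fewer tokens than JSON (23 vs. 38), and XML uses ~40% more (53 vs. 38). Even in this two-record example, XML costs about 2.3x as many tokens as YAML. Because each format repeats its syntax overhead for every record, the absolute gap keeps growing as the data grows.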
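The percentages can be checked directly from the three counts reported above. The short sketch below only does the arithmetic; it does not run a tokenizer, and the helper name percent_change is just an illustrative choice.

```python
# Token counts as reported in the comparison above (o200k_base).
counts = {"JSON": 38, "YAML": 23, "XML": 53}


def percent_change(new: int, base: int) -> float:
    """Signed percentage difference of `new` relative to `base`."""
    return (new - base) / base * 100


yaml_vs_json = percent_change(counts["YAML"], counts["JSON"])
xml_vs_json = percent_change(counts["XML"], counts["JSON"])
xml_over_yaml = counts["XML"] / counts["YAML"]

print(f"YAML vs JSON: {yaml_vs_json:+.1f}%")   # -39.5%
print(f"XML vs JSON:  {xml_vs_json:+.1f}%")    # +39.5%
print(f"XML / YAML:   {xml_over_yaml:.2f}x")   # 2.30x
```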

Why the Difference?

The token cost comes down to syntax overhead:

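JSON requires quotes around every key and string value, plus braces, brackets, and commas. This punctuation adds tokens, although the tokenizer may merge some adjacent characters into a single token.
YAML uses indentation and colons instead of delimiters. Simple strings need no quotes; quotes are only required for values YAML would otherwise read as another type (such as a string that looks like true, null, or 123) or that contain a colon followed by a space. Fewer special characters generally means fewer tokens.
XML repeats every tag name twice (opening and closing), and the angle brackets and closing slash add further tokens. A field like <name>Alice</name> uses 7 tokens where YAML's name: Alice uses 3.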
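To make the overhead argument concrete, here is a rough sketch that splits one field into "pieces": each run of letters or digits is one piece, and each punctuation mark is one piece. This is a crude proxy and not a real tokenizer (an assumption made purely for illustration), so its numbers will not match o200k_base exactly, but it shows where the extra pieces come from.

```python
import re

# One field, written the way each format spells it.
snippets = {
    "json": '"name": "Alice"',
    "yaml": "name: Alice",
    "xml": "<name>Alice</name>",
}


def split_pieces(text: str) -> list[str]:
    """Crude proxy, NOT a real tokenizer.

    Every run of word characters is one piece and every punctuation
    mark is one piece; whitespace is ignored.
    """
    return re.findall(r"\w+|[^\w\s]", text)


piece_counts = {fmt: len(split_pieces(text)) for fmt, text in snippets.items()}

for fmt, count in piece_counts.items():
    print(f"{fmt}: {count} pieces -> {split_pieces(snippets[fmt])}")
```

Under this proxy the field costs 7 pieces in JSON, 3 in YAML, and 8 in XML. A real BPE tokenizer merges some adjacent characters, which is why measured token counts can come out lower than the proxy suggests.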

The Even Cheaper Option: CSV

For tabular data, CSV beats all three formats:

name,role,active
Alice,admin,true
Bob,editor,false

This encodes the same data in roughly 14 tokens — 63% fewer than JSON. The tradeoff is that CSV can't represent nested structures, so it only works for flat data.
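As a sketch of how the conversion might look in practice, the snippet below writes the two user records as CSV with Python's standard csv module and refuses nested values, which CSV cannot represent. The function name to_csv and the choice to lowercase booleans (to match the listing above) are assumptions of this example.

```python
import csv
import io

users = [
    {"name": "Alice", "role": "admin", "active": True},
    {"name": "Bob", "role": "editor", "active": False},
]


def to_csv(rows: list[dict]) -> str:
    """Serialize a list of flat dicts to CSV text; reject nested values."""
    for row in rows:
        for key, value in row.items():
            if isinstance(value, (dict, list)):
                raise ValueError(f"field {key!r} is nested; CSV cannot represent it")

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Write booleans as true/false rather than Python's True/False.
        writer.writerow(
            {k: str(v).lower() if isinstance(v, bool) else v for k, v in row.items()}
        )
    return buffer.getvalue()


csv_text = to_csv(users)
print(csv_text)
```

The header row is written once, so field names are not repeated for every record the way they are in JSON, YAML, and XML.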

When to Use Each Format

The best format depends on your use case:

YAML: Best for passing structured data in prompts where you control the format. Lowest token cost among the formats that support full nesting.
JSON: Best when you need the model to output structured data. Models tend to generate valid JSON more reliably than valid YAML, and most downstream APIs and parsers expect JSON.
CSV: Best for flat, tabular data like lists of products, users, or log entries. Minimal token overhead.
XML: Avoid in prompts unless the model specifically needs XML context (e.g., processing SOAP APIs or HTML). The token cost is rarely justified.

Practical Tip: Mixed Strategy

Use YAML or CSV for input data in your prompts, and ask the model to respond in JSON. This gives you the cheapest input tokens (which you pay for on every request) while getting reliably parseable output.

# System prompt
Respond in JSON with fields: name, summary, score.

# User data (YAML — cheap input)
products:
  - name: Widget Pro
    reviews: 4.5 stars, 230 reviews
    price: $29.99
  - name: Gadget Plus
    reviews: 3.8 stars, 89 reviews
    price: $49.99
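A minimal sketch of the mixed strategy in code, under stated assumptions: the YAML emitter to_simple_yaml is hand-rolled for illustration and handles only flat records with plain scalar values (no quoting or escaping), the helper parse_reply is hypothetical, and the model reply is a hard-coded string because no API is called here.

```python
import json


def to_simple_yaml(key: str, rows: list[dict]) -> str:
    """Emit a list of flat dicts as a YAML sequence under `key`.

    Illustration only: values are written as-is, with no quoting or escaping.
    """
    lines = [f"{key}:"]
    for row in rows:
        for i, (field, value) in enumerate(row.items()):
            prefix = "  - " if i == 0 else "    "
            lines.append(f"{prefix}{field}: {value}")
    return "\n".join(lines)


def parse_reply(text: str, required=("name", "summary", "score")) -> dict:
    """Parse the model's JSON reply and check that the requested fields exist."""
    data = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    missing = [field for field in required if field not in data]
    if missing:
        raise ValueError(f"reply is missing fields: {missing}")
    return data


products = [
    {"name": "Widget Pro", "reviews": "4.5 stars, 230 reviews", "price": "$29.99"},
    {"name": "Gadget Plus", "reviews": "3.8 stars, 89 reviews", "price": "$49.99"},
]

system_prompt = "Respond in JSON with fields: name, summary, score."
user_prompt = to_simple_yaml("products", products)  # cheap YAML input

# Hypothetical model reply, hard-coded for the example.
reply = '{"name": "Widget Pro", "summary": "Better reviewed and cheaper.", "score": 4.5}'
result = parse_reply(reply)

print(user_prompt)
print(result["name"], result["score"])
```

Validating the reply right after parsing keeps a malformed or incomplete response from propagating further into the application.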

Rule of thumb: YAML for input, JSON for output, CSV for flat data. Avoid XML unless you have a specific reason.