When documents exceed a model's context window — or when you're building a RAG (Retrieval-Augmented Generation) pipeline — you need to split text into smaller pieces called chunks. The chunking strategy you choose directly affects retrieval quality, token efficiency, and the accuracy of your AI's responses.
Why Chunking Matters
Poor chunking leads to two problems:
- Chunks too large: Waste tokens on irrelevant content, dilute the embedding signal, and may exceed context limits
- Chunks too small: Lose context, split ideas across boundaries, and require more retrieval calls
The goal is chunks that are self-contained units of meaning — large enough to be useful, small enough to be focused.
Strategy 1: Fixed-Size Chunking
The simplest approach: split text into chunks of a fixed token or character count with optional overlap.
```python
import tiktoken

def fixed_size_chunks(text, chunk_size=500, overlap=50, model="gpt-4o"):
    """Split text into fixed-size token chunks with overlap."""
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunks.append(enc.decode(chunk_tokens))
        start = end - overlap  # step back by `overlap` for context continuity
    return chunks

# Usage
chunks = fixed_size_chunks(long_document, chunk_size=500, overlap=50)
print(f"Created {len(chunks)} chunks")
```
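If tiktoken isn't available, the same sliding-window idea works directly on characters. This is a hypothetical dependency-free variant (the helper name is illustrative, not part of the snippet above):

```python
def fixed_size_char_chunks(text, chunk_size=200, overlap=20):
    """Character-based sliding window: each step advances chunk_size - overlap."""
    stride = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]

parts = fixed_size_char_chunks("a" * 500, chunk_size=200, overlap=20)
print([len(p) for p in parts])  # [200, 200, 140]: two full windows plus a tail
```

Character counts are a rough proxy for tokens (roughly 4 characters per token for English text), so this is fine for prototyping but less predictable for embedding-model limits.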
When to Use Fixed-Size
- Unstructured text without clear section boundaries
- When you need predictable chunk sizes for embedding models
- As a baseline to compare against smarter strategies
Recommended sizes: 200–500 tokens for retrieval, 500–1000 tokens for summarization. Use 10–20% overlap so that a sentence split at a chunk boundary still appears intact in at least one chunk.
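To sanity-check those numbers: with overlap, each loop iteration above advances `chunk_size - overlap` tokens, so the chunk count follows directly. A quick sketch (assumes `overlap < chunk_size`):

```python
import math

def num_fixed_chunks(total_tokens, chunk_size=500, overlap=50):
    """Chunks produced by a sliding window advancing chunk_size - overlap tokens."""
    stride = chunk_size - overlap
    return max(1, math.ceil(total_tokens / stride))

print(num_fixed_chunks(10_000))  # 23 chunks for a 10k-token document
```

Note that overlap inflates the total: those 23 chunks re-embed 22 × 50 ≈ 1,100 duplicated tokens, which is the cost you pay for boundary continuity.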
Strategy 2: Recursive Character Splitting
This strategy tries to split on natural boundaries — paragraphs first, then sentences, then words — falling back to smaller separators only when chunks are still too large.
```python
def recursive_split(text, chunk_size=500, separators=None):
    """Split text recursively on natural boundaries."""
    if separators is None:
        separators = ["\n\n", "\n", ". ", " ", ""]
    sep = separators[0]
    remaining_seps = separators[1:]
    parts = text.split(sep) if sep else list(text)

    chunks = []
    current_chunk = ""
    for part in parts:
        candidate = current_chunk + sep + part if current_chunk else part
        if len(candidate) <= chunk_size:
            current_chunk = candidate
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            # If a single part exceeds chunk_size, split it with finer separators
            if len(part) > chunk_size and remaining_seps:
                chunks.extend(recursive_split(part, chunk_size, remaining_seps))
                current_chunk = ""
            else:
                current_chunk = part
    if current_chunk:
        chunks.append(current_chunk.strip())
    return [c for c in chunks if c]
```
This is the approach used by LangChain's RecursiveCharacterTextSplitter and is a solid default for most use cases.
Strategy 3: Semantic Chunking
Semantic chunking uses embeddings to detect topic shifts and split at natural semantic boundaries. It produces the highest-quality chunks but requires more computation.
```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def get_embedding(text):
    """Get the embedding vector for a text segment."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def semantic_chunks(text, threshold=0.5):
    """Split text at semantic boundaries."""
    # Naive sentence split; a sentence tokenizer (e.g. nltk) handles
    # abbreviations and edge cases better
    sentences = text.replace(". ", ".\n").split("\n")
    sentences = [s.strip() for s in sentences if s.strip()]
    if not sentences:
        return []

    # Get an embedding for each sentence
    embeddings = [get_embedding(s) for s in sentences]

    # OpenAI embeddings are unit-normalized, so the dot product of
    # adjacent vectors is their cosine similarity
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = np.dot(embeddings[i - 1], embeddings[i])
        if similarity < threshold:
            # Semantic break detected: start a new chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
```
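The break-detection step can be exercised offline with stand-in vectors, no API calls needed. This sketch normalizes explicitly, so it also works for embeddings that aren't unit length:

```python
import numpy as np

def split_on_similarity(vectors, threshold=0.5):
    """Group adjacent vectors, starting a new group where cosine similarity drops."""
    groups = [[0]]
    for i in range(1, len(vectors)):
        a = vectors[i - 1] / np.linalg.norm(vectors[i - 1])
        b = vectors[i] / np.linalg.norm(vectors[i])
        if float(np.dot(a, b)) < threshold:
            groups.append([i])      # semantic break: new group
        else:
            groups[-1].append(i)    # same topic: extend current group
    return groups

# Two similar vectors, a sharp topic shift, then two more similar vectors
vecs = [np.array([1.0, 0.0]), np.array([0.9, 0.1]),
        np.array([0.0, 1.0]), np.array([0.1, 0.9])]
print(split_on_similarity(vecs))  # [[0, 1], [2, 3]]
```

Running this kind of offline harness against a labeled sample of your corpus is also the cheapest way to tune `threshold` before paying for embedding calls.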
When to Use Semantic Chunking
- Documents with multiple distinct topics
- When retrieval precision is critical
- When you can afford the extra embedding API calls during ingestion
Choosing the Right Chunk Size
The optimal chunk size depends on your use case:
- Question answering: 200–400 tokens. Smaller chunks mean more precise retrieval.
- Summarization: 500–1,000 tokens. Larger chunks preserve more context per retrieval.
- Code analysis: Split by function or class boundaries rather than token count.
- Legal/medical documents: Split by section headers to preserve document structure.
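For the code-analysis case above, Python's standard ast module can find those function and class boundaries. A minimal sketch (the helper name is illustrative):

```python
import ast

def split_python_by_defs(source):
    """One chunk per top-level function or class in a Python source file."""
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]

code = "def a():\n    return 1\n\nclass B:\n    pass\n"
print(split_python_by_defs(code))  # ['def a():\n    return 1', 'class B:\n    pass']
```

For other languages, a parser such as tree-sitter plays the same role; the principle is identical: let the language's own syntax define the chunk boundaries.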
Overlap: How Much Is Enough?
Overlap prevents information loss at chunk boundaries. A sentence that gets split between two chunks might be missed during retrieval if there's no overlap.
- 10% overlap: Good default for most text
- 20% overlap: Better for dense technical content
- 0% overlap: Acceptable when splitting on natural boundaries (paragraphs, sections)
There's no universally optimal chunk size. Start with 400 tokens and 10% overlap, measure retrieval quality on your actual queries, and adjust from there.