When documents exceed a model's context window — or when you're building a RAG (Retrieval-Augmented Generation) pipeline — you need to split text into smaller pieces called chunks. The chunking strategy you choose directly affects retrieval quality, token efficiency, and the accuracy of your AI's responses.
Why Chunking Matters
Poor chunking leads to two problems:
- Chunks too large: Waste tokens on irrelevant content, dilute the embedding signal, and may exceed context limits
- Chunks too small: Lose context, split ideas across boundaries, and require more retrieval calls
The goal is chunks that are self-contained units of meaning — large enough to be useful, small enough to be focused.
Strategy 1: Fixed-Size Chunking
The simplest approach: split text into chunks of a fixed token or character count with optional overlap.
```python
import tiktoken

def fixed_size_chunks(text, chunk_size=500, overlap=50, model="gpt-4o"):
    """Split text into fixed-size token chunks with overlap."""
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunks.append(enc.decode(chunk_tokens))
        start = end - overlap  # step back by `overlap` for context continuity
    return chunks

# Usage
chunks = fixed_size_chunks(long_document, chunk_size=500, overlap=50)
print(f"Created {len(chunks)} chunks")
```
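If tiktoken isn't available, the same sliding-window idea works directly on characters. This is a hypothetical dependency-free variant (the helper name is illustrative, not part of the snippet above):

```python
def fixed_size_char_chunks(text, chunk_size=200, overlap=20):
    """Character-based sliding window: each step advances chunk_size - overlap."""
    stride = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]

parts = fixed_size_char_chunks("a" * 500, chunk_size=200, overlap=20)
print([len(p) for p in parts])  # [200, 200, 140]: two full windows plus a tail
```

Character counts are a rough proxy for tokens (roughly 4 characters per token for English text), so this is fine for prototyping but less predictable for embedding-model limits.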
When to Use Fixed-Size
- Unstructured text without clear section boundaries
- When you need predictable chunk sizes for embedding models
- As a baseline to compare against smarter strategies
Recommended sizes: 200–500 tokens for retrieval, 500–1000 tokens for summarization. Use 10–20% overlap so that a sentence split at a chunk boundary still appears intact in at least one chunk.
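To sanity-check those numbers: with overlap, each loop iteration above advances `chunk_size - overlap` tokens, so the chunk count follows directly. A quick sketch (assumes `overlap < chunk_size`):

```python
import math

def num_fixed_chunks(total_tokens, chunk_size=500, overlap=50):
    """Chunks produced by a sliding window advancing chunk_size - overlap tokens."""
    stride = chunk_size - overlap
    return max(1, math.ceil(total_tokens / stride))

print(num_fixed_chunks(10_000))  # 23 chunks for a 10k-token document
```

Note that overlap inflates the total: those 23 chunks re-embed 22 × 50 ≈ 1,100 duplicated tokens, which is the cost you pay for boundary continuity.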
Strategy 2: Recursive Character Splitting
This strategy tries to split on natural boundaries — paragraphs first, then sentences, then words — falling back to smaller separators only when chunks are still too large.
```python
def recursive_split(text, chunk_size=500, separators=None):
    """Split text recursively on natural boundaries."""
    if separators is None:
        separators = ["\n\n", "\n", ". ", " ", ""]
    sep = separators[0]
    remaining_seps = separators[1:]
    parts = text.split(sep) if sep else list(text)

    chunks = []
    current_chunk = ""
    for part in parts:
        candidate = current_chunk + sep + part if current_chunk else part
        if len(candidate) <= chunk_size:
            current_chunk = candidate
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            # If a single part exceeds chunk_size, split it with finer separators
            if len(part) > chunk_size and remaining_seps:
                chunks.extend(recursive_split(part, chunk_size, remaining_seps))
                current_chunk = ""
            else:
                current_chunk = part
    if current_chunk:
        chunks.append(current_chunk.strip())
    return [c for c in chunks if c]
```
This is the approach used by LangChain's RecursiveCharacterTextSplitter and is a solid default for most use cases.
Strategy 3: Semantic Chunking
Semantic chunking uses embeddings to detect topic shifts and split at natural semantic boundaries. It produces the highest-quality chunks but requires more computation.
```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def get_embedding(text):
    """Get the embedding vector for a text segment."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def semantic_chunks(text, threshold=0.5):
    """Split text at semantic boundaries."""
    # Naive sentence split; a sentence tokenizer (e.g. nltk) handles
    # abbreviations and edge cases better
    sentences = text.replace(". ", ".\n").split("\n")
    sentences = [s.strip() for s in sentences if s.strip()]
    if not sentences:
        return []

    # Get an embedding for each sentence
    embeddings = [get_embedding(s) for s in sentences]

    # OpenAI embeddings are unit-normalized, so the dot product of
    # adjacent vectors is their cosine similarity
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = np.dot(embeddings[i - 1], embeddings[i])
        if similarity < threshold:
            # Semantic break detected: start a new chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
```
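The break-detection step can be exercised offline with stand-in vectors, no API calls needed. This sketch normalizes explicitly, so it also works for embeddings that aren't unit length:

```python
import numpy as np

def split_on_similarity(vectors, threshold=0.5):
    """Group adjacent vectors, starting a new group where cosine similarity drops."""
    groups = [[0]]
    for i in range(1, len(vectors)):
        a = vectors[i - 1] / np.linalg.norm(vectors[i - 1])
        b = vectors[i] / np.linalg.norm(vectors[i])
        if float(np.dot(a, b)) < threshold:
            groups.append([i])      # semantic break: new group
        else:
            groups[-1].append(i)    # same topic: extend current group
    return groups

# Two similar vectors, a sharp topic shift, then two more similar vectors
vecs = [np.array([1.0, 0.0]), np.array([0.9, 0.1]),
        np.array([0.0, 1.0]), np.array([0.1, 0.9])]
print(split_on_similarity(vecs))  # [[0, 1], [2, 3]]
```

Running this kind of offline harness against a labeled sample of your corpus is also the cheapest way to tune `threshold` before paying for embedding calls.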
When to Use Semantic Chunking
- Documents with multiple distinct topics
- When retrieval precision is critical
- When you can afford the extra embedding API calls during ingestion
Choosing the Right Chunk Size
The optimal chunk size depends on your use case:
- Question answering: 200–400 tokens. Smaller chunks mean more precise retrieval.
- Summarization: 500–1,000 tokens. Larger chunks preserve more context per retrieval.
- Code analysis: Split by function or class boundaries rather than token count.
- Legal/medical documents: Split by section headers to preserve document structure.
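For the code-analysis case above, Python's standard ast module can find those function and class boundaries. A minimal sketch (the helper name is illustrative):

```python
import ast

def split_python_by_defs(source):
    """One chunk per top-level function or class in a Python source file."""
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]

code = "def a():\n    return 1\n\nclass B:\n    pass\n"
print(split_python_by_defs(code))  # ['def a():\n    return 1', 'class B:\n    pass']
```

For other languages, a parser such as tree-sitter plays the same role; the principle is identical: let the language's own syntax define the chunk boundaries.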
Overlap: How Much Is Enough?
Overlap prevents information loss at chunk boundaries. A sentence that gets split between two chunks might be missed during retrieval if there's no overlap.
- 10% overlap: Good default for most text
- 20% overlap: Better for dense technical content
- 0% overlap: Acceptable when splitting on natural boundaries (paragraphs, sections)
There's no universally optimal chunk size. Start with 400 tokens and 10% overlap, measure retrieval quality on your actual queries, and adjust from there.