Chunking Strategies for RAG: Size, Overlap, Methods
Why Chunking Matters
Every RAG query returns a set of chunks, not whole documents. The generator model only sees these chunks, so their quality directly determines answer quality. A chunk that is too small may lack the context needed to be meaningful. A paragraph fragment like "This is achieved through the process described above" is useless without knowing what "the process" refers to. A chunk that is too large may include irrelevant information that dilutes the relevant content and wastes limited context window space.
Chunking also determines retrieval precision. Each chunk gets its own embedding vector, and the retriever matches queries against these vectors. A chunk that covers a single focused topic will produce a specific vector that matches relevant queries well. A chunk that covers multiple topics will produce a blended vector that matches many queries weakly, reducing retrieval precision and producing lower-quality results.
Fixed-Size Chunking
Fixed-size chunking splits documents into segments of a predetermined token count, typically between 256 and 1024 tokens, with an overlap between adjacent chunks. The overlap, usually 10-20% of the chunk size, means that a passage straddling a boundary still appears whole in at least one chunk, provided the passage is no longer than the overlap.
This is the simplest strategy and the best starting point for most projects. It requires no content analysis, produces consistent chunk sizes that are easy to reason about, and works reasonably well across content types. The main drawback is that it ignores document structure entirely. A fixed-size chunker will happily split a paragraph in half, separate a heading from its content, or break a code block in the middle, producing fragments that lack coherent meaning.
Practical recommendations for fixed-size chunking: start with 512 tokens and a 50-token overlap (about 10%). If retrieved chunks are only loosely related to the query, try smaller chunks (256 tokens) to increase precision. If retrieved chunks lack context, try larger chunks (1024 tokens) at the cost of reduced precision. The optimal size depends on your embedding model (which has a maximum input length), your content type, and the types of queries your system handles.
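The sliding window described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name chunk_fixed is hypothetical, and the example feeds it placeholder strings where a real pipeline would pass the output of the embedding model's tokenizer.

```python
def chunk_fixed(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Placeholder tokens; a real system would use the embedding model's tokenizer.
words = [f"w{i}" for i in range(1200)]
chunks = chunk_fixed(words, chunk_size=512, overlap=50)
```

With 1200 tokens this yields three chunks, and the last 50 tokens of each chunk are repeated as the first 50 of the next.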
Semantic Chunking
Semantic chunking uses natural language processing to identify topic boundaries within a document. The approach works by computing the semantic similarity between consecutive sentences or paragraphs. When similarity drops below a threshold, indicating a topic shift, the chunker inserts a break. The result is chunks that are topically coherent, where each chunk covers a single concept or idea.
Semantic chunking tends to produce higher-quality chunks than fixed-size splitting because each chunk has a focused topic that generates a specific embedding vector. This improves retrieval precision, as queries are more likely to match chunks that are genuinely about the queried topic rather than chunks that happen to contain a relevant sentence alongside irrelevant content.
The tradeoffs are variable chunk sizes (some topics take one paragraph, others take ten), higher computational cost during indexing (every sentence must be embedded and compared with its neighbor), and sensitivity to the similarity threshold parameter. Too high a threshold produces tiny chunks; too low a threshold produces large, unfocused chunks. Tuning requires experimentation with representative content.
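The threshold rule can be illustrated with a small sketch. To keep it self-contained, the embed function below is a toy stand-in (lowercase word counts) for a real sentence-embedding model, and the names embed, cosine, and chunk_semantic are hypothetical; only the break-on-low-similarity logic is the point.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence-embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_semantic(sentences, threshold=0.25):
    """Start a new chunk whenever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])    # topic shift: open a new chunk
        else:
            chunks[-1].append(cur)  # same topic: extend the current chunk
    return chunks


sentences = [
    "The installer requires Python 3.10 or newer.",
    "The installer also requires 2 GB of free disk space.",
    "Billing runs monthly.",
    "Billing invoices are emailed monthly.",
]
semantic_chunks = chunk_semantic(sentences, threshold=0.25)
```

On this input the two installer sentences end up in one chunk and the two billing sentences in another. Raising the threshold to 0.9 splits every sentence into its own chunk, while a threshold of 0.0 keeps everything together, which mirrors the sensitivity described above.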
Recursive Chunking
Recursive chunking respects the document's existing hierarchical structure. It first attempts to split on the highest-level boundaries (major headings or section breaks), then recursively splits those sections on sub-headings, then on paragraphs, and finally on sentences, stopping when each piece falls below the target size.
This strategy preserves the author's intended organization, keeping related content together and respecting logical groupings. A section titled "Installation Requirements" stays as one chunk (or is split into sub-sections) rather than being arbitrarily split after 512 tokens. This structural awareness produces chunks that are both topically coherent and contextually complete.
Recursive chunking works especially well for well-structured content like technical documentation, API references, legal contracts, and academic papers. It works less well for unstructured content like email threads, chat transcripts, or free-form notes that lack consistent structural markers.
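A minimal sketch of the recursive descent follows. It assumes Markdown-style headings as the structural markers and measures size in whitespace-separated words rather than tokens; both choices, along with the name chunk_recursive and the separator list, are assumptions made for illustration.

```python
def chunk_recursive(text, max_words=12, separators=("\n## ", "\n### ", "\n\n", ". ")):
    """Split on the coarsest separator first; use finer ones only for oversized pieces."""
    text = text.strip()
    if not text:
        return []
    # Small enough, or nothing finer left to split on: keep the piece as is.
    if len(text.split()) <= max_words or not separators:
        return [text]
    head, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(head):
        chunks.extend(chunk_recursive(piece, max_words, rest))
    return chunks


doc = """## Installation Requirements

Python 3.10 or newer is required. Pip must be available on the PATH.

At least 2 GB of free disk space is needed.

## Usage

Run the tool from a terminal."""

pieces = chunk_recursive(doc, max_words=12)
```

The short "Usage" section survives as a single chunk, while the longer first section is broken down to paragraphs and then sentences. This sketch discards the separator text and never merges small neighbors back together, so a heading can end up alone; adding a merge pass is a common refinement.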
Parent-Child Chunking
Parent-child chunking stores documents at two granularity levels simultaneously. Small child chunks (128-256 tokens) are used as the retrieval unit, while larger parent chunks (1024-2048 tokens) that contain the child chunks are stored alongside them. When a child chunk matches a query, the system retrieves the parent chunk instead, providing broader context to the generator.
This two-level approach balances retrieval precision with context completeness. Small child chunks produce specific embedding vectors that match queries precisely. But instead of sending these small fragments to the generator, the system returns the surrounding parent chunk, giving the model enough context to produce a well-informed answer.
Parent-child chunking is particularly effective when answers require context that spans several paragraphs. A question about a function's return value might match a child chunk containing the return value description, but the generator needs to see the full function documentation, including parameters, usage examples, and error conditions, to produce a complete answer.
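The two-level lookup can be sketched as follows. The names Child, build_parent_child, and retrieve_parents are hypothetical, sizes are counted in words for simplicity, and the matching step itself (vector search over child chunks) is left out; the example simply assumes some children were matched.

```python
from dataclasses import dataclass


@dataclass
class Child:
    text: str
    parent_id: int  # index of the parent chunk this child was cut from


def build_parent_child(words, parent_size=1024, child_size=128):
    """Return (parents, children); each child records the index of its parent."""
    parents, children = [], []
    for p_start in range(0, len(words), parent_size):
        parent_words = words[p_start:p_start + parent_size]
        parent_id = len(parents)
        parents.append(" ".join(parent_words))
        for c_start in range(0, len(parent_words), child_size):
            child_words = parent_words[c_start:c_start + child_size]
            children.append(Child(" ".join(child_words), parent_id))
    return parents, children


def retrieve_parents(matched_children, parents):
    """Swap matched children for their parents, deduplicated, in match order."""
    seen, result = set(), []
    for child in matched_children:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            result.append(parents[child.parent_id])
    return result


words = [f"w{i}" for i in range(2500)]
parents, children = build_parent_child(words)
# Pretend the retriever matched these three children, best match first.
context = retrieve_parents([children[9], children[8], children[0]], parents)
```

Two of the matched children share a parent, so the generator receives two parent chunks rather than three fragments.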
Specialized Chunking for Different Content Types
Code. Code requires chunking strategies that respect syntactic boundaries. Splitting in the middle of a function, class, or import block produces fragments that are difficult to embed meaningfully. Code-aware chunkers split on function definitions, class boundaries, or logical blocks, keeping each unit syntactically complete. For repositories with many small functions, each function can be its own chunk. For files with large classes, methods within classes become individual chunks.
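For Python source, one way to find syntactic boundaries is the standard library's ast module. The sketch below is an illustration under simple assumptions: the function name chunk_python_source is hypothetical, only top-level functions and classes become chunks, and imports, module-level statements, and decorator lines are left out.

```python
import ast


def chunk_python_source(source):
    """Return one chunk per top-level function or class, using the parser for boundaries."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact source text spanned by the node.
            chunks.append(ast.get_source_segment(source, node))
    return chunks


source = '''import os

def load(path):
    with open(path) as f:
        return f.read()

class Parser:
    def parse(self, text):
        return text.split()
'''

code_chunks = chunk_python_source(source)
```

Each resulting chunk is a complete definition. For large classes, the same idea can be applied one level down by iterating over the class node's body to emit one chunk per method.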
Tables. Tabular data is poorly served by text-based chunking. A table split across chunks loses its structural relationships between rows and columns. The best approach is to treat each table as a single chunk (with surrounding context) or to convert tables into structured text ("Row 1: Column A is X, Column B is Y") before chunking. Some systems extract tables separately and store them as structured data with different retrieval mechanisms.
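The row-to-text conversion mentioned above might look like the following sketch. The function name table_to_text and the sample pricing table are made up for illustration.

```python
def table_to_text(headers, rows):
    """Render each row as 'Row N: <header> is <value>, ...' so it survives text chunking."""
    lines = []
    for i, row in enumerate(rows, start=1):
        cells = ", ".join(f"{h} is {v}" for h, v in zip(headers, row))
        lines.append(f"Row {i}: {cells}")
    return "\n".join(lines)


headers = ["Plan", "Storage", "Price"]
rows = [["Free", "5 GB", "$0"], ["Pro", "100 GB", "$12"]]
table_text = table_to_text(headers, rows)
# Row 1: Plan is Free, Storage is 5 GB, Price is $0
# Row 2: Plan is Pro, Storage is 100 GB, Price is $12
```

Because every line repeats the column names, a row remains interpretable even if the table is later split across chunks.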
Conversational data. Chat logs, forum threads, and email chains have unique structure where context flows across messages. Fixed-size chunking breaks conversation threads at arbitrary points. A better approach is to chunk by conversation turn or by topic thread, keeping the full context of each exchange together. Including speaker identification and timestamps in each chunk helps the generator produce contextually accurate responses.
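As a rough sketch of thread-based chunking, the code below groups messages by a thread identifier and keeps speaker and timestamp in the chunk text. The message schema (thread, time, speaker, text keys) and the function name chunk_by_thread are assumptions; real chat exports will differ, and many have no explicit thread field at all.

```python
from collections import defaultdict


def chunk_by_thread(messages):
    """Group messages by thread and render each with its speaker and timestamp."""
    threads = defaultdict(list)
    for msg in messages:
        threads[msg["thread"]].append(
            f'[{msg["time"]}] {msg["speaker"]}: {msg["text"]}'
        )
    # One chunk per thread, with messages in their original order.
    return {thread: "\n".join(lines) for thread, lines in threads.items()}


messages = [
    {"thread": "deploy", "time": "09:00", "speaker": "ana", "text": "Is the deploy done?"},
    {"thread": "lunch", "time": "09:01", "speaker": "ben", "text": "Pizza today?"},
    {"thread": "deploy", "time": "09:02", "speaker": "cy", "text": "Yes, finished at 08:55."},
]
thread_chunks = chunk_by_thread(messages)
```

The question and its answer land in the same chunk even though an unrelated message sits between them in the log.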
Measuring Chunking Quality
Chunking quality is measured indirectly through retrieval quality metrics. If changing the chunking strategy improves recall at k (a larger share of the relevant chunks appears in the top k results), the new strategy is better for your use case. Comparing strategies side by side on a representative query set with known relevant passages is the most reliable way to find the optimal approach.
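Recall at k is straightforward to compute once each test query has a list of retrieved IDs and a set of IDs judged relevant. The sketch below uses hypothetical function names and invented results for two strategies, and it assumes relevance labels can be mapped onto chunk IDs for each strategy being compared.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one pair per query."""
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Invented results for two queries under two chunking strategies.
strategy_a = [(["c1", "c7", "c3"], {"c3", "c9"}), (["c2", "c4", "c5"], {"c2"})]
strategy_b = [(["c3", "c9", "c1"], {"c3", "c9"}), (["c4", "c2", "c5"], {"c2"})]

score_a = mean_recall_at_k(strategy_a, k=3)  # 0.75
score_b = mean_recall_at_k(strategy_b, k=3)  # 1.0
```

On this invented data, strategy B would be preferred at k=3. A real comparison needs enough queries for the difference to be meaningful.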
Manual inspection of chunks is also valuable, especially during initial development. Reading through a sample of chunks from your knowledge base reveals whether they are coherent (each chunk makes sense on its own), complete (each chunk contains enough context to be useful), focused (each chunk covers a single topic), and appropriately sized (not so small they lack meaning, not so large they dilute relevance).
Chunk Metadata and Enrichment
Beyond the chunking method itself, attaching metadata to each chunk significantly improves retrieval quality and generation accuracy. Useful metadata includes the source document title, the section heading the chunk falls under, the page number or URL, the publication or modification date, and any tags or categories from the original document. This metadata serves multiple purposes: it enables filtering during retrieval (only search documents from a specific date range), it provides attribution context for the generator, and it helps the reranker make better relevance judgments.
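One possible shape for chunk metadata, together with the date-range filter mentioned above, is sketched here. The Chunk fields and the filter_by_date helper are illustrative assumptions, not a fixed schema, and the URLs are placeholders; most vector stores offer their own metadata filtering that would replace the list comprehension.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Chunk:
    text: str
    title: str      # source document title
    section: str    # heading the chunk falls under
    url: str
    modified: date  # publication or modification date
    tags: list = field(default_factory=list)


def filter_by_date(chunks, start, end):
    """Keep only chunks whose modification date falls within [start, end]."""
    return [c for c in chunks if start <= c.modified <= end]


corpus = [
    Chunk("Requires 1 GB of disk.", "Installation Guide", "System Requirements",
          "https://example.com/v1/install", date(2023, 1, 15), ["install"]),
    Chunk("Requires 2 GB of disk.", "Installation Guide", "System Requirements",
          "https://example.com/v2/install", date(2024, 6, 1), ["install"]),
]
recent = filter_by_date(corpus, date(2024, 1, 1), date(2024, 12, 31))
```

Filtering before or during vector search keeps the outdated requirement out of the candidate set, and the remaining fields can be passed to the generator for attribution.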
Some systems also prepend context summaries to each chunk. Adding a line like "From the Installation Guide, section on System Requirements:" before the chunk text gives the embedding model additional context for creating a more representative vector. This technique, sometimes called contextual chunking, has been reported to improve retrieval recall by helping the embedding model capture what each chunk is about beyond its literal text; as with any chunking change, verify the gain on your own query set.
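In its simplest form, the prefix is built from metadata the chunk already carries. The helper below is a hypothetical sketch that reproduces the example line from the text; systems that generate the summary with a language model would replace the f-string with that call.

```python
def with_context_prefix(chunk_text, title, section):
    """Build the text that gets embedded: a one-line context header plus the chunk."""
    return f"From the {title}, section on {section}:\n{chunk_text}"


embedding_input = with_context_prefix(
    "At least 2 GB of free disk space is needed.",
    title="Installation Guide",
    section="System Requirements",
)
```

Only the text sent to the embedding model needs the prefix; the original chunk text can still be stored separately for display or generation.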
Start with fixed-size chunking at 512 tokens with 50-token overlap. Measure retrieval quality, then experiment with recursive or semantic chunking if the metrics show room for improvement. Use parent-child chunking when answers need broader context than what small, precise chunks provide. Always match your chunking strategy to your content type.