How to Optimize RAG Retrieval Quality

Updated May 2026
Optimizing RAG retrieval quality is a systematic process of measuring where the system underperforms, making targeted changes to the specific component responsible, and verifying the improvement against your evaluation set. Most RAG systems reach acceptable quality quickly with basic configuration, but reaching high quality requires disciplined iteration across chunking, embedding, retrieval, and generation.

This guide assumes you have a working RAG pipeline and want to improve its output quality. Each optimization step targets a specific component and includes guidance on when it helps most and when to skip it. Work through the steps in order, as earlier optimizations often change which later steps are needed.

Step 1: Establish Baseline Metrics

Before changing anything, measure your current system quality. Build an evaluation set of 50 to 100 representative queries drawn from real user questions, not synthetic examples. For each query, identify the correct source documents in your knowledge base. This labeled set becomes your ground truth for all optimization work.

Run your current system against this evaluation set and record recall at 5, precision at 5, mean reciprocal rank for retrieval quality, and faithfulness for generation quality. These four metrics give you a complete picture of where the system succeeds and fails. Low recall means the retriever misses relevant documents. Low precision means too many irrelevant documents reach the generator. Low MRR means relevant documents rank too low. Low faithfulness means the generator invents information not in the context.

Review the worst-performing queries manually. Understanding why specific queries fail reveals which component needs attention. If the correct documents never appear in retrieval results, the problem is upstream (chunking, embedding, or search configuration). If the correct documents are retrieved but the response is wrong, the problem is in the generator (prompt, context formatting, or model choice).

Step 2: Optimize Chunking

Chunking is the foundation of retrieval quality because it determines what units of information the retriever can find and return. If a chunk is too large, it contains irrelevant information that dilutes the signal. If too small, it lacks the context needed for a complete answer. If the split falls in the middle of a key passage, the relevant information is broken across two chunks and neither is sufficient alone.

Test three to four chunk sizes against your evaluation set: 256 tokens, 512 tokens, 1024 tokens, and the maximum your embedding model supports. For each size, re-index your knowledge base and run the full evaluation. The optimal size depends on your content type. Dense technical documentation often performs best at 256 to 512 tokens. Narrative content like articles and reports often works better at 512 to 1024 tokens.

Overlap between chunks prevents information loss at boundaries. Start with 10 to 20 percent overlap (50 to 100 tokens for a 512-token chunk). Higher overlap increases the number of chunks and storage requirements but improves recall for information near chunk boundaries.

If your documents have clear structural markers like headings, section breaks, or numbered lists, try recursive chunking that splits at these semantic boundaries first and falls back to token-based splits for oversized sections. For code, use AST-based chunking that respects function and class boundaries. For tables, keep entire tables as single chunks with their column headers preserved.

Step 3: Upgrade Your Embedding Model

The embedding model determines how well semantic similarity captures the relationship between queries and relevant documents. A model trained on general web text may not capture domain-specific relationships in medical, legal, or engineering content. Testing multiple models against your evaluation set is the only reliable way to identify the best fit.

Start by comparing your current model against two or three alternatives. For API-based models, compare OpenAI text-embedding-3-small, text-embedding-3-large, and Cohere embed-v3. For open-source models, compare BGE-M3, GTE-large, and E5-large-v2. Each model has different strengths: some handle technical terminology better, some are stronger at multilingual content, and some perform better on longer chunks.

For highly specialized domains, consider fine-tuning an embedding model on your own data. Create training pairs from your content where each pair consists of a query and a relevant passage. Fine-tuning on a few thousand pairs from your domain often produces significant improvements in retrieval accuracy, especially for content with specialized vocabulary that general models handle poorly.

When switching embedding models, you must re-embed and re-index your entire knowledge base. The new embeddings live in a different vector space and cannot be compared with old embeddings. Plan for this re-indexing cost when evaluating model changes.

Step 4: Add Hybrid Search

Pure vector search relies entirely on semantic similarity, which misses cases where exact keyword matching is important. Product names, error codes, API method names, and specific identifiers are often better matched by keyword search than by vector similarity. Hybrid search combines both approaches to capture semantic relevance and exact matches.

Implement BM25 keyword search alongside your existing vector search. BM25 uses term frequency and inverse document frequency to rank documents by keyword relevance. Most vector databases (Qdrant, Weaviate) support hybrid search natively. For pgvector, add a tsvector column and combine PostgreSQL full-text search with vector similarity.

Merge results from both retrieval methods using reciprocal rank fusion (RRF). RRF assigns a score to each document based on its rank in each result list: score = 1 / (k + rank), where k is a constant (typically 60). Documents that appear in both lists receive the sum of their scores, naturally boosting results that both methods consider relevant. This simple fusion strategy consistently outperforms either method alone.

After implementing hybrid search, re-run your evaluation set and compare metrics against your vector-only baseline. Hybrid search typically improves recall by 5 to 15 percent and is especially impactful for queries containing specific identifiers or technical terms.

Step 5: Add Reranking

Reranking uses a cross-encoder model to rescore the initial retrieval results with much higher accuracy than the bi-encoder embedding model used for initial retrieval. The tradeoff is speed: cross-encoders process each query-document pair individually rather than using pre-computed embeddings, so they can only be applied to a small set of candidates. Retrieve a larger initial set (top 20 to 50 results) and rerank them to select the final top 5.

Use Cohere Rerank, Jina Reranker, or an open-source cross-encoder like ms-marco-MiniLM for reranking. These models take a query and a document as input and output a relevance score. Sort the initial results by reranker score and take the top k for your context window.

Reranking typically provides the single largest quality improvement after basic setup. It corrects ranking errors from the initial retrieval, pushing the most relevant chunks to the top where the generator attends most strongly. The improvement is most significant when initial retrieval returns many marginally relevant results and the reranker separates the truly relevant ones from the noise.

Monitor reranking latency since it adds a synchronous step to every query. For most reranking APIs, scoring 20 to 50 chunks takes 100 to 300 milliseconds. If latency is critical, reduce the candidate set size or use a smaller, faster reranker model.

Step 6: Implement Query Rewriting

User queries are often poorly formulated for retrieval. They may be too vague, use different terminology than the knowledge base, contain abbreviations, or ask complex multi-part questions. Query rewriting transforms the original query into one or more optimized queries that retrieve better results.

Query expansion adds synonyms, related terms, and full forms of abbreviations to the query. If a user asks about "k8s autoscaling," the expanded query might include "Kubernetes autoscaling horizontal pod autoscaler." This helps the retriever match documents that use different terminology for the same concept.

Sub-question decomposition breaks complex questions into simpler sub-queries that can each be answered independently. A question like "Compare the performance and cost of Pinecone and Qdrant for a 10 million vector collection" becomes three sub-queries: one about Pinecone performance, one about Qdrant performance, and one about cost comparison. Each sub-query retrieves focused results that together provide comprehensive coverage.

Hypothetical document embedding (HyDE) generates a hypothetical answer to the query using the LLM without context, then uses that hypothetical answer as the retrieval query instead of the original question. This works because the hypothetical answer is closer in embedding space to the actual relevant documents than a short question would be. HyDE is particularly effective for abstract or conceptual queries where the question and the answer use very different language.

Step 7: Tune the Generator

Generator optimization focuses on how the language model uses retrieved context to produce responses. The system prompt is the primary control surface. A well-designed system prompt instructs the model to answer only from the provided context, cite specific sources for each claim, indicate when the context does not contain sufficient information, and avoid generating information from its training data.

Context formatting affects how well the generator identifies and uses relevant information. Number each retrieved chunk clearly ("Source 1:", "Source 2:") and include metadata like the document title and section heading. This structured formatting helps the model attribute claims to specific sources and improves citation accuracy.

Reduce the model temperature for RAG generation. Higher temperatures increase randomness and make the model more likely to deviate from the provided context. A temperature of 0.0 to 0.3 works well for factual RAG applications where accuracy matters more than creative variation. Reserve higher temperatures for use cases where diverse phrasing is desirable.

If faithfulness remains low after prompt optimization, try a stronger instruction-following model. Models with better instruction adherence are less likely to hallucinate or ignore the system prompt directive to use only the provided context. Run your evaluation set against the new model and compare faithfulness scores to quantify the improvement.

Optimization Priority Order

Not all optimizations provide equal value, and the order matters. Based on typical impact across production RAG systems, prioritize in this order: First, fix chunking if your chunks are too large, too small, or splitting in the wrong places. Second, add hybrid search if your queries include specific terms, codes, or identifiers. Third, add reranking if precision at the top of results is poor. Fourth, try a different embedding model if retrieval accuracy remains below your target after the previous steps. Fifth, implement query rewriting if users frequently use vague or complex queries. Sixth, tune the generator prompt if faithfulness or relevance is the bottleneck.

Each optimization should be evaluated independently against your baseline. Change one variable at a time and measure the impact. If an optimization does not improve your target metric, revert it before moving to the next step. This disciplined approach prevents the common trap of stacking changes without understanding which ones actually help.

When to Stop Optimizing

Optimization has diminishing returns. The first few changes (chunking, hybrid search, reranking) typically produce large improvements. Later changes (embedding fine-tuning, advanced query rewriting) produce smaller gains at higher implementation cost. Stop optimizing when your metrics meet your quality targets, when additional changes produce less than 1 to 2 percent improvement, or when the cost of further optimization exceeds the value of the quality improvement.

Focus your remaining effort on maintaining quality rather than pursuing marginal gains. Set up continuous evaluation that runs automatically against your evaluation set, catching quality regressions from knowledge base updates, model changes, or configuration drift. A monitoring system that alerts on metric drops is more valuable than a one-time optimization that pushes metrics slightly higher.

Key Takeaway

Optimize systematically by measuring your baseline, making one change at a time, and verifying the impact against your evaluation set. Prioritize high-impact changes first: chunking, hybrid search, and reranking typically provide the largest improvements. Stop when your metrics meet your quality targets, then invest in monitoring to maintain that quality over time.