RAG Quality: Measuring Retrieval Accuracy
Retrieval Quality Metrics
Retrieval quality measures whether the right documents reach the generator. The three most important retrieval metrics are recall at k, precision at k, and mean reciprocal rank.
Recall at k measures what fraction of all relevant documents appear in the top k results. If a query has 5 relevant documents in the knowledge base and the retriever returns 3 of them in its top 10 results, recall at 10 is 0.6. High recall means the retriever is not missing relevant information. Low recall means good answers exist in the knowledge base but the retriever is not finding them.
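As a minimal sketch of the definition above, recall at k can be computed from a ranked list of retrieved document IDs and the set of IDs labeled relevant. The function name and the sample IDs are illustrative assumptions, not part of any particular library.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        # Recall is undefined when nothing is labeled relevant for the query.
        raise ValueError("query has no relevant documents")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical query: 5 relevant documents, 3 of them retrieved in the top 10.
relevant = ["d1", "d2", "d3", "d4", "d5"]
retrieved = ["d1", "x1", "d2", "x2", "x3", "d3", "x4", "x5", "x6", "x7"]
print(recall_at_k(retrieved, relevant, k=10))  # 0.6
```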
Precision at k measures what fraction of the top k results are actually relevant. If the retriever returns 10 chunks and 6 are relevant, precision at 10 is 0.6. High precision means the generator receives mostly relevant context. Low precision means the context is diluted with irrelevant chunks that waste context window space and may confuse the generator.
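Precision at k can be sketched the same way. This version divides by the number of results actually returned in the top k, which is an assumption; some definitions always divide by k even when fewer results come back. The names and sample IDs are illustrative.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top k results that are labeled relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    # Assumption: divide by the number of results returned, not by k itself.
    return hits / len(top_k)


# Hypothetical query: 10 chunks returned, 6 of them relevant.
relevant = {"c1", "c2", "c3", "c4", "c5", "c6"}
retrieved = ["c1", "n1", "c2", "c3", "n2", "c4", "n3", "c5", "n4", "c6"]
print(precision_at_k(retrieved, relevant, k=10))  # 0.6
```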
Mean reciprocal rank (MRR) measures how high the first relevant document ranks in the results. If the first relevant document is the top result, the reciprocal rank is 1. If it is the third result, the reciprocal rank is 1/3. MRR is the average of these reciprocal ranks across all queries in the evaluation set, so it indicates how quickly the retriever surfaces relevant content. High MRR means a relevant chunk usually appears near the top of the results, which matters because generators tend to make better use of context that appears early in the prompt.
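The following sketch computes MRR over a small set of hypothetical queries. Scoring a query as 0 when no relevant document is retrieved is a common convention and is assumed here; the function names are illustrative.

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant result, using 1-based ranks."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0  # Assumed convention: no relevant result retrieved scores 0.


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("no queries to evaluate")
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["a", "b", "c"], ["a"]),  # first relevant at rank 1 -> 1
    (["x", "y", "b"], ["b"]),  # first relevant at rank 3 -> 1/3
    (["x", "y", "z"], ["q"]),  # nothing relevant retrieved -> 0
]
mrr = mean_reciprocal_rank(runs)
print(round(mrr, 3))  # 0.444
```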
Computing these metrics requires a labeled evaluation set: a collection of representative queries, each paired with the knowledge base documents that are relevant to it. Building this evaluation set is an upfront investment that pays for itself many times over. Use real user queries rather than synthetic ones, have domain experts identify the relevant documents for each query, and aim for a minimum of 50 to 100 query-document pairs covering the full range of topics in your knowledge base.
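One possible way to store such an evaluation set is one JSON object per line. The sketch below assumes hypothetical field names (query, relevant_doc_ids, topic) and made-up document IDs; it rejects unlabeled cases and reports how many queries cover each topic.

```python
import json
from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalCase:
    query: str                   # a real user query
    relevant_doc_ids: list       # documents an expert marked as relevant
    topic: str = "uncategorized"


def load_eval_set(jsonl_text):
    """Parse one JSON object per line and reject cases with no labels."""
    cases = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        case = EvalCase(
            query=record["query"],
            relevant_doc_ids=list(record["relevant_doc_ids"]),
            topic=record.get("topic", "uncategorized"),
        )
        if not case.relevant_doc_ids:
            raise ValueError(f"line {line_no}: no relevant documents labeled")
        cases.append(case)
    return cases


def topic_coverage(cases):
    """Count evaluation queries per topic to spot uncovered areas."""
    return Counter(case.topic for case in cases)


sample = "\n".join([
    json.dumps({"query": "How do I reset my password?",
                "relevant_doc_ids": ["kb-12"], "topic": "accounts"}),
    json.dumps({"query": "What is the refund window?",
                "relevant_doc_ids": ["kb-40", "kb-41"], "topic": "billing"}),
])
cases = load_eval_set(sample)
print(len(cases), dict(topic_coverage(cases)))
```

Checking topic coverage early makes it easier to see which parts of the knowledge base still lack labeled queries.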
Generation Quality Metrics
Generation quality measures how well the model uses the retrieved context to produce a useful response. The three key dimensions are faithfulness, relevance, and completeness.
Faithfulness (also called groundedness) measures whether every claim in the response is actually supported by the retrieved context. A response that invents facts not present in the context, even plausible-sounding ones, fails the faithfulness test. This is the most critical metric for RAG because the entire purpose of retrieval is to ground responses in verified information. High faithfulness means the model is using the context rather than hallucinating.
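A claim-level faithfulness score can be sketched as the fraction of claims in a response that a judge considers supported by the retrieved context. In practice the judge would typically be an LLM call or a human reviewer; the substring-based judge below is only a placeholder so the example runs, and all names and sample text are assumptions.

```python
def faithfulness_score(claims, context, is_supported):
    """Fraction of response claims that the judge marks as supported."""
    if not claims:
        raise ValueError("response contains no claims to score")
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_judge(claim, context):
    # Placeholder judge, NOT a real faithfulness check: it only tests
    # whether the claim appears verbatim (case-insensitively) in the context.
    return claim.lower() in context.lower()


context = "The warranty lasts 12 months. Returns are accepted within 30 days."
claims = [
    "the warranty lasts 12 months",
    "returns are accepted within 30 days",
    "shipping is free",  # not in the context, so it counts as unsupported
]
score = faithfulness_score(claims, context, naive_judge)
print(round(score, 2))  # 0.67
```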
Relevance measures whether the response actually answers the question that was asked. A response can be perfectly faithful to the context but still irrelevant if it discusses a tangentially related topic rather than directly addressing the query. Relevance catches cases where the retriever found the right documents but the generator latched onto the wrong aspect of the context.
Completeness measures whether the response covers all aspects of the query that are addressed in the retrieved context. A response that answers part of a multi-faceted question but ignores other aspects fails the completeness test. This metric is especially important for complex queries that require synthesizing information across multiple chunks.
Evaluation Frameworks
RAGAS (Retrieval Augmented Generation Assessment) is one of the most widely adopted automated evaluation frameworks for RAG systems. It provides metrics for faithfulness, answer relevancy, context precision, and context recall, using LLM-as-judge approaches to score each dimension without requiring human evaluation for every query. RAGAS can be integrated into CI/CD pipelines for continuous quality monitoring.
DeepEval provides a testing framework specifically designed for LLM applications, with RAG-specific metrics including contextual relevancy, faithfulness, and hallucination detection. It supports custom metrics and integrates with popular testing frameworks, making it suitable for teams that want to include RAG quality checks in their standard test suites.
TruLens focuses on observability and evaluation with a dashboard that tracks RAG quality over time. It instruments the RAG pipeline to log each component's inputs and outputs, enabling detailed analysis of where quality degrades. TruLens is especially useful for identifying intermittent quality issues that only appear under specific query patterns.
For production systems, a combination of automated evaluation (running continuously on sampled queries) and periodic human evaluation (domain experts reviewing a batch of responses) provides the most reliable quality signal. Automated metrics catch regressions quickly, while human evaluation catches subtle quality issues that automated scoring may miss.
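The split between continuous automated evaluation and periodic human review might be sketched as a sampling step over logged queries. The sampling rate, batch size, and function name below are illustrative assumptions rather than recommended values.

```python
import random


def split_for_evaluation(logged_queries, auto_rate=0.1, human_batch_size=20, seed=0):
    """Pick queries for automated scoring and a fixed-size batch for experts."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    # Each logged query is independently sampled for automated scoring.
    automated = [q for q in logged_queries if rng.random() < auto_rate]
    # Human reviewers get a batch drawn without replacement.
    human = rng.sample(logged_queries, min(human_batch_size, len(logged_queries)))
    return automated, human


logged = [f"query-{i}" for i in range(200)]
automated, human = split_for_evaluation(logged)
print(len(automated), len(human))
```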
Diagnosing Common Failure Modes
Retrieval miss occurs when relevant documents exist in the knowledge base but the retriever does not return them. Diagnose by checking recall metrics. Common causes include embedding model weakness on domain-specific terminology, chunking that splits relevant information across boundaries, or missing keyword matching for exact terms. Fix by adding hybrid search, trying a different embedding model, adjusting chunk size and overlap, or adding domain-specific query expansion.
Context dilution occurs when relevant chunks are returned but buried among irrelevant ones. The generator receives too many marginally related chunks and cannot identify the truly useful information. Diagnose by checking precision metrics. Fix by adding reranking, reducing the top-k parameter, or improving metadata filtering to narrow retrieval scope.
Hallucination occurs when the generator produces claims not supported by the retrieved context. This may happen because the model falls back on training knowledge, or because it extrapolates beyond what the context states. Diagnose by checking faithfulness metrics. Fix by strengthening the system prompt to emphasize using only the provided context, reducing the model temperature, or switching to a model with stronger instruction-following capabilities.
Stale information occurs when the knowledge base contains outdated documents that contradict current information. The retriever may return both old and new versions, confusing the generator. Diagnose by reviewing timestamps on retrieved chunks. Fix by implementing version-aware retrieval that prefers recent documents, removing outdated content from the index, or adding date metadata to chunks so the generator can identify which sources are current.
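The four failure modes above map fairly directly onto metric checks, which can be sketched as a simple triage function. The thresholds and the one-year staleness cutoff are hypothetical placeholders to be replaced with values suited to your system.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; tune these for your own system.
DEFAULT_THRESHOLDS = {
    "recall": 0.7,
    "precision": 0.5,
    "faithfulness": 0.8,
    "stale_fraction": 0.1,
}


def stale_fraction(chunk_timestamps, now, max_age_days=365):
    """Share of retrieved chunks older than the cutoff."""
    if not chunk_timestamps:
        return 0.0
    cutoff = now - timedelta(days=max_age_days)
    stale = sum(1 for ts in chunk_timestamps if ts < cutoff)
    return stale / len(chunk_timestamps)


def diagnose(metrics, thresholds=None):
    """Return the failure modes suggested by the measured metrics."""
    t = dict(DEFAULT_THRESHOLDS)
    if thresholds:
        t.update(thresholds)
    findings = []
    if metrics["recall"] < t["recall"]:
        findings.append("retrieval miss")
    if metrics["precision"] < t["precision"]:
        findings.append("context dilution")
    if metrics["faithfulness"] < t["faithfulness"]:
        findings.append("hallucination")
    if metrics.get("stale_fraction", 0.0) > t["stale_fraction"]:
        findings.append("stale information")
    return findings


now = datetime(2024, 1, 1)
timestamps = [datetime(2023, 6, 1), datetime(2021, 1, 1)]
metrics = {
    "recall": 0.55,
    "precision": 0.8,
    "faithfulness": 0.9,
    "stale_fraction": stale_fraction(timestamps, now),
}
print(diagnose(metrics))  # ['retrieval miss', 'stale information']
```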
Building a Quality Feedback Loop
Sustained RAG quality requires a continuous feedback loop. Users flag incorrect or unhelpful responses. These flagged responses are reviewed to identify the root cause (retrieval failure, context dilution, hallucination, or missing knowledge). Fixes are applied to the specific component responsible. The evaluation set is updated with new test cases based on the discovered failures. Finally, metrics are monitored to confirm that the fix improved quality without degrading other aspects.
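Two steps of this loop, tallying root causes and folding reviewed failures back into the evaluation set, can be sketched as follows. The record fields and sample data are assumptions made for illustration.

```python
from collections import Counter

ROOT_CAUSES = (
    "retrieval failure",
    "context dilution",
    "hallucination",
    "missing knowledge",
)


def summarize_root_causes(flagged):
    """Count reviewed flags per root cause to show where fixes are needed."""
    counts = Counter()
    for item in flagged:
        if item["root_cause"] not in ROOT_CAUSES:
            raise ValueError(f"unknown root cause: {item['root_cause']}")
        counts[item["root_cause"]] += 1
    return counts


def extend_eval_set(eval_set, flagged):
    """Return a new evaluation set with a case for each unseen flagged query."""
    known = {case["query"] for case in eval_set}
    added = []
    for item in flagged:
        if item["query"] not in known:
            added.append({
                "query": item["query"],
                "relevant_doc_ids": item["relevant_doc_ids"],
            })
            known.add(item["query"])
    return eval_set + added


eval_set = [{"query": "What is the refund window?", "relevant_doc_ids": ["kb-40"]}]
flagged = [
    {"query": "Can I pay by invoice?", "relevant_doc_ids": ["kb-77"],
     "root_cause": "retrieval failure"},
    {"query": "What is the refund window?", "relevant_doc_ids": ["kb-40"],
     "root_cause": "hallucination"},
]
print(summarize_root_causes(flagged))
print(len(extend_eval_set(eval_set, flagged)))  # 2
```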
This loop ensures that quality improves over time rather than degrading as the knowledge base grows and query patterns evolve. Teams that invest in this infrastructure early avoid the costly cycle of reacting to user complaints without systematic understanding of root causes.
Setting Quality Baselines and Targets
Before optimizing, establish a baseline by running your evaluation set against the current system and recording all metrics. This baseline tells you where you are starting and helps prioritize which metrics need the most improvement. A system with high recall but low faithfulness needs generator-side fixes (better prompting, stronger model). A system with low recall but high faithfulness needs retrieval-side fixes (better embeddings, hybrid search, improved chunking).
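Once a baseline is recorded, later runs can be compared against it to confirm that a fix helped without hurting other metrics. The sketch below assumes every metric is higher-is-better and uses a hypothetical tolerance of 0.01; the sample numbers are made up.

```python
def compare_to_baseline(baseline, current, tolerance=0.01):
    """Return per-metric deltas and the metrics that regressed."""
    deltas = {name: current[name] - baseline[name] for name in baseline}
    # Assumption: all metrics are higher-is-better, so a drop larger than
    # the tolerance counts as a regression.
    regressions = sorted(name for name, delta in deltas.items() if delta < -tolerance)
    return deltas, regressions


baseline = {"recall": 0.62, "precision": 0.55, "faithfulness": 0.91}
after_fix = {"recall": 0.74, "precision": 0.50, "faithfulness": 0.91}
deltas, regressions = compare_to_baseline(baseline, after_fix)
print(regressions)  # ['precision']
```

In this made-up run, a retrieval-side change raised recall but lowered precision, which is the kind of trade-off the baseline makes visible.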
Set realistic quality targets based on your use case. A customer support system might target 90% faithfulness and 75% recall. A legal research system might target 98% faithfulness and 85% recall. A casual FAQ bot might accept lower thresholds. The cost of errors varies dramatically by domain, and your quality targets should reflect how much damage an incorrect answer causes in your specific context.
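Targets like these can be kept as configuration and checked automatically. The sketch below encodes the two example targets from the paragraph above; the dictionary layout, function name, and measured values are illustrative assumptions.

```python
# Example targets taken from the text; adjust them for your own domain.
TARGETS = {
    "customer_support": {"faithfulness": 0.90, "recall": 0.75},
    "legal_research": {"faithfulness": 0.98, "recall": 0.85},
}


def unmet_targets(metrics, use_case):
    """Return the metrics that fall below the target for a use case."""
    targets = TARGETS[use_case]
    return {
        name: {"actual": metrics[name], "target": target}
        for name, target in targets.items()
        if metrics[name] < target
    }


measured = {"faithfulness": 0.93, "recall": 0.78}
print(unmet_targets(measured, "customer_support"))  # {}
print(sorted(unmet_targets(measured, "legal_research")))  # ['faithfulness', 'recall']
```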
As a rough reference point when reading your own baseline, common starting values for a well-configured RAG system are recall at 5 of 0.7-0.8, precision at 5 of 0.5-0.7, and faithfulness of 0.8-0.9. These values vary with domain complexity and knowledge base quality.
Targets also encode trade-offs between metrics. Customer support systems may prioritize faithfulness, pushing it above 0.95 where wrong answers are especially costly, even at the expense of lower recall. Research systems may prioritize recall above 0.9, because missing relevant sources is unacceptable, while accepting somewhat lower precision. Internal knowledge management may weight all metrics equally. Align your optimization efforts with the metrics that matter most for your specific use case.
Measure retrieval and generation quality independently: use recall, precision, and MRR for retrieval, and faithfulness, relevance, and completeness for generation. Build a labeled evaluation set from real queries, use automated frameworks like RAGAS for continuous monitoring, and invest in a feedback loop that connects user reports to systematic root cause analysis and fixes.