How RAG Works: Retrieval, Context, Generation
Phase One: Document Indexing
Before a RAG system can answer questions, it needs to process and index the knowledge base. This indexing phase runs offline and prepares documents for fast retrieval during live queries.
The process begins with document loading, where raw files are read from their source format. PDFs, web pages, Markdown files, database exports, and API responses all need to be converted into clean text. This parsing step is often underestimated, but it directly affects downstream quality. A PDF parser that drops table data or mishandles multi-column layouts will produce chunks that miss critical information. Production systems use specialized parsers for each format and include post-processing steps to clean up structural artifacts like headers, footers, and page numbers.
After parsing, documents are split into smaller pieces called chunks. Chunking determines the granularity of retrieval. Each chunk becomes an independent retrieval unit that can be found and returned in response to a query. The chunking strategy, including chunk size, overlap, and split boundaries, has a major impact on retrieval quality. Typical chunk sizes range from 256 to 1024 tokens, with 10-20% overlap between adjacent chunks to preserve context at boundaries.
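The interaction between chunk size and overlap is easier to see in code. The following is a minimal sketch, not a production splitter: the function name `chunk_tokens` and its parameters are illustrative, and a list of placeholder words stands in for the output of a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=512, overlap_ratio=0.15):
    """Split a token list into fixed-size chunks that share a fractional overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Placeholder tokens; a real system would use the embedding model's tokenizer.
words = [f"w{i}" for i in range(20)]
chunks = chunk_tokens(words, chunk_size=8, overlap_ratio=0.25)
for chunk in chunks:
    print(chunk[0], "...", chunk[-1])
```

With a chunk size of 8 and 25% overlap, each chunk repeats the last two tokens of the previous one, so a sentence that straddles a boundary is still fully present in one of the two chunks. Real splitters usually also respect sentence or heading boundaries, which this sketch ignores.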
Each chunk is then converted into a vector embedding, a numerical array that represents the chunk's semantic meaning in high-dimensional space. The embedding model maps text into vectors such that semantically similar texts produce vectors that are close together. For example, a chunk about "machine learning model training" would produce a vector close to a chunk about "neural network optimization" because these topics are semantically related, even though they share few words.
The resulting vectors, along with the original text and any metadata (source document, page number, section heading), are stored in a vector database. This database provides fast approximate nearest-neighbor search, allowing the system to find relevant chunks among millions of vectors in milliseconds.
Phase Two: The Query Pipeline
When a user submits a question, the query pipeline activates. This is the real-time phase where retrieval, context assembly, and generation happen in sequence.
Step 1: Query Processing
The raw user query is first processed to improve retrieval effectiveness. In simple systems, the query is used as-is. In advanced systems, query processing may include rewriting the query into a more search-friendly form, expanding abbreviations and acronyms, decomposing complex multi-part questions into separate sub-queries, or generating a hypothetical answer (HyDE, short for Hypothetical Document Embeddings) that is embedded and used as the search query in place of the original question. Each of these techniques aims to improve the chances of finding relevant documents.
Step 2: Retrieval
The processed query is converted into a vector using the same embedding model that was used during indexing. This query vector is sent to the vector database, which performs a similarity search to find the k most similar document vectors. The similarity is typically measured using cosine similarity or dot product, both of which quantify how close two vectors are in the embedding space. The two measures give the same ranking when the vectors are normalized to unit length.
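The similarity search itself can be sketched in a few lines. This example assumes tiny hand-written three-dimensional vectors in place of real embeddings and scans every vector exhaustively, whereas a vector database would use an approximate nearest-neighbor index. The names `cosine_similarity`, `top_k`, and the chunk ids are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Return the k (chunk_id, score) pairs most similar to query_vec."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec))
              for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors; real embeddings have hundreds or thousands of dimensions.
index = {
    "ml-training": [0.9, 0.1, 0.0],
    "nn-optimization": [0.8, 0.3, 0.1],
    "cooking-pasta": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.2, 0.0], index, k=2)
for chunk_id, score in results:
    print(chunk_id, round(score, 3))
```

The two machine-learning chunks score close to 1.0 while the unrelated chunk scores near zero, which is the behavior the embedding model is expected to produce for semantically related text.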
Modern systems go beyond pure vector search. Hybrid retrieval combines vector similarity with traditional keyword matching (BM25 or TF-IDF). Vector search captures semantic meaning, so "automobile" matches "car," while keyword search catches exact terms, product codes, and technical identifiers that embedding models may miss. The results from both methods are merged using reciprocal rank fusion or a learned combining function.
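Reciprocal rank fusion is simple enough to show in full. In this sketch each document receives 1 / (k + rank) from every ranked list it appears in, and the contributions are summed. The constant k=60 is a commonly used default rather than something the text above prescribes, and the document ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]   # from embedding similarity
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # from BM25 or similar
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Because the method uses only rank positions, it does not need the vector scores and keyword scores to be on comparable scales, which is one reason it is a popular choice for hybrid retrieval.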
After initial retrieval, a reranking step applies a more computationally expensive model, typically a cross-encoder, to rescore each query-document pair. Cross-encoders consider the query and document jointly rather than comparing pre-computed vectors, producing more accurate relevance judgments. Because cross-encoders are slow, reranking is typically applied only to the top 20-50 initial results.
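The two-stage shape of retrieve-then-rerank can be sketched without a real model. In the example below, `score_pair` is a pluggable function that would be a cross-encoder call in practice. The word-overlap scorer used here is only a stand-in so the sketch runs on its own, and all names are hypothetical.

```python
def rerank(query, candidates, score_pair, max_candidates=50, top_n=5):
    """Rescore a capped shortlist with an expensive scorer and keep the best."""
    shortlist = candidates[:max_candidates]  # cap the expensive scoring work
    scored = [(score_pair(query, text), text) for text in shortlist]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def toy_overlap_score(query, text):
    """Stand-in for a cross-encoder: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


candidates = [
    "shipping times for international orders",
    "how to reset your account password",
    "password rules and policy",
]
best = rerank("reset password", candidates, toy_overlap_score, top_n=2)
print(best)
```

The important design point is the `max_candidates` cap: the scorer runs once per query-document pair, so its cost grows linearly with the shortlist size.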
Step 3: Context Assembly
The top-ranked chunks from retrieval and reranking are assembled into a context window. The system must decide how many chunks to include (the top-k parameter), how to order them within the context, and how to format them for the language model.
Research on the "lost in the middle" phenomenon shows that language models attend most strongly to information at the beginning and end of their context, with weaker attention to middle positions. Effective context assembly strategies therefore place the most relevant chunks at the beginning and end of the context, with the least critical pieces in the middle. Some systems also include metadata with each chunk (source document name, section heading) to help the model attribute its answers.
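One way to act on the lost-in-the-middle finding is to alternate ranked chunks between the front and the back of the context, so the weakest chunks end up in the center. This is a sketch of that one ordering heuristic, not the only valid strategy, and the function name is illustrative.

```python
def order_for_context(ranked_chunks):
    """Reorder chunks (most relevant first) so the strongest sit at the edges."""
    front, back = [], []
    for position, chunk in enumerate(ranked_chunks):
        if position % 2 == 0:
            front.append(chunk)  # ranks 1, 3, 5, ... fill from the start
        else:
            back.append(chunk)   # ranks 2, 4, 6, ... fill from the end
    return front + back[::-1]


ordered = order_for_context(["r1", "r2", "r3", "r4", "r5"])
print(ordered)
```

Here `r1` opens the context, `r2` closes it, and the lowest-ranked chunk `r5` lands in the middle position.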
Step 4: Generation
The assembled context is combined with the user's original question and a system prompt that instructs the model on how to use the context. A typical system prompt tells the model to answer based on the provided context, to cite the specific chunks it draws from, and to indicate when the context does not contain enough information to answer the question confidently.
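A prompt of this kind might be assembled as follows. The wording of `SYSTEM_PROMPT`, the chunk schema with `source` and `text` keys, and the bracketed numbering are all assumptions made for illustration. Real deployments usually pass the system prompt and user message as separate fields of a chat API rather than as one string.

```python
SYSTEM_PROMPT = (
    "Answer using only the context below. "
    "Cite the chunks you use by their bracketed number. "
    "If the context does not contain enough information, say so."
)


def build_prompt(question, chunks):
    """Combine instructions, numbered context chunks, and the user question."""
    context_lines = [
        f"[{number}] (source: {chunk['source']}) {chunk['text']}"
        for number, chunk in enumerate(chunks, start=1)
    ]
    context = "\n".join(context_lines)
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"


prompt = build_prompt(
    "What is the refund window?",
    [{"source": "policy.md", "text": "Refunds are accepted within 30 days."}],
)
print(prompt)
```

Numbering the chunks gives the model a compact way to cite its sources, and it lets the application map each citation back to the original document.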
The language model processes this augmented prompt and generates a response. The quality of the response depends on the model's instruction-following ability, its capacity to synthesize information across multiple chunks, and its tendency to stay grounded in the provided context rather than drawing on potentially outdated training knowledge.
The Feedback Loop
Production RAG systems include monitoring and feedback mechanisms to maintain quality over time. Key metrics include retrieval recall (are the right documents being found), response faithfulness (does the answer match the retrieved context), and user satisfaction (are users marking answers as helpful). When metrics degrade, the system may need tuning: adjusting chunk sizes, switching embedding models, updating the reranker, or revising the system prompt.
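Of these metrics, retrieval recall is the most mechanical to compute once a labeled evaluation set exists. The sketch below assumes each test question comes with a hand-labeled list of relevant document ids. The function name and the ids are illustrative.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # nothing to find, so recall is undefined; report 0.0
    hits = set(retrieved_ids[:k]) & relevant
    return len(hits) / len(relevant)


# The retriever returned four ids; two documents were labeled relevant.
score = recall_at_k(["d3", "d7", "d1", "d9"], ["d1", "d2"], k=3)
print(score)
```

Here one of the two relevant documents appears in the top three results, so recall at 3 is 0.5. Averaging this value over an evaluation set gives a number that can be tracked across changes to chunking, embeddings, or reranking.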
Some advanced systems implement active learning loops where failed retrievals and user corrections are fed back into the system to improve future performance. These feedback loops are especially valuable in domains where the knowledge base changes frequently and the retrieval system needs to adapt continuously.
End-to-End Latency Considerations
In production RAG systems, the total response time from query to answer typically ranges from 500 milliseconds to 3 seconds. Understanding where time is spent helps optimize the pipeline. Query embedding takes 10-50 milliseconds using a dedicated embedding API. Vector search in a well-indexed database returns results in 10-100 milliseconds depending on collection size and index type. Reranking adds 100-500 milliseconds depending on the number of candidates and the reranker model size. Generation is usually the largest component, taking 500-2000 milliseconds depending on the model and response length.
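Adding up the per-stage ranges quoted above gives a quick sanity check on the end-to-end figure. The stage names in this sketch are illustrative labels, and the numbers are the ranges from the preceding paragraph rather than measurements.

```python
# (low, high) latency per stage in milliseconds, taken from the ranges above.
STAGE_MS = {
    "query_embedding": (10, 50),
    "vector_search": (10, 100),
    "reranking": (100, 500),
    "generation": (500, 2000),
}

best_case = sum(low for low, _ in STAGE_MS.values())
worst_case = sum(high for _, high in STAGE_MS.values())
generation_share = STAGE_MS["generation"][1] / worst_case

print(best_case, worst_case, round(generation_share, 2))
```

The stages sum to roughly 0.6 to 2.7 seconds, in line with the overall range, and generation accounts for about three quarters of the worst case. That is why optimization effort usually starts with the generation step.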
Streaming the generation step, where tokens are sent to the user as they are produced rather than waiting for the complete response, significantly improves perceived latency. Users see the first words within a second even when the full response takes several seconds to complete. Most production RAG deployments use streaming for this reason.
Common Failure Points in the Pipeline
Understanding where RAG pipelines fail helps teams prioritize their optimization efforts. The most common failure point is retrieval: the system simply does not find the relevant documents. This happens when the embedding model does not capture domain-specific terminology, when chunking splits critical information across boundaries, or when the knowledge base is missing the information entirely. Retrieval failures are insidious because the generator will still produce a response, just one based on irrelevant context.
The second most common failure is context misuse, where the retriever finds the right documents but the generator ignores them or misinterprets them. This happens more frequently with smaller models that have weaker instruction-following abilities, or when the system prompt does not clearly instruct the model to prioritize retrieved context over its training knowledge. Stronger models and explicit prompting reduce this failure mode significantly.
A third category of failure involves stale or contradictory context. When the knowledge base contains outdated documents alongside current ones, the retriever may return both, and the generator must decide which to trust. Without metadata like publication dates or version numbers attached to chunks, the model has little basis for distinguishing current from outdated information. Adding temporal metadata to chunks and instructing the model to prefer recent sources helps address this problem.
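Temporal metadata can also be used before generation, by dropping superseded chunks from the retrieved set. The sketch below assumes a hypothetical chunk schema with a `doc_key` that identifies the logical document and a `published` date. It keeps only the newest chunk per key.

```python
from datetime import date


def prefer_latest(chunks):
    """Keep the most recently published chunk for each logical document."""
    latest = {}
    for chunk in chunks:
        current = latest.get(chunk["doc_key"])
        if current is None or chunk["published"] > current["published"]:
            latest[chunk["doc_key"]] = chunk
    # Newest first, so recent sources lead the context.
    return sorted(latest.values(), key=lambda c: c["published"], reverse=True)


retrieved = [
    {"doc_key": "pricing", "published": date(2022, 3, 1), "text": "Plan costs $10."},
    {"doc_key": "pricing", "published": date(2024, 6, 1), "text": "Plan costs $12."},
    {"doc_key": "support", "published": date(2023, 1, 15), "text": "Email support only."},
]
fresh = prefer_latest(retrieved)
for chunk in fresh:
    print(chunk["published"], chunk["text"])
```

Filtering like this is a blunt tool: it suits documents that fully replace earlier versions, while historical questions may need the older chunks kept and labeled with their dates instead.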
RAG works through a precise sequence of indexing, retrieval, context assembly, and generation. Each step has a direct impact on response quality, and the most effective systems optimize each stage independently while monitoring end-to-end performance.