RAG Architecture: Components and Data Flow
System Overview
A retrieval-augmented generation (RAG) architecture divides into two data paths that share common infrastructure. The ingestion path processes documents offline, converting raw content into searchable vector representations stored in a database. The query path handles real-time user requests: it converts questions into vectors, searches for relevant content, assembles context, and generates responses. Both paths use the same embedding model to ensure that document vectors and query vectors occupy the same semantic space.
The separation of ingestion and query paths is a key architectural decision. Ingestion can run as batch jobs during low-traffic hours, process large document collections in parallel, and retry failures without affecting live users. The query path operates in real time with strict latency requirements, typically targeting sub-2-second end-to-end response times including retrieval and generation.
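The two paths can be sketched in a few lines. This is a minimal illustration under loud assumptions, not a production design: `toy_embed` is a hypothetical stand-in for a real embedding model (it only hashes words into a fixed-size vector), and a plain dictionary stands in for the vector database. The point it shows is that `ingest` and `query` call the same embedding function.

```python
import hashlib
import math

DIM = 256  # toy vector size; real models use hundreds to thousands of dimensions


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit length, so dot product = cosine


def ingest(documents: dict[str, str]) -> dict[str, list[float]]:
    """Ingestion path (offline): embed every document and store the vectors."""
    return {doc_id: toy_embed(text) for doc_id, text in documents.items()}


def query(question: str, index: dict[str, list[float]], k: int = 1) -> list[str]:
    """Query path (online): embed the question with the SAME function, then search."""
    q = toy_embed(question)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), doc_id)
        for doc_id, vec in index.items()
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]
```

In a real system, `ingest` would run as a batch job and `query` would sit behind the latency-sensitive request handler, but both would still share one embedding model.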
Document Processing Layer
The document processing layer transforms raw content from various formats into clean, structured text ready for chunking and embedding. This layer handles format-specific parsing (PDF extraction, HTML stripping, Office document conversion), content cleaning (removing headers, footers, navigation elements, boilerplate), metadata extraction (title, author, date, section headings), and structural analysis (identifying headings, paragraphs, lists, tables, code blocks).
Document quality at this stage has an outsized impact on downstream performance. A PDF parser that mishandles multi-column layouts will produce garbled text that creates poor embeddings. An HTML stripper that removes too aggressively might drop important content from tables or lists. Production systems typically use specialized parsers for each format and include validation steps that flag documents with suspiciously low text extraction rates.
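One way to implement the validation step is a characters-per-page heuristic. The sketch below is an assumption about how such a check might look: the `ExtractionResult` type and the threshold of 200 characters per page are hypothetical and would need tuning per format and corpus.

```python
from dataclasses import dataclass


@dataclass
class ExtractionResult:
    doc_id: str
    page_count: int
    text: str


def is_suspicious(result: ExtractionResult, min_chars_per_page: int = 200) -> bool:
    """Flag documents whose extracted text is implausibly short for their length."""
    if result.page_count <= 0:
        return True  # no pages reported: treat as a parsing failure
    chars_per_page = len(result.text.strip()) / result.page_count
    return chars_per_page < min_chars_per_page
```

A ten-page PDF that yields only a few words (for example, a scanned document with no text layer) would be flagged for manual review or routed to OCR.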
The output of document processing feeds into the chunking engine, which splits cleaned documents into retrieval units. Chunking strategies range from simple fixed-size splitting to sophisticated approaches that respect document structure, semantic boundaries, or hierarchical organization. The choice of strategy depends on the content type, the embedding model, and the types of queries the system needs to handle.
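The simplest strategy mentioned above, fixed-size splitting, might look like the following sketch. It splits on characters and adds an overlap so that a sentence cut at a boundary still appears whole in one of the two neighboring chunks; the parameter names and default sizes are illustrative assumptions, and real systems often count tokens rather than characters.

```python
def chunk_fixed(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

For example, `chunk_fixed("abcdefghij", chunk_size=4, overlap=1)` returns `["abcd", "defg", "ghij"]`.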
Embedding Layer
The embedding layer converts text chunks into dense vector representations that capture semantic meaning. When two chunks are about similar topics, their vectors will be close together in the embedding space, even if they use different words. This semantic matching is what enables RAG to find relevant information based on meaning rather than keyword overlap.
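"Close together" is usually measured with cosine similarity. The sketch below uses hand-written three-dimensional vectors purely for illustration; they are made-up values, not output from any real embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors: two chunks on a similar topic, one on an unrelated topic.
chunk_password_reset = [0.9, 0.1, 0.0]
chunk_account_recovery = [0.8, 0.2, 0.1]
chunk_quarterly_revenue = [0.0, 0.1, 0.9]
```

Here the first two vectors score far higher against each other than either does against the third, which is the property vector search relies on.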
The embedding model is one of the most consequential architectural choices. It determines the dimensionality of vectors (typically 384 to 3072 dimensions), the maximum input length per chunk, the quality of semantic matching across different content types and languages, and the computational cost of embedding generation. Popular choices in 2026 include OpenAI's text-embedding-3-large (3072 dimensions), Cohere's embed-v4 (configurable output size up to 1536 dimensions, with multimodal support), and open-source models like BGE-M3 and E5-large-v2.
The same embedding model must be used for both document indexing and query embedding. Using different models for documents and queries produces vectors in different semantic spaces, making similarity search meaningless. When upgrading to a new embedding model, the entire document collection must be re-embedded, which is a significant operational consideration for large knowledge bases.
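A cheap safeguard is to record which model produced an index and reject vectors that come from anything else. The following is a sketch under assumptions: `EmbeddingSpec` and `VectorIndex` are hypothetical names, and a real vector database would store this information as collection metadata instead.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EmbeddingSpec:
    model_name: str
    dimensions: int


@dataclass
class VectorIndex:
    spec: EmbeddingSpec
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def check(self, vector: list[float], spec: EmbeddingSpec) -> None:
        """Raise if a vector was produced by a different model than the index."""
        if spec != self.spec:
            raise ValueError(f"index built with {self.spec}, got {spec}")
        if len(vector) != self.spec.dimensions:
            raise ValueError("vector length does not match index dimensions")

    def add(self, doc_id: str, vector: list[float], spec: EmbeddingSpec) -> None:
        self.check(vector, spec)
        self.vectors[doc_id] = vector
```

Calling `check` on query vectors as well turns a silent quality regression after a model upgrade into an immediate, visible error.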
Vector Storage Layer
The vector storage layer persists embeddings and provides fast similarity search. This is typically a purpose-built vector database (Pinecone, Weaviate, Qdrant, Milvus, Chroma) or a traditional database with vector extensions (PostgreSQL with pgvector). The storage layer handles index construction for fast approximate nearest-neighbor search, metadata storage and filtering alongside vectors, concurrent read and write operations, and scaling strategies for growing collections.
Index type selection involves tradeoffs between search speed, recall accuracy, memory usage, and build time. HNSW (Hierarchical Navigable Small World) indices offer excellent speed and recall but require significant memory. IVF (Inverted File) indices use less memory but may sacrifice recall. Product quantization compresses vectors to reduce storage but introduces approximation error. Most production deployments use HNSW with appropriate parameters tuned for their collection size and query latency requirements.
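The memory side of this tradeoff can be estimated with back-of-the-envelope arithmetic. The functions below are rough approximations under stated assumptions: vectors stored as 4-byte floats, an HNSW base layer holding up to 2·M neighbor links of 4 bytes each per vector (upper layers and bookkeeping are ignored), and product quantization storing one code per subquantizer. Actual numbers vary by database and configuration.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Memory for uncompressed vectors (float32 by default)."""
    return num_vectors * dims * bytes_per_value


def hnsw_link_bytes(num_vectors: int, m: int = 16, bytes_per_link: int = 4) -> int:
    """Approximate extra memory for HNSW base-layer graph links (2 * M per vector)."""
    return num_vectors * m * 2 * bytes_per_link


def pq_code_bytes(num_vectors: int, num_subquantizers: int, bits_per_code: int = 8) -> int:
    """Memory for product-quantized codes, excluding the small codebooks."""
    return num_vectors * num_subquantizers * bits_per_code // 8
```

Under these assumptions, one million 1024-dimensional vectors take about 4.1 GB uncompressed plus roughly 128 MB of HNSW links at M = 16, while 64 one-byte PQ codes per vector take about 64 MB, a 64-fold reduction paid for with approximation error.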
Retrieval Layer
The retrieval layer orchestrates the search process when a query arrives. In its simplest form, it embeds the query and performs a vector similarity search. In production systems, it typically implements a multi-stage retrieval pipeline that includes query preprocessing (rewriting, expansion, decomposition), hybrid search combining vector similarity with keyword matching, initial candidate retrieval (top 50-100 results from vector search and keyword search), result merging using reciprocal rank fusion or a learned combining function, and metadata filtering to enforce access controls, date ranges, or source restrictions.
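Reciprocal rank fusion, the merging step named above, is short enough to sketch in full. Each document receives 1 / (k + rank) from every ranked list it appears in, and the contributions are summed. The constant k = 60 is a commonly used default; the function name and the alphabetical tie-break are choices made for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["b", "a", "d", "c"]
```

Because only ranks are used, the method needs no calibration between vector similarity scores and keyword scores, which are on different scales.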
The retrieval layer is the most common point of failure in RAG systems. When the system produces a wrong answer, the cause is usually that the retriever failed to surface the relevant documents, not that the generator misused correct context. Monitoring retrieval quality metrics such as recall, precision, and mean reciprocal rank (MRR) is essential for maintaining system health.
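The three retrieval metrics can be computed from a labeled evaluation set, where each query has a known set of relevant document IDs. The sketch below shows one straightforward formulation; the function names and the data layout are assumptions for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1 / rank of the first relevant result (0 if none is found)."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)
```

Tracking these per query category makes it easier to see whether a quality drop comes from retrieval or from generation.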
Reranking Layer
The reranking layer applies a more expensive, more accurate relevance scoring model to the candidates identified by initial retrieval. Cross-encoder rerankers process each query-document pair jointly, considering the interaction between query terms and document content. This produces more accurate relevance scores than the independent embedding comparison used in vector search.
Reranking significantly improves precision, ensuring that the chunks passed to the generator are truly relevant rather than merely semantically similar. The cost is additional latency (typically 100-300 milliseconds for reranking 50 candidates). This tradeoff is almost always worth it in production systems where answer quality matters more than marginal latency differences.
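The two-stage pattern can be sketched independently of any particular model. In the example below, `score_pair` is a placeholder for a cross-encoder that scores a query and passage jointly; `toy_overlap_score` is a hypothetical stand-in based on word overlap, used only so the sketch runs without a model.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Score every (query, candidate) pair jointly and keep the best top_n."""
    scored = [(score_pair(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [passage for _, passage in scored[:top_n]]


def toy_overlap_score(query: str, passage: str) -> float:
    """Hypothetical stand-in for a cross-encoder: share of query words present."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)
```

The cost profile follows from the loop: the scoring model runs once per candidate, which is why reranking is applied to tens of candidates and not to the whole collection.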
Generation Layer
The generation layer assembles the final prompt and produces the response. It constructs a system prompt that instructs the model on how to use retrieved context, orders the retrieved chunks within the context (placing the most relevant at the beginning and end, since models tend to use material in the middle of a long context less reliably), adds metadata annotations to help the model attribute its answers, and manages context window limits by truncating or summarizing chunks if the total exceeds the model's capacity.
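Chunk ordering and context budgeting might be sketched as follows. Both functions take chunks already sorted from most to least relevant. The token count is approximated by whitespace-separated words, which is an assumption made to keep the example self-contained; a real system would use the generator model's tokenizer.

```python
def fit_to_budget(ranked_chunks: list[str], max_tokens: int) -> list[str]:
    """Keep the highest-ranked chunks that fit within the token budget."""
    kept: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())  # crude token estimate (assumption)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


def order_edges_first(ranked_chunks: list[str]) -> list[str]:
    """Place the best chunks at the start and end, the weakest in the middle."""
    front = ranked_chunks[0::2]   # ranks 1, 3, 5, ...
    back = ranked_chunks[1::2]    # ranks 2, 4, 6, ...
    return front + back[::-1]
```

For five chunks ranked r1 to r5, `order_edges_first` returns r1, r3, r5, r4, r2, so the two strongest chunks sit at the edges of the context.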
The system prompt is a critical but often undertested component. It must instruct the model to base answers on the provided context, to cite specific sources, to acknowledge when the context does not contain sufficient information, and to avoid supplementing with potentially outdated training knowledge. Getting this prompt right, and testing it against a diverse set of queries, is essential for reliable RAG performance.
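A prompt covering the four instructions above could be assembled as in this sketch. The wording of `SYSTEM_PROMPT` and the chunk layout (dictionaries with hypothetical `source` and `text` keys) are illustrative assumptions, not a tested or recommended prompt.

```python
SYSTEM_PROMPT = (
    "Answer using only the numbered sources below. "
    "Cite sources as [n] after each claim. "
    "If the sources do not contain enough information, say so "
    "instead of answering from general knowledge."
)


def build_prompt(question: str, chunks: list[dict[str, str]]) -> str:
    """Combine instructions, numbered sources, and the user question."""
    sources = "\n".join(
        f"[{number}] ({chunk['source']}) {chunk['text']}"
        for number, chunk in enumerate(chunks, start=1)
    )
    return f"{SYSTEM_PROMPT}\n\nSources:\n{sources}\n\nQuestion: {question}"
```

Keeping the prompt in one function makes it straightforward to run the same set of test queries against each revision of the wording.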
Monitoring and Observability
Production RAG architectures require monitoring at each layer. Document processing should track extraction success rates and flag format-specific failures. The embedding layer should monitor model latency and detect drift in embedding distributions. Vector storage should track query latency, index health, and storage utilization. Retrieval should measure recall, precision, and the relevance distribution of returned results. Generation should monitor faithfulness, relevance, and user satisfaction metrics. End-to-end tracing that links a user query to its retrieval results and final response is essential for debugging quality issues.
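End-to-end tracing can start as a single record per request that each layer fills in. The structure below is a hypothetical sketch: the field names are assumptions, and a production system would more likely emit spans through an existing tracing framework.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class QueryTrace:
    query: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)
    retrieved_ids: list[str] = field(default_factory=list)
    retrieval_scores: list[float] = field(default_factory=list)
    response: str = ""
    stage_latency_ms: dict[str, float] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the trace as one JSON log line."""
        return json.dumps(asdict(self))
```

With the query, the retrieved IDs, and the response stored under one `trace_id`, a bad answer can be traced back to whether the right chunks were ever retrieved.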
Scaling Considerations
RAG systems must scale along two dimensions: knowledge base size and query throughput. As the knowledge base grows from thousands to millions of documents, the vector database must handle larger indices without degrading search latency. Indexing strategies like partitioning, sharding, and tiered storage help manage growth. Some databases support incremental index updates, while others require periodic full rebuilds that must be scheduled during maintenance windows.
Query throughput scaling depends on whether retrieval or generation is the bottleneck. Retrieval scales horizontally, as vector databases can distribute queries across replicas and partitions. Generation is typically the bottleneck, bounded by the language model's throughput capacity. Caching strategies that store responses for identical or near-identical queries can reduce generation load significantly, especially in customer support and documentation lookup scenarios where many users ask similar questions.
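A response cache for identical and trivially different queries can be sketched with a normalized lookup key. This example only treats queries as equal after lowercasing and stripping punctuation and extra whitespace; matching genuinely paraphrased questions would require an embedding-based similarity check, which is not shown. The class and function names are assumptions.

```python
import re
from typing import Callable


def normalize_query(query: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    without_punctuation = re.sub(r"[^\w\s]", "", query.lower())
    return " ".join(without_punctuation.split())


class ResponseCache:
    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate          # the expensive generation call
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def answer(self, query: str) -> str:
        key = normalize_query(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._generate(query)
        self._store[key] = response
        return response
```

The hit and miss counters give a direct measure of how much generation load the cache removes for a given traffic pattern.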
Teams building their first RAG system should start with the simplest architecture that meets their requirements and add complexity only when measured quality gaps justify it. A system with fixed-size chunking, a single embedding model, basic vector search, and a strong prompt can deliver good results. Hybrid retrieval, reranking, and query decomposition should be added incrementally based on where evaluation metrics show the system underperforming.
RAG architecture is a multi-layered system where each component, from document processing through generation, has distinct responsibilities and failure modes. Building a reliable RAG system requires understanding how data flows through each layer and monitoring quality at every stage.