How to Build a RAG Pipeline from Scratch

Updated May 2026
Building a RAG pipeline from scratch gives you full control over every component and a deep understanding of how retrieval augmented generation works in practice. This guide walks through each step from preparing your knowledge base to deploying a production-ready system, with practical recommendations at each stage.

This guide assumes you have a collection of documents you want to make searchable and a language model you want to augment with retrieval. By the end, you will have a working pipeline that ingests documents, embeds them, stores the embeddings, retrieves relevant chunks for each query, and generates grounded responses.

Step 1: Prepare Your Knowledge Base

Start by gathering all the documents that should be searchable. Common sources include help documentation, product manuals, internal wikis, API references, and PDF reports. Organize them in a single directory or configure access to their storage locations (S3 buckets, database connections, API endpoints).

Clean the raw content by removing navigation elements, duplicate headers and footers, boilerplate text, and formatting artifacts. For PDFs, use a quality parser like PyMuPDF, Unstructured, or Docling that handles tables, multi-column layouts, and embedded images. For HTML, strip tags while preserving semantic structure like headings and lists. The quality of parsing directly affects everything downstream.

Catalog your document types and volumes. A knowledge base of 500 help articles has different requirements than one with 50,000 research papers. Understanding the scale helps you choose appropriate infrastructure for later steps.

Step 2: Choose and Configure Chunking

Start with fixed-size chunking at 512 tokens with 50-token overlap. This is the simplest strategy and provides a solid baseline for most content types. Use your embedding model maximum input length as the upper bound for chunk size.

If your documents have clear structural markers (headings, section breaks), try recursive chunking that splits at these boundaries first, then falls back to paragraph and sentence splits for sections that exceed the target size. For code files, use AST-based chunking that respects function and class boundaries.

Attach metadata to each chunk: source document path, section heading, page number, and any relevant tags. This metadata enables filtering during retrieval and helps the generator cite its sources accurately.

Step 3: Select an Embedding Model

For a first implementation, use OpenAI text-embedding-3-small (1536 dimensions, cost-effective) or BGE-M3 (1024 dimensions, open-source, multilingual). Both handle English text well and support input lengths up to 8192 tokens, which accommodates most chunking strategies.

If you choose an API-based model, you will need an API key and a client library. If you choose a self-hosted model, you will need a server with a GPU (or CPU inference for smaller models) running a model serving framework like sentence-transformers, TEI (Text Embeddings Inference), or vLLM.

Test the embedding model on a sample of your chunks to verify that semantically similar chunks produce similar vectors. Compute similarity between a few manually identified relevant pairs and verify the scores are meaningfully higher than random pairs.

Step 4: Set Up a Vector Database

For prototyping, use Chroma (in-process, no server needed) or a hosted service like Pinecone (managed, no infrastructure). For production, choose based on your scale and ops requirements: Pinecone for managed simplicity, Qdrant or Weaviate for self-hosted performance, or pgvector if you already run PostgreSQL.

Create a collection (or index) configured for your embedding dimensions and similarity metric (cosine similarity is the default for most embedding models). Configure metadata fields that you want to filter on during retrieval.

Step 5: Build the Indexing Pipeline

Create a script or service that processes your document collection through the full indexing flow: load each document, chunk it, embed each chunk, and upsert the embeddings with their metadata into the vector database. For small collections (under 10,000 chunks), a simple script that processes documents sequentially works fine. For larger collections, batch the embedding API calls and parallelize the ingestion.

Track which documents have been indexed and their version hashes so you can detect changes and re-index only modified documents on subsequent runs. This incremental indexing approach keeps the knowledge base current without re-processing the entire collection every time.

Step 6: Build the Query Pipeline

The query pipeline handles real-time user requests. It embeds the user query using the same model, searches the vector database for the top-k most similar chunks (start with k=5), assembles the retrieved chunks into a context string, constructs a prompt with the context and the user question, and sends the prompt to the language model for generation.

The system prompt should instruct the model to answer based on the provided context, cite which chunks informed its answer, and indicate when the context does not contain sufficient information. A good starting prompt: "Answer the following question using only the provided context. If the context does not contain the answer, say so. Cite the source for each claim."

Step 7: Add Reranking and Hybrid Search

Once the basic pipeline works, improve retrieval quality by adding two enhancements. First, add keyword search (BM25) alongside vector search and merge results using reciprocal rank fusion. This catches exact term matches that embedding models may miss. Second, add a cross-encoder reranker (Cohere Rerank, Jina Reranker, or an open-source model) that rescores the top 20-50 initial results for more accurate relevance ranking.

These two additions typically provide the largest quality improvement over the basic pipeline, especially for queries containing specific names, product codes, or technical identifiers that embedding models handle poorly.

Step 8: Evaluate and Iterate

Build an evaluation set of 50-100 representative queries paired with their correct source documents. Measure recall at 5, precision at 5, and faithfulness. These baseline metrics tell you where to focus optimization. Low recall means your retriever is missing documents (try different chunk sizes or embedding models). Low precision means too many irrelevant chunks reach the generator (add reranking, reduce top-k). Low faithfulness means the generator is not using the context properly (revise the system prompt, try a different model).

Iterate on one component at a time, measuring the impact of each change against your evaluation set. This disciplined approach prevents the common trap of making multiple changes simultaneously and not knowing which one helped or hurt.

Deployment and Monitoring

Once your pipeline passes evaluation, deploy it as a service with clear API boundaries between components. Separate the indexing pipeline (batch processing, runs on schedule or triggered by document changes) from the query pipeline (real-time, handles user requests). This separation lets you update the knowledge base without disrupting query handling and allows each pipeline to scale independently based on its specific resource demands.

Monitor query latency, retrieval result counts, and generation quality in production. Log the full pipeline execution for each query: the raw query, the embedding, the retrieved chunks with their scores, the assembled prompt, and the generated response. These logs enable debugging when users report incorrect answers and provide data for expanding your evaluation set with real-world query patterns.

Key Takeaway

Build the simplest possible pipeline first (chunk, embed, search, generate), measure its quality, then add complexity (hybrid search, reranking, query rewriting) based on where the metrics show the system underperforming. The best RAG systems are built incrementally, not designed all at once.