How to Build RAG with Ollama and Local Models

Updated May 2026
Building RAG with Ollama lets you run the entire pipeline locally, with no data leaving your machine and no API costs. This is valuable for privacy-sensitive applications, offline environments, development and experimentation without usage fees, and organizations that cannot send data to external services. Ollama provides a simple interface for running open-source language models locally, and when combined with local embedding models and a local vector database, it creates a fully self-contained RAG system.

This guide walks through building a complete RAG pipeline where every component runs on your own hardware. The tradeoff is that local models are less capable than the largest cloud-hosted models, and performance depends on your hardware specifications. However, for many practical use cases, a well-configured local RAG system produces useful results while maintaining complete data privacy.

Step 1: Install Ollama and Choose a Model

Download and install Ollama from the official site. It supports macOS, Linux, and Windows. After installation, Ollama runs as a background service that serves models through a local API endpoint, typically at localhost port 11434.

Pull a model that fits your hardware. For machines with 8 GB of RAM, use Llama 3.1 8B or Mistral 7B, which provide solid generation quality for RAG at the smallest practical size. For 16 GB or more, Llama 3.1 8B runs comfortably with room for the embedding model and vector database. For 32 GB or more, you can run larger models like Llama 3.1 70B (quantized) or Qwen2 72B (quantized) for higher quality generation.

Quantized models (Q4_K_M, Q5_K_M) reduce memory usage significantly with modest quality impact. For RAG specifically, the quality difference between a Q4 quantized model and a full-precision model is smaller than for open-ended generation, because the retrieved context provides the factual grounding that the model needs.

Test your chosen model with a few queries to verify it runs at acceptable speed on your hardware. For RAG, you need generation fast enough that users do not abandon the query while waiting. On a modern laptop CPU, expect 5 to 15 tokens per second for a 7B model. With a GPU, expect 30 to 100+ tokens per second depending on the GPU and model size.

Step 2: Set Up Local Embeddings

For local embeddings, use the sentence-transformers library in Python with a model like all-MiniLM-L6-v2 (384 dimensions, fast, low memory) or BGE-small-en-v1.5 (384 dimensions, higher quality). For multilingual content, use BGE-M3 (1024 dimensions) which handles over 100 languages. These models run on CPU without issues, though GPU acceleration speeds up batch embedding for large knowledge bases.

Ollama itself also supports embedding models. You can pull nomic-embed-text or mxbai-embed-large through Ollama and use the same API interface for both embeddings and generation. This simplifies the stack by using a single tool for both tasks, though the sentence-transformers library offers a wider selection of embedding models.

Whichever embedding approach you choose, test it on a sample of your documents before committing to a full indexing run. Generate embeddings for a few query-document pairs you know are relevant and verify that cosine similarity scores are meaningfully higher for relevant pairs than for random pairs. This quick sanity check confirms the model captures the semantic relationships in your content.

Local embedding eliminates the per-token cost of API-based embedding services. For large knowledge bases with hundreds of thousands of chunks, this cost savings is substantial. The tradeoff is that local embedding models are typically less powerful than the best API models, which may affect retrieval accuracy on specialized content.

Step 3: Configure a Local Vector Database

Chroma is the natural choice for local RAG. It runs in-process as a Python library with no separate server, stores data in a local directory, and requires zero configuration. Install it with pip, create a collection with the correct embedding dimensions, and start inserting vectors. For a local RAG system, Chroma provides everything you need without operational overhead.

If you want a more robust local option, Qdrant runs as a single Docker container and provides better performance at scale, richer filtering capabilities, and a proper client-server architecture. For a local development setup, run Qdrant with Docker and configure it to persist data to a local directory. This gives you the benefits of a production-grade vector database while keeping everything on your machine.

For PostgreSQL users, pgvector keeps the entire stack within a single database. Install the extension, create a table with a vector column, and query using SQL. This approach is especially convenient if your knowledge base metadata already lives in PostgreSQL, since you can join vector search results with metadata tables in a single query.

Create your collection with dimensions matching your embedding model (384 for all-MiniLM or BGE-small, 1024 for BGE-M3, 768 for nomic-embed-text). Use cosine similarity as the distance metric. Add metadata fields for source document path, section heading, and any filtering attributes you need during retrieval.

Step 4: Build the Indexing Pipeline

Create a Python script that processes your document collection through chunking, embedding, and storage. Load each document, split it into chunks (start with 512 tokens and 50-token overlap), generate embeddings using your local model, and upsert the embeddings with metadata into your vector database.

For document loading, use PyMuPDF for PDFs, python-docx for Word files, and BeautifulSoup for HTML. For text files and Markdown, read them directly and split on natural boundaries. Attach metadata to each chunk: the source file path, page number or section heading, and the chunk index within the document.

Batch the embedding generation for efficiency. Processing chunks one at a time is slow. Instead, collect chunks into batches of 32 to 64 and embed them together. On CPU, this reduces overhead from model loading and memory allocation. On GPU, batching fills the compute pipeline more efficiently.

For a knowledge base of a few hundred documents, the indexing process runs in minutes on a modern laptop. For larger collections (thousands of documents), expect the embedding step to take longer, especially on CPU. Monitor progress and save intermediate results so you can resume if the process is interrupted.

Step 5: Build the Query Pipeline

The query pipeline connects retrieval to generation. When a user submits a query, embed it using the same model used for indexing, search the vector database for the top k similar chunks (start with k=5), assemble the retrieved chunks into a context string, construct a prompt with the context and the user question, and send it to Ollama for generation.

Use the Ollama Python library or make HTTP requests to the local API. The chat endpoint accepts a system message (your RAG prompt), a user message (the query with context), and model parameters. Set the system message to instruct the model to answer based only on the provided context, cite sources, and indicate when information is insufficient.

Format the context clearly so the model can distinguish between different sources. Number each chunk and include its source document name. A simple format like "[Source 1: filename.pdf, page 3]" followed by the chunk text gives the model enough structure to provide accurate citations in its response.

Set the temperature to 0.1 to 0.3 for factual RAG responses. Local models are more prone to hallucination than larger cloud models, and lower temperature reduces this tendency. Also set a reasonable max_tokens limit (512 to 1024 for most Q&A use cases) to prevent the model from generating unnecessarily long responses that drift from the context.

Step 6: Optimize for Quality and Speed

Context window management: Local models have smaller context windows than cloud models (typically 4096 to 8192 tokens for 7B models, though some support up to 32K or 128K). Calculate how much of the context window your retrieved chunks consume and leave room for the system prompt and the generated response. If your top-5 chunks total 2000 tokens and you want a 500-token response with a 200-token system prompt, you need at least a 2700-token context window. Reduce top-k or chunk size if you are hitting context limits.

Model selection for RAG: For RAG specifically, instruction-tuned models (the "instruct" or "chat" variants) perform better than base models because they follow the system prompt more reliably. Llama 3.1 8B Instruct, Mistral 7B Instruct, and Phi-3 Mini are all good choices for local RAG. Test each on your evaluation queries to find which handles your content domain best.

GPU acceleration: If you have an NVIDIA GPU with at least 6 GB of VRAM, Ollama automatically uses it for inference, providing a large speed improvement over CPU. For AMD GPUs, Ollama supports ROCm on Linux. For Apple Silicon Macs, Ollama uses the Metal framework for GPU acceleration. Check that Ollama detects your GPU by running a simple generation test and comparing the tokens-per-second output.

Caching and warm starts: Ollama keeps recently used models loaded in memory. If you run queries infrequently, the model may be unloaded between queries, causing a cold start delay. Configure Ollama keep-alive settings to keep the model loaded for longer periods if latency on the first query matters for your use case.

When Local RAG Makes Sense

Local RAG with Ollama is the right choice when data privacy prevents sending documents to external APIs, when you need to work offline or in air-gapped environments, when you want to experiment and iterate without per-query costs, or when your organization has policies against using cloud AI services for certain data types. It is also useful for development and testing, where you can build and debug your RAG pipeline locally before deploying with cloud models in production.

Local RAG is not the best choice when you need the highest possible generation quality (cloud models like GPT-4o and Claude are still more capable than local 7B to 13B models), when your hardware is limited (under 8 GB RAM makes local inference impractical), or when you need to serve many concurrent users (local models handle one request at a time unless you set up a more complex serving infrastructure).

Scaling Beyond a Single Machine

For team use, you can run Ollama on a shared server with a GPU and expose it to your local network. Team members connect their RAG applications to the shared Ollama instance instead of running models on their individual machines. This centralizes the hardware requirement while keeping data within your network.

For production workloads, consider vLLM or text-generation-inference (TGI) as alternatives to Ollama. These serving frameworks support concurrent requests, continuous batching, and more advanced scheduling that Ollama does not provide. They are more complex to set up but necessary if your local RAG system needs to handle multiple simultaneous users.

Key Takeaway

A fully local RAG system with Ollama, local embeddings, and a local vector database gives you complete data privacy and zero API costs. Use a 7B to 8B parameter instruction-tuned model for generation, a lightweight embedding model like all-MiniLM or BGE-small, and Chroma for vector storage. The quality is lower than cloud-hosted models, but sufficient for many practical use cases, and improves steadily as open-source models advance.