How to Set Up a Local Embedding Server
Step 1: Choose an Embedding Model
Embedding models convert text into fixed-length vectors (typically 384-1536 dimensions) where semantically similar text produces similar vectors. The most popular self-hosted embedding models in 2026:
nomic-embed-text (137M parameters): A common default for local RAG (retrieval-augmented generation). Produces 768-dimension vectors with strong performance on the MTEB benchmark. Runs on CPU with minimal resources. Available directly through Ollama.
mxbai-embed-large (335M parameters): Higher quality embeddings with 1024 dimensions. Better retrieval accuracy than nomic-embed-text at a modest increase in resource usage. Good for applications where retrieval quality directly affects output quality.
all-minilm (33M parameters): The smallest practical option, producing 384-dimension vectors. Ideal for resource-constrained environments or when embedding speed matters more than maximum quality.
snowflake-arctic-embed (110M parameters in its medium variant; other sizes are also published): Optimized for retrieval tasks specifically, with strong performance on search-oriented benchmarks. A good choice if your primary use case is document search rather than general semantic similarity.
Step 2: Set Up with Ollama
The easiest path to a local embedding server runs through Ollama. Install Ollama if you have not already, then pull an embedding model: ollama pull nomic-embed-text. The download is small (under 300MB), so it usually finishes quickly on a typical broadband connection, and the model is available through the API as soon as the pull completes.
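To generate embeddings, send a POST request to http://localhost:11434/api/embed with a JSON body containing the model name (the model field) and the text to embed (the input field, which accepts a single string or a list of strings). The response contains an embeddings array with one vector per input, which you can store in a vector database or use for similarity calculations directly.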
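The request described above can be sketched in Python using only the standard library. This is a minimal sketch, assuming Ollama is running on its default port with nomic-embed-text already pulled; the helper names (build_embed_request, parse_embed_response, embed) are illustrative and not part of any Ollama client.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embed"  # endpoint named in the text


def build_embed_request(texts, model="nomic-embed-text"):
    """Build the JSON body: 'input' may be one string or a list of strings."""
    return {"model": model, "input": texts}


def parse_embed_response(body):
    """Pull the list of vectors out of a decoded /api/embed response."""
    vectors = body.get("embeddings")
    if not vectors:
        raise ValueError("response contained no embeddings")
    return vectors


def embed(texts, model="nomic-embed-text", url=OLLAMA_URL, timeout=30):
    """POST texts to a running Ollama server and return one vector per input."""
    data = json.dumps(build_embed_request(texts, model)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embed_response(json.load(response))


# Example call (requires a running Ollama server):
#   vectors = embed(["What is a vector database?"])
#   len(vectors[0]) is the embedding dimension of the chosen model
```

Separating payload construction and response parsing from the network call keeps the two pure functions easy to test without a server.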
Ollama can serve both an embedding model and a language model simultaneously. The embedding model uses minimal memory (under 500MB), leaving the majority of your resources available for the language model. This dual-model setup is the foundation of a fully local RAG pipeline.
Step 3: Connect a Vector Database
Embeddings need to be stored in a vector database for efficient similarity search. The database indexes the vectors so that finding the most similar documents to a query takes milliseconds, even across millions of documents.
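To make the idea of similarity search concrete, here is a small sketch of what a vector database computes, written as a brute-force scan in plain Python. It is illustrative only: real vector databases avoid scanning every vector by building an index, and the function names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=3):
    """Return (index, score) pairs for the k stored vectors most similar to query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A linear scan like this costs time proportional to the number of stored vectors per query, which is why an indexed database becomes worthwhile as the collection grows.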
ChromaDB is the simplest option for getting started. It runs as a Python library (no separate server needed) and stores vectors locally. Install it with pip install chromadb and initialize a collection in a few lines of code. It is suitable for development and small-to-medium document collections (up to a few hundred thousand documents).
Qdrant is a dedicated vector database that runs as a standalone server. It handles larger collections efficiently, supports filtering and metadata queries alongside vector search, and provides a REST API. Run it via Docker for quick setup.
pgvector adds vector search capability to PostgreSQL. If your application already uses PostgreSQL, pgvector lets you store embeddings alongside your existing data without adding a new database to your infrastructure.
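For the pgvector option, the statements involved might look like the following sketch. The table name (passages), the column names, and the helper function are hypothetical; the %s placeholders assume a driver such as psycopg, and the 768 dimension assumes nomic-embed-text. The block only defines the SQL text and a formatting helper, so it runs without a database.

```python
EMBEDDING_DIM = 768  # assumption: matches nomic-embed-text; change for other models

# One-time setup: enable the extension and create a table with a vector column.
CREATE_SQL = f"""
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS passages (
    id BIGSERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    embedding vector({EMBEDDING_DIM})
);
"""

# <=> is pgvector's cosine-distance operator; smaller distance means more similar.
SEARCH_SQL = """
SELECT id, content
FROM passages
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def to_pgvector_literal(vector):
    """Format a Python sequence as pgvector's text form, e.g. '[1.0,0.5]'."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


# Example with a psycopg cursor (assumed, not executed here):
#   cur.execute(SEARCH_SQL, (to_pgvector_literal(query_vector), 5))
```

Keeping embeddings in the same table as the passage text means a single query can return both the match and its content.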
Step 4: Build a RAG Pipeline
A complete local RAG pipeline works as follows: documents are chunked into passages (typically 200-500 tokens each), each passage is embedded using the local embedding model, and the vectors are stored in the vector database. When a user asks a question, the question is embedded, the vector database returns the most similar passages, and those passages are included in the prompt to the language model as context for generating an answer.
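The flow above can be sketched end to end with an in-memory index. This is a simplified illustration, not a production design: InMemoryIndex and build_prompt are hypothetical names, and toy_embed is a stand-in word-count embedding so the example runs without a model. In a real pipeline you would pass a function that calls the local embedding model instead.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def toy_embed(text):
    """Stand-in embedding: counts of three fixed words. Replace with a real model call."""
    vocab = ["ollama", "vector", "chunk"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]


class InMemoryIndex:
    """Stores (passage, vector) pairs and ranks them against a question."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.entries = []

    def add(self, passages):
        # Indexing step: embed every passage once and keep the vector.
        for passage in passages:
            self.entries.append((passage, self.embed_fn(passage)))

    def search(self, question, k=3):
        # Query step: embed the question and return the k closest passages.
        query_vec = self.embed_fn(question)
        ranked = sorted(
            self.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True
        )
        return [passage for passage, _ in ranked[:k]]


def build_prompt(question, passages):
    """Place retrieved passages ahead of the question as numbered context."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The string returned by build_prompt is what would be sent to the language model; numbering the passages makes it easier to ask the model to cite which one it used.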
The entire pipeline runs locally: embedding model via Ollama, vector database on the same machine, and language model via Ollama or vLLM. No data leaves your infrastructure at any point.
For chunking strategy, overlap adjacent chunks by 10-20% to avoid losing context at chunk boundaries. Use semantic chunking (splitting on paragraph or section boundaries) rather than fixed character counts when possible. Index chunk metadata (source document, page number, section title) alongside the vectors to provide source attribution in responses.
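The overlap and metadata advice can be illustrated with a short sketch. Note the assumptions: this version uses fixed-size word windows (whitespace-separated words as a rough proxy for tokens) rather than the semantic chunking recommended above, and the function name and metadata keys are hypothetical.

```python
def chunk_words(text, chunk_size=300, overlap_ratio=0.15, source="unknown"):
    """Split text into overlapping word windows and attach source metadata."""
    if chunk_size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("chunk_size must be positive and overlap_ratio in [0, 1)")
    words = text.split()
    overlap = round(chunk_size * overlap_ratio)
    step = max(1, chunk_size - overlap)  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(
            {"text": " ".join(window), "source": source, "start_word": start}
        )
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Each returned dictionary carries the chunk text plus the fields needed for attribution, so the metadata can be stored next to the vector when the chunk is indexed.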
A local embedding server using Ollama and nomic-embed-text takes minutes to set up, uses minimal resources, and enables fully private RAG pipelines. Pair it with ChromaDB for development or Qdrant for production workloads.