How to Run Embeddings with Ollama
Text embeddings convert words and sentences into numerical vectors that capture semantic meaning. Similar concepts produce vectors that are close together in the embedding space, enabling machines to understand relationships between texts that share no exact words in common. A search for "automobile maintenance" can find documents about "car repair" because their embeddings are nearby in vector space, even though the terms differ completely.
Pull an Embedding Model
Ollama's model library includes several embedding models optimized for different use cases. Download one with ollama pull nomic-embed-text for a solid general-purpose model that produces 768-dimensional vectors. For higher accuracy at the cost of larger vectors, use ollama pull mxbai-embed-large which produces 1024-dimensional vectors. For multilingual text, consider ollama pull snowflake-arctic-embed which supports multiple languages effectively.
Embedding models are much smaller than generation models, typically ranging from 250MB to 1.5GB. They load quickly, consume minimal VRAM, and process text much faster than generation models since they only need a single forward pass through the network rather than generating tokens one at a time. You can comfortably run an embedding model alongside a generation model on the same GPU.
The choice of embedding model affects your entire pipeline. Once you embed your document collection with a specific model, you must use the same model for query embeddings at search time. Switching models requires re-embedding your entire collection because different models produce incompatible vector spaces. Choose your embedding model carefully before processing large document sets.
Generate Embeddings via the API
The Ollama API provides the /api/embed endpoint for embedding generation. Send a POST request with the model name and your input text. For a single string, set "input": "Your text here". For batch processing, set "input": ["First text", "Second text", "Third text"] to embed multiple texts in one request. The response contains an embeddings array with one vector per input.
Using the Python client library, call ollama.embed(model='nomic-embed-text', input='Your text here') for the same result in a cleaner interface. The Python client handles JSON serialization, HTTP connection management, and response parsing automatically. Batch processing works the same way by passing a list of strings as the input parameter.
Each embedding vector is a list of floating-point numbers whose length depends on the model. Nomic-embed-text produces 768 numbers per vector, mxbai-embed-large produces 1024. These vectors are what you store and search against. The raw numbers have no human-readable meaning on their own, but their relative distances and directions encode semantic relationships between the original texts.
Store Vectors in a Database
Raw embedding vectors need a vector database for efficient storage and similarity search. ChromaDB is a popular choice for Python projects because it runs in-process with no separate server needed. Install it with pip install chromadb, create a collection, and add your documents with their embeddings. ChromaDB handles indexing automatically and provides fast nearest-neighbor search.
For production deployments, Qdrant and pgvector offer more robust options. Qdrant runs as a standalone service with a REST API, supports filtering, and handles large collections efficiently. pgvector adds vector search capabilities to PostgreSQL, letting you store embeddings alongside traditional relational data in the same database. Both options scale better than in-process solutions for collections with millions of documents.
When storing vectors, always save the original text or a reference to it alongside the vector. You need the source text when retrieving similar documents, as the vector itself cannot be converted back to text. Store metadata like document titles, URLs, timestamps, and categories to enable filtered search where you can restrict results to specific document types or date ranges.
Query with Semantic Search
At query time, embed the user's search query with the same model used for the document collection. Pass the query vector to your vector database's search function, which returns the most similar document vectors ranked by distance. Common distance metrics include cosine similarity (most popular for text embeddings), dot product, and Euclidean distance. Most vector databases default to cosine similarity.
The search results give you the most semantically relevant documents from your collection. In a RAG application, take the top 3 to 5 results and pass their text as context to a generation model along with the user's question. The generation model reads the retrieved context and produces an answer grounded in your specific documents rather than relying solely on its training data.
Tune search quality by adjusting chunk size during document processing. Smaller chunks (200 to 500 tokens) produce more precise matches but may miss broader context. Larger chunks (500 to 1000 tokens) capture more context but may dilute the relevance signal with surrounding text. Experiment with different chunk sizes on your specific document collection to find the optimal balance for your use case.
Building a RAG Pipeline
A complete RAG pipeline with Ollama has three phases: ingestion, storage, and retrieval. During ingestion, read your documents, split them into chunks with optional overlap between chunks, and generate embeddings for each chunk using the embed endpoint. During storage, insert the chunks and their vectors into your vector database with any relevant metadata. During retrieval, embed the user query, search for similar chunks, and pass them to a chat model for answer generation.
Chunk overlap (typically 50 to 100 tokens shared between consecutive chunks) helps preserve context that would otherwise be split across chunk boundaries. Without overlap, a sentence that spans two chunks might lose its meaning in both. With overlap, at least one chunk contains the complete sentence, improving retrieval accuracy for queries that target that information.
The generation step uses Ollama's chat endpoint with the retrieved chunks inserted into the system message or user message as context. A common prompt pattern instructs the model to answer the question based only on the provided context, and to say it does not know if the context does not contain relevant information. This grounding prevents the model from generating answers from its general knowledge when the retrieved documents do not cover the topic.
Embedding Model Comparison
Nomic-embed-text is the most widely used embedding model on Ollama. It produces 768-dimensional vectors, has a context window of 8192 tokens, handles English text well, and runs efficiently on modest hardware. Its balance of quality, speed, and resource usage makes it the default recommendation for most projects.
Mxbai-embed-large produces 1024-dimensional vectors and generally achieves higher retrieval accuracy than nomic-embed-text on standard benchmarks. The larger dimensions provide more room for encoding semantic nuances, which benefits applications where retrieval precision is critical. The tradeoff is slightly higher computation time and larger storage requirements for the vectors.
Snowflake-arctic-embed comes in multiple sizes (xs, s, m, l) and performs well on both English and multilingual text. The smallest variant is extremely fast and suitable for real-time applications where latency matters more than maximum accuracy. The largest variant competes with mxbai-embed-large on accuracy while offering better multilingual coverage.
Batch Processing and Performance
For large document collections, batch processing is essential. Instead of embedding one text at a time, group your chunks into batches of 50 to 100 and send them in a single API call. The model processes batched inputs more efficiently than sequential single inputs because it amortizes the model loading and context switching overhead across all items in the batch.
Processing speed depends on the embedding model size and your hardware. On a modern GPU, nomic-embed-text can embed roughly 100 to 300 chunks per second in batch mode. On CPU, expect 10 to 30 chunks per second. For very large collections (millions of documents), consider running the embedding job overnight or distributing it across multiple machines, each running its own Ollama instance.
Monitor memory usage during batch processing, especially on GPU. Embedding models are small, but large batch sizes with long text chunks can temporarily require significant VRAM for the input token processing. If you encounter out-of-memory errors during batch embedding, reduce the batch size or the maximum chunk length. The total embedding quality is identical regardless of batch size.
Ollama makes local embedding generation straightforward with dedicated models like nomic-embed-text and mxbai-embed-large. Combined with a vector database, these embeddings enable semantic search and RAG pipelines that keep your data entirely on your own hardware while delivering retrieval quality comparable to cloud embedding services.