The Embedding Layer: Vector Search and Similarity

Updated May 2026
The embedding layer converts text into numerical vectors that capture semantic meaning, then stores and searches those vectors to find relevant information. This is the core technology behind retrieval-augmented generation (RAG), enabling your AI models to answer questions about documents, codebases, and knowledge bases they were never trained on.

How Embeddings Work

An embedding model is a neural network trained to convert text into a dense numerical vector, typically an array of 384 to 1536 floating-point numbers. The training process teaches the model to place semantically similar texts close together in vector space: "how to train a dog" and "puppy obedience training" produce vectors that lie close together, while "quantum physics equations" produces a vector far away from both. This geometric property enables similarity search, where you find stored text similar to a query by measuring the distance between their vectors.

The distance between vectors is measured using cosine similarity (the cosine of the angle between vectors, ranging from -1 to 1) or Euclidean distance (the straight-line distance between points). Cosine similarity is more commonly used because it ignores vector magnitude and compares only direction, which produces intuitive results: a similarity of 1.0 means the vectors point the same way (effectively identical meaning), 0.0 means unrelated, and values in between indicate degrees of semantic overlap. Negative values mean the vectors point in opposing directions.
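Both measures can be sketched in a few lines of standard-library Python. This is a minimal illustration on toy 3-dimensional vectors, not how a vector database computes them internally; the function names are chosen for this example only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors; real embeddings have hundreds of dimensions.
v = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]       # same direction as v, twice the magnitude
orthogonal = [3.0, 0.0, -1.0]  # dot product with v is 3 + 0 - 3 = 0

print(cosine_similarity(v, scaled))      # direction only, magnitude ignored
print(cosine_similarity(v, orthogonal))  # unrelated directions
print(euclidean_distance(v, scaled))     # magnitude matters here
```

The scaled vector has a cosine similarity of 1 with the original even though their Euclidean distance is non-zero, which is why cosine similarity is the usual choice when vector length carries no meaning.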

Embedding models are much smaller and faster than language models. nomic-embed-text, one of the most popular options, has about 137 million parameters and runs quickly on CPU. Embedding a 512-token passage takes single-digit milliseconds on modern hardware. This makes it practical to embed large document collections (millions of passages) and to generate query embeddings in real time without GPU acceleration.

Vector Databases

A vector database stores embedding vectors with associated metadata and provides fast approximate nearest-neighbor (ANN) search. The "approximate" part is important: exact nearest-neighbor search in high-dimensional spaces is computationally expensive (it compares the query with every stored vector, so it scales linearly with database size). Vector databases therefore use indexing algorithms like HNSW (Hierarchical Navigable Small World), which find very-close-to-exact results in roughly logarithmic time.
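To make the linear cost concrete, here is a minimal sketch of the exact search that ANN indexes approximate. It scores every stored vector against the query, so doubling the collection doubles the work. The `exact_nearest_neighbors` function and the in-memory `store` dictionary are illustrative assumptions, not part of any vector database's API.

```python
import heapq
import math


def exact_nearest_neighbors(query, vectors, k=3):
    """Return the k (id, score) pairs with the highest cosine similarity.

    Every stored vector is compared with the query, so the cost grows
    linearly with the number of vectors.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms if norms else 0.0

    scored = ((vec_id, cosine(query, vec)) for vec_id, vec in vectors.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-dimensional collection keyed by id.
store = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
top = exact_nearest_neighbors([1.0, 0.0], store, k=2)
print(top)
```

An HNSW index avoids the full scan by navigating a layered graph of neighbors, trading a small amount of recall for a large reduction in comparisons.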

Qdrant is a leading self-hosted vector database for AI applications. Written in Rust, it handles millions of vectors with consistently low search latency. Its distinguishing feature is payload filtering, which lets you combine vector similarity search with metadata constraints. For example, you can search for the most semantically similar documents that were also created within the last month, or that belong to a specific user. This filtered search capability is essential for production RAG systems where context relevance depends on more than semantic similarity alone.
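The idea behind payload filtering can be illustrated without a database. The sketch below is plain Python and is not Qdrant's API: it simply drops points whose metadata fails a predicate and then ranks the rest by similarity. The point layout, the `filtered_search` function, and the `recent_alice` predicate are all hypothetical names for this example.

```python
import math
from datetime import date, timedelta


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def filtered_search(query, points, predicate, k=3):
    """Rank only the points whose payload satisfies the predicate."""
    candidates = [p for p in points if predicate(p["payload"])]
    candidates.sort(key=lambda p: cosine(query, p["vector"]), reverse=True)
    return candidates[:k]


today = date(2026, 5, 1)  # fixed date so the example is reproducible
points = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"user": "alice", "created": date(2026, 4, 20)}},
    {"id": 2, "vector": [0.9, 0.1], "payload": {"user": "bob", "created": date(2026, 4, 25)}},
    {"id": 3, "vector": [0.8, 0.2], "payload": {"user": "alice", "created": date(2025, 1, 5)}},
]


def recent_alice(payload):
    """Documents owned by alice and created within the last 30 days."""
    return payload["user"] == "alice" and today - payload["created"] <= timedelta(days=30)


hits = filtered_search([1.0, 0.0], points, recent_alice, k=2)
print([p["id"] for p in hits])
```

A production database applies such constraints inside the index rather than filtering a full list first, but the contract is the same: results must satisfy the metadata condition and are ordered by similarity.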

PostgreSQL with pgvector provides vector search as an extension to an existing relational database. Version 0.5.0 introduced HNSW indexing, which performs within 10 to 20 percent of dedicated vector databases for most workloads, and version 0.8 added iterative index scans that improve filtered queries. The advantage is operational simplicity: if you already run PostgreSQL, you add vector search without deploying and managing a separate database. Your vectors live alongside your relational data, participate in transactions, and are included in your existing backup strategy.

ChromaDB targets developers who want the simplest possible setup. It runs in-process (embedded in your Python application) or as a standalone server, stores data in SQLite by default, and requires minimal configuration. Its limitation is scale: ChromaDB works well for prototyping and small collections (under 500,000 vectors) but lacks the production features (replication, backup automation, clustering) needed for larger deployments.

Choosing an Embedding Model

nomic-embed-text is the default recommendation for English-language RAG systems. It produces 768-dimensional vectors that balance quality and storage efficiency, runs quickly on CPU, and is available through Ollama with a simple pull command. Its retrieval quality is competitive with larger models on standard benchmarks, and the 768 dimensions keep your vector storage requirements manageable.

BGE (BAAI General Embedding) models offer strong multilingual support, most notably through the BGE-M3 variant. If your documents include multiple languages or your users query in different languages, these models capture cross-lingual semantics better than English-focused alternatives. The family is also available in several sizes (small, base, large) so you can choose the quality-speed tradeoff that fits your deployment.

For maximum retrieval accuracy, the E5 and GTE model families rank among the strongest options on public retrieval benchmarks. These larger models (up to 1024 or 1536 dimensions) achieve higher benchmark scores at the cost of slower embedding speed and larger storage requirements. The quality difference is measurable on benchmarks but may not matter in practice unless your RAG system handles highly nuanced queries across large, diverse document collections.
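The storage side of this tradeoff is simple arithmetic. The sketch below assumes vectors are stored as 32-bit floats (4 bytes per value) and counts only the raw vectors; index structures, payloads, and any quantization are left out, so treat the numbers as a lower-bound estimate.

```python
def vector_storage_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the vectors alone, assuming float32 (4 bytes) values."""
    return num_vectors * dimensions * bytes_per_value


# Compare the dimension counts mentioned in this section.
for dims in (384, 768, 1024, 1536):
    gib = vector_storage_bytes(1_000_000, dims) / 1024**3
    print(f"{dims:>4} dims: {gib:.2f} GiB per million vectors")
```

Doubling the dimension count doubles the raw storage, so moving from a 768-dimensional model to a 1536-dimensional one doubles the footprint of the same collection.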

Document Chunking Strategies

Before documents can be embedded, they must be split into chunks. Each chunk becomes a separate vector in the database, and when a user queries, the system retrieves the most relevant chunks as context for the LLM. The chunking strategy directly affects retrieval quality: chunks that are too small lose context and produce fragmented, confusing results. Chunks that are too large dilute the relevant information with surrounding noise, reducing the precision of similarity search.

Fixed-size chunking (splitting every 512 tokens with a 50-token overlap) is the simplest approach and a reasonable starting point. The overlap means that text near a chunk boundary appears in both adjacent chunks, reducing the chance of losing context at split points. Most RAG tutorials and default configurations use this method.
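A minimal sketch of fixed-size chunking with overlap follows. It operates on a list of tokens and is agnostic about how they were produced; in a real pipeline the tokens would come from the embedding model's tokenizer, which this example does not attempt to reproduce. The `chunk_fixed` name is hypothetical.

```python
def chunk_fixed(tokens, chunk_size=512, overlap=50):
    """Split a token list into chunks of chunk_size that share overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reached the end of the input
    return chunks


# Integers stand in for tokens so the overlap is easy to inspect.
tokens = list(range(1000))
chunks = chunk_fixed(tokens)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk starts `chunk_size - overlap` tokens after the previous one, so the last 50 tokens of one chunk are the first 50 of the next.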

Structure-aware chunking uses document formatting (headings, paragraphs, code blocks, list items) to find natural split points. A Markdown document gets split at heading boundaries. A code file gets split at function boundaries. An HTML page gets split at section boundaries. This approach produces more coherent chunks that better represent the document's logical structure, improving retrieval quality for well-formatted content.
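For the Markdown case, a heading-based splitter can be sketched with a single regular expression. This is a simplified illustration: it treats any line that starts with one to six `#` characters followed by text as a heading, and it does not recognize fenced code blocks, so a `#` comment inside a code block would be mistaken for a heading. The function name is hypothetical.

```python
import re

# A heading line: 1 to 6 '#' characters at the start of a line, then text.
HEADING = re.compile(r"^#{1,6}\s+\S", re.MULTILINE)


def split_markdown_by_heading(text):
    """Split Markdown so that each chunk starts at a heading line."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(text)]
    chunks = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]


doc = "Intro text.\n\n# Setup\nInstall it.\n\n## Options\nSet flags.\n"
for chunk in split_markdown_by_heading(doc):
    print(repr(chunk))
```

Sections produced this way vary in length, so in practice they are usually combined with a size limit, which is what the recursive approach addresses.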

Recursive chunking combines both approaches: it tries to split at the largest structural boundary that fits within the target chunk size, falling back to smaller boundaries (paragraph, sentence, word) when sections are too long. LangChain's RecursiveCharacterTextSplitter implements this pattern and is widely used in production RAG systems.
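The recursive idea can be sketched in plain Python. This is a simplified illustration of the pattern, not LangChain's implementation: it measures size in characters rather than tokens, drops the separator wherever a chunk boundary falls, and uses a hypothetical `recursive_split` function with an assumed separator order of paragraph, line, sentence, word.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most max_chars, preferring coarse boundaries."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No boundary left to try: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_chars:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_chars:
            current = part  # start a new chunk with this part
        else:
            # The part alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(part, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks


long_paragraph = " ".join(["word"] * 100)  # 499 characters, no sentence breaks
text = "Short paragraph.\n\n" + long_paragraph
pieces = recursive_split(text, max_chars=200)
print([len(p) for p in pieces])
```

The short paragraph survives intact because the paragraph boundary is tried first, while the oversized paragraph falls through to word-level splitting.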

Building a RAG Pipeline

A complete RAG pipeline connects the embedding layer to the LLM layer through four steps: chunk the source documents, embed each chunk and store the vectors, embed the user's query at request time, and retrieve the top-k most similar chunks to include in the LLM prompt as context. The retrieved chunks appear in the system prompt or as reference material, giving the LLM specific information to base its response on rather than relying solely on its training data.
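The four steps can be sketched end to end with standard-library Python. The `toy_embed` function below is a hypothetical stand-in for a real embedding model: it counts words rather than capturing meaning, and it exists only so the example runs without a model or a vector database. All function names and the prompt wording are assumptions for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norms = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norms if norms else 0.0


def build_index(documents, chunk_words=40):
    """Steps 1 and 2: chunk each document, embed each chunk, store the vectors."""
    index = []
    for doc in documents:
        words = doc.split()
        for i in range(0, len(words), chunk_words):
            chunk = " ".join(words[i:i + chunk_words])
            index.append((chunk, toy_embed(chunk)))
    return index


def retrieve(index, query, top_k=2):
    """Steps 3 and 4: embed the query and return the top_k most similar chunks."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(query, chunks):
    """Place the retrieved chunks in the prompt as reference material."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


docs = [
    "Qdrant stores vectors with payload metadata and supports filtering.",
    "Fixed-size chunking splits documents every 512 tokens with overlap.",
    "Cosine similarity measures the angle between two vectors.",
]
index = build_index(docs)
question = "How does cosine similarity compare vectors?"
prompt = build_prompt(question, retrieve(index, question, top_k=1))
print(prompt)
```

Swapping `toy_embed` for a real embedding model and the list-based index for a vector database changes the quality and scale of retrieval, but not the shape of the pipeline.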

Retrieval parameters that matter include top-k (how many chunks to retrieve, typically 3 to 10), similarity threshold (a minimum score below which results are discarded as irrelevant), and re-ranking (using a cross-encoder model to reorder the initial results for higher precision). Tuning these parameters for your specific documents and query patterns has a larger impact on RAG quality than choosing a slightly better embedding model.
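Top-k and the similarity threshold interact, which a short sketch makes visible. The function below assumes the search step has already produced (chunk, score) pairs; the default values of 5 and 0.5 are placeholders for illustration, not recommendations, and `select_chunks` is a hypothetical name.

```python
def select_chunks(scored_chunks, top_k=5, min_score=0.5):
    """Keep the top_k highest-scoring chunks whose score is at least min_score."""
    kept = [(chunk, score) for chunk, score in scored_chunks if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


results = [
    ("chunk A", 0.82),
    ("chunk B", 0.31),
    ("chunk C", 0.67),
    ("chunk D", 0.55),
]
selected = select_chunks(results, top_k=2, min_score=0.5)
print(selected)
```

With a threshold in place, a query with no good matches returns fewer than top-k chunks (possibly none), which is usually preferable to padding the prompt with irrelevant context.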

Monitoring Retrieval Quality

The quality of your RAG system depends on whether the retrieved chunks actually contain information relevant to the user's query. Poor retrieval means the model receives irrelevant context, which leads to inaccurate or hallucinated responses regardless of how good the model itself is. Monitoring retrieval quality requires logging the retrieved chunks alongside each query and periodically reviewing whether the top results are genuinely relevant.

Build a test set of representative queries with known relevant documents. Run these queries against your RAG pipeline periodically and measure precision (what fraction of retrieved chunks are actually relevant) and recall (what fraction of relevant documents are actually retrieved). When these metrics drop below acceptable thresholds, investigate the cause: it might be poor chunking that splits relevant context across chunk boundaries, an embedding model that struggles with your domain vocabulary, or stale index data that no longer reflects the current document collection.
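Both metrics are straightforward to compute once the test set exists. The sketch below assumes retrieved results and known-relevant items are identified by the same kind of ID (for example, document IDs), and that `retrieve_fn` is a hypothetical callable wrapping your pipeline that returns a list of IDs for a query.

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate(test_set, retrieve_fn, k=5):
    """Average precision and recall at k over (query, relevant_ids) pairs."""
    scores = [precision_recall(retrieve_fn(query, k), relevant) for query, relevant in test_set]
    if not scores:
        return 0.0, 0.0
    avg_precision = sum(p for p, _ in scores) / len(scores)
    avg_recall = sum(r for _, r in scores) / len(scores)
    return avg_precision, avg_recall


# Four results retrieved, two of them among the three relevant documents.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "c", "e"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Running `evaluate` on a schedule and recording the averages gives you a baseline, so a drop after a re-index or a model change is visible immediately.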

Re-ranking improves retrieval precision by using a cross-encoder model to reorder the initial vector search results. A cross-encoder reads the query and a candidate document together in a single pass, so it can weigh how the two texts relate to each other and produce a more accurate relevance score than vector similarity alone. The tradeoff is speed: cross-encoders are slower than vector search because they must run the model on every query-document pair at request time rather than comparing pre-computed vectors. The common pattern is to retrieve a larger set of candidates with vector search (top 20 to 50) and then re-rank to select the final top 3 to 5 for inclusion in the prompt. This two-stage approach combines the speed of vector search with the accuracy of cross-encoder scoring.
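The two-stage pattern can be sketched as a small function that takes both stages as callables. Neither callable is a real model here: `toy_vector_search` and `toy_rerank_score` are hypothetical stand-ins (the second uses simple word overlap in place of a cross-encoder) so the control flow can be run and inspected.

```python
def retrieve_and_rerank(query, vector_search, rerank_score, candidates=20, final_k=3):
    """Stage 1: fetch candidates cheaply. Stage 2: rescore each pair, keep final_k."""
    pool = vector_search(query, candidates)
    rescored = sorted(pool, key=lambda doc: rerank_score(query, doc), reverse=True)
    return rescored[:final_k]


docs = [
    "reset your password in settings",
    "password policy overview",
    "billing and invoices",
    "how to reset a forgotten password",
]


def toy_vector_search(query, limit):
    """Stand-in for vector search: returns the first `limit` documents."""
    return docs[:limit]


def toy_rerank_score(query, doc):
    """Stand-in for a cross-encoder: number of words shared by query and document."""
    return len(set(query.split()) & set(doc.split()))


best = retrieve_and_rerank("reset forgotten password", toy_vector_search,
                           toy_rerank_score, candidates=4, final_k=2)
print(best)
```

Because the expensive scorer only sees the candidate pool, its cost is bounded by `candidates` rather than by the size of the whole collection.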

Key Takeaway

The embedding layer turns unstructured documents into searchable knowledge. Start with nomic-embed-text, Qdrant or pgvector, and 512-token chunks with overlap. Then tune your chunking strategy and retrieval parameters based on the actual quality of answers your system produces.