Embedding Models for AI Agent Memory
What an Embedding Model Does
An embedding model is a neural network trained to map text into a fixed-length vector that captures meaning. Feed it a word, a sentence, or a paragraph, and it returns a list of numbers of the same length every time, with the crucial property that texts with similar meanings produce vectors that are close together. This is what makes semantic search possible: by comparing vectors instead of words, an agent can find a memory about ending a recurring plan when the user asks about cancelling a subscription, because the model placed both phrases near each other in its vector space.
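One common way to measure how close two vectors are is cosine similarity, although the appropriate metric depends on the model. The sketch below is a minimal illustration of that idea. The three-number vectors are hand-made stand-ins chosen for the example, not the output of any real embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up toy vectors standing in for real model output (assumption).
cancel_subscription = [0.9, 0.1, 0.2]
end_recurring_plan = [0.8, 0.2, 0.3]
weather_forecast = [0.1, 0.9, 0.1]

related = cosine_similarity(cancel_subscription, end_recurring_plan)
unrelated = cosine_similarity(cancel_subscription, weather_forecast)
print(related > unrelated)
```

With these toy values the two subscription-related phrases score much higher than the unrelated one, which is the behavior a trained model is meant to produce on real text.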
The model learns this behavior during training by being shown enormous quantities of text and adjusting its parameters so that related passages map to nearby vectors and unrelated ones map far apart. The result is a compact, general-purpose meaning detector. In an agent memory system, the embedding model runs at two moments: when a memory is stored, to produce the vector that gets saved, and when a query arrives, to produce the vector that gets compared against the store. Those two uses must agree, which is why the same model has to handle both. The downstream search itself is covered in vector search.
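The two moments described above can be sketched as a small in-memory store that routes both storage and queries through a single embedding function. Everything here is illustrative: `MemoryStore` is a hypothetical class, and `toy_embed` is a letter-counting stand-in that carries no real semantics; a real system would call an actual embedding model in its place.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for a real model: a 26-number letter-count vector."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryStore:
    """Illustrative store that uses one embedding function for both paths."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self._embed = embed
        self._items: List[Tuple[str, Vector]] = []

    def add(self, text: str) -> None:
        # Storage time: embed the memory and save the vector with its text.
        self._items.append((text, self._embed(text)))

    def search(self, query: str, k: int = 1) -> List[str]:
        # Query time: embed the query with the same function, then rank.
        query_vector = self._embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: cosine(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = MemoryStore(toy_embed)
store.add("cancel my subscription")
store.add("weather is sunny today")
top = store.search("cancel subscription", k=1)
print(top)
```

Passing the embedding function into the constructor is one way to make it hard to embed memories and queries with different models by accident.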
Dimensions, Context Length, and Other Key Properties
Embedding models differ along a few properties that directly affect memory quality and cost. The first is dimensionality, the length of the output vector, often ranging from a few hundred to a few thousand numbers. Higher-dimensional embeddings can capture finer distinctions in meaning, but they take more storage and make each similarity comparison more expensive, since both grow roughly in proportion to the vector length. For most agent memory, mid-sized dimensions strike a good balance, and some modern models even let you shorten their vectors to trade a little accuracy for lower storage and faster search.
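To make the storage side of that tradeoff concrete, here is a rough back-of-the-envelope sketch. It assumes each number is stored as a 4-byte float and ignores index overhead. The `shorten` helper shows one common way vectors are shortened, by keeping the leading values and rescaling to unit length; this is only meaningful for models that were trained to support it, so treat it as an assumption to check against your model's documentation. The dimension counts are illustrative.

```python
import math
from typing import List


def storage_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage, assuming 4-byte float values and no index overhead."""
    return num_vectors * dims * bytes_per_value


def shorten(vector: List[float], dims: int) -> List[float]:
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]


# One million memories at two illustrative vector lengths.
full = storage_bytes(1_000_000, 1536)   # 6_144_000_000 bytes
short = storage_bytes(1_000_000, 512)   # 2_048_000_000 bytes
print(full, short)
```

Under these assumptions, cutting the vector length to a third cuts raw storage to a third as well.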
The second property is the maximum input length, how much text the model can embed at once. A model with a short limit forces you to split long memories into smaller pieces before embedding, while one with a longer limit can embed a whole document section in a single vector. The third is the domain the model was trained on: a model trained mostly on general web text may embed specialized legal, medical, or code content less precisely than one tuned for that domain. Matching the model's strengths to the kind of text your agent stores is what keeps recall sharp on the content that matters most.
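Splitting a long memory to fit a short input limit might look like the sketch below. Real limits are usually expressed in model tokens rather than words, so counting whitespace-separated words here is a simplifying assumption; the `overlap` parameter, which repeats a few words between neighbouring chunks, is likewise an illustrative choice rather than a requirement.

```python
from typing import List


def split_into_chunks(text: str, max_words: int, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most `max_words` whitespace-separated words.

    Consecutive chunks share `overlap` words so context is not cut abruptly.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in the range [0, max_words)")
    words = text.split()
    chunks: List[str] = []
    step = max_words - overlap
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reached the end of the text
        start += step
    return chunks


memory = "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9"
print(split_into_chunks(memory, max_words=4))
print(split_into_chunks(memory, max_words=4, overlap=1))
```

Each chunk is then embedded and stored as its own vector, so a query can match the relevant piece of a long memory instead of a diluted whole.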
Choosing an Embedding Model: API or Open Source
The first major decision is whether to call a hosted embedding model through an API or run an open-source model yourself. API models are the simplest path: a provider hosts a high-quality model, you send text and receive vectors, and you never manage any infrastructure. They tend to offer excellent quality and require no setup, at the cost of a per-call fee and the need to send your text to a third party, which can be a problem for sensitive data. They also mean your memory system depends on that provider remaining available and stable.
Open-source embedding models run on your own hardware, keeping all text local and eliminating per-call fees, which matters enormously when you are embedding millions of memories. Modern open models are strong enough that the quality gap with hosted options is often small for general text. The tradeoff is operational: you provide the compute, manage the model, and handle scaling yourself. This choice mirrors the broader decision between local and cloud memory, and the two often go together, with local memory paired with a local embedding model and cloud memory paired with a hosted one. The hands-on setup is walked through in how to configure embedding models.
Because the best model depends on your content, the reliable way to choose is to evaluate candidates on your own data rather than trusting a public leaderboard. Assemble a small set of realistic queries paired with the memories that should be retrieved for each, run every candidate model over it, and measure how often the correct memory appears near the top of the results. General benchmarks that rank embedding models across many tasks are a useful way to build a shortlist, but they average over domains that may look nothing like yours, and a model that tops a broad leaderboard can underperform a humbler one on your specific text. A few hours spent building this evaluation pays for itself many times over, because the embedding model is costly to change later and quietly sets the quality of every recall the agent will ever perform.
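The evaluation described above can be sketched as a recall-at-k measurement. This is a minimal version under a few assumptions: each query has exactly one expected memory, memories are identified by made-up ids such as `m1`, and each candidate model is represented by a hypothetical `search` function that returns memory ids ranked best first. The two candidates below are hard-coded rankings used only to show the mechanics.

```python
from typing import Callable, Dict, Sequence


def recall_at_k(
    eval_set: Dict[str, str],
    search: Callable[[str], Sequence[str]],
    k: int,
) -> float:
    """Fraction of queries whose expected memory id appears in the top k results."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = sum(
        1
        for query, expected_id in eval_set.items()
        if expected_id in list(search(query))[:k]
    )
    return hits / len(eval_set)


# Realistic queries paired with the id of the memory that should come back.
eval_set = {
    "how do I cancel my subscription": "m1",
    "what is my shipping address": "m2",
}

# Hypothetical rankings standing in for two candidate models (assumption).
rankings_a = {
    "how do I cancel my subscription": ["m1", "m3"],
    "what is my shipping address": ["m3", "m2"],
}
rankings_b = {
    "how do I cancel my subscription": ["m3", "m1"],
    "what is my shipping address": ["m3", "m4"],
}

score_a = recall_at_k(eval_set, lambda q: rankings_a[q], k=2)
score_b = recall_at_k(eval_set, lambda q: rankings_b[q], k=2)
print(score_a, score_b)
```

In practice each `search` function would embed the query with one candidate model and rank the stored memories embedded by that same model, and the evaluation set would hold dozens of queries rather than two.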
Why Consistency Matters: The Same Model Everywhere
One rule overrides almost every other consideration: you must embed your stored memories and your queries with the same model. Embeddings from different models live in different spaces, so a query vector from one model and a memory vector from another are not comparable, and similarity scores between them are meaningless. Mixing models silently destroys retrieval quality, and because the system still returns results, the failure is easy to miss until recall is mysteriously poor.
This rule has a sharp consequence for changing models. If you decide to switch to a better embedding model later, you cannot simply use it for new memories, because the old vectors were produced by the old model and will not be comparable with the new ones. You must re-embed your entire store with the new model, a process called reindexing, which can be time-consuming and costly for a large memory store. For this reason, the embedding model is one of the stickier choices in a memory system, and it is worth selecting carefully up front rather than planning to swap it casually. Treating the embedding model as a long-term commitment, versioned alongside the data it produced, avoids painful surprises.
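One way to version vectors alongside the model that produced them is to tag every record with a model identifier and re-embed the whole store when the model changes. The sketch below assumes the original text is kept next to each vector, which reindexing requires. `VersionedStore`, the model ids, and the two embedding functions are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class Record:
    text: str        # original text, kept so it can be re-embedded later
    vector: Vector
    model_id: str    # which embedding model produced `vector`


class VersionedStore:
    """Illustrative store that records the model behind every vector."""

    def __init__(self, model_id: str, embed: Callable[[str], Vector]) -> None:
        self.model_id = model_id
        self._embed = embed
        self.records: List[Record] = []

    def add(self, text: str) -> None:
        self.records.append(Record(text, self._embed(text), self.model_id))

    def stale_records(self) -> List[Record]:
        """Records whose vectors came from a model other than the current one."""
        return [r for r in self.records if r.model_id != self.model_id]

    def reindex(self, new_model_id: str, new_embed: Callable[[str], Vector]) -> None:
        """Re-embed every stored text with the new model, then switch over."""
        self.records = [
            Record(r.text, new_embed(r.text), new_model_id) for r in self.records
        ]
        self.model_id = new_model_id
        self._embed = new_embed


def embed_v1(text: str) -> Vector:
    return [float(len(text)), 0.0]          # hypothetical 2-number model


def embed_v2(text: str) -> Vector:
    return [float(len(text)), 1.0, 0.0]     # hypothetical 3-number model


store = VersionedStore("model-v1", embed_v1)
store.add("user prefers email")
store.add("user cancelled the plan")
store.reindex("model-v2", embed_v2)
print([r.model_id for r in store.records])
```

Checking `stale_records()` before serving queries gives a cheap guard against silently mixing vectors from two models.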
Practical Tradeoffs: Cost, Speed, and Quality
Choosing an embedding model in practice means balancing three forces. Quality is how faithfully the model captures meaning, which sets the upper bound on recall. Cost includes both the price of generating embeddings, especially at the scale of millions of memories, and the storage cost of the vectors, which grows with their dimensionality. Speed covers how quickly the model can embed text, which matters most at query time, when a user is waiting for the agent to respond and an embedding has to be computed before the search can even begin.
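The generation side of the cost can be estimated with simple arithmetic. The sketch below assumes a provider that bills per token, and the price and average memory length are made-up placeholders, so substitute your own figures; billing schemes differ between providers.

```python
def embedding_cost(
    num_memories: int,
    avg_tokens_per_memory: int,
    price_per_million_tokens: float,
) -> float:
    """One-time cost of embedding a store, assuming per-token billing."""
    total_tokens = num_memories * avg_tokens_per_memory
    return total_tokens / 1_000_000 * price_per_million_tokens


# Placeholder numbers: one million memories of about 200 tokens each,
# at a hypothetical price of 0.10 per million tokens.
one_time = embedding_cost(1_000_000, 200, 0.10)
print(one_time)
```

The same figure is paid again whenever the store is re-embedded with a different model, which is part of why the model choice is expensive to revisit.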
The right balance depends on the application. An agent embedding a huge archive of documents cares most about cost and may favor an efficient model run in batches. An agent embedding short user queries in real time cares most about latency. An agent in a specialized domain cares most about quality on that domain's language. Because these forces pull in different directions, there is no single best embedding model, only the best one for a given workload, and the choice should follow from what your agent actually stores and how fast it must respond. The same embeddings power not just memory but any retrieval-augmented generation the agent performs.
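For the archive-heavy workload mentioned above, batching might look like the following sketch. It assumes the model exposes some call that accepts a list of texts and returns one vector per text; `fake_embed_batch` is a hypothetical stand-in for that call, and the batch size is arbitrary.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def batches(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_archive(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
    batch_size: int,
) -> List[Vector]:
    """Embed all texts batch by batch, preserving the input order."""
    vectors: List[Vector] = []
    for batch in batches(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed_batch(batch: Sequence[str]) -> List[Vector]:
    """Hypothetical model call: returns one 1-number vector per text."""
    return [[float(len(text))] for text in batch]


vectors = embed_archive(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_batch, batch_size=2)
print(vectors)
```

A latency-sensitive agent would skip this path at query time and embed the single incoming query directly.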
An embedding model turns text into the meaning-carrying vectors an agent's memory searches over, so its quality sets the ceiling on recall. Choose based on dimensionality, input length, domain fit, and whether to call a hosted API or run an open model locally, and above all use one model consistently for both storing and querying, since mixing models or switching without reindexing silently breaks retrieval.