How AI Agent Memory Systems Work
The Memory Loop: A Four-Stage Cycle
The defining idea behind any memory system is that the model does not remember anything on its own. A large language model takes text in and produces text out, retaining nothing between requests. So memory has to be built around the model as a loop that captures information during one interaction and feeds it back during a later one. The loop has four stages, and they always occur in the same order: write, store, retrieve, inject.
These stages divide cleanly into two phases that happen at different times. Writing and storing happen after an interaction, when the system decides what was worth keeping and commits it to durable storage. Retrieving and injecting happen before and during the next interaction, when the system pulls relevant past information back into the context window. The gap between these two phases, which may be seconds or months, is exactly the gap that memory bridges. How working memory in the context window relates to the durable store on the other side of this loop is covered in the types of agent memory.
Stage One: Deciding What to Write
The loop begins with a decision that quietly determines the quality of everything downstream: what is worth remembering. The naive approach stores every message verbatim, which feels safe but quickly fills the store with low-value content. Casual remarks, restated questions, and filler crowd out the genuinely useful facts, and because retrieval has to compete against all that noise, an indiscriminate store actually makes the agent worse at recall, not better.
Stronger systems extract before they store. Rather than keeping a raw transcript, they distill each interaction into a few clean, durable memory entries: the stable facts, the explicit preferences, the confirmed outcomes, and the corrections. Often the language model itself performs this extraction, reading the interaction and summarizing what should be remembered into concise statements. This is the difference between a memory that records that a user said many things and a memory that records that a user prefers email over phone contact. The first is a transcript; the second is knowledge. Choosing well what to write is the single highest-leverage decision in the entire pipeline.
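The extraction step described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the prompt wording, the extract_memories function, and the fake_llm stand-in are all assumptions made for this example, and a real system would pass in a call to an actual model.

```python
from typing import Callable, List

# Hypothetical prompt; the exact wording is an assumption for illustration.
EXTRACTION_PROMPT = (
    "Read the conversation below and list only durable facts, explicit "
    "preferences, confirmed outcomes, and corrections about the user. "
    "Write one short statement per line. If nothing is worth keeping, "
    "reply with NONE.\n\nConversation:\n{transcript}"
)


def extract_memories(transcript: str, llm: Callable[[str], str]) -> List[str]:
    """Distill a raw transcript into concise memory statements."""
    reply = llm(EXTRACTION_PROMPT.format(transcript=transcript))
    memories = []
    for line in reply.splitlines():
        # Drop list bullets and surrounding whitespace from each line.
        statement = line.strip().lstrip("-* ").strip()
        if statement and statement.upper() != "NONE":
            memories.append(statement)
    return memories


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs without any API."""
    return "- User prefers email over phone contact\n- User is vegetarian\n"


transcript = "User: hi there\nUser: please email me, don't call\nUser: I'm vegetarian"
memories = extract_memories(transcript, fake_llm)
print(memories)
```

Passing the model in as a plain callable keeps the extraction logic testable and makes it easy to swap in a different model or prompt later.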
Stage Two: Storing Memory for Retrieval
Once the system knows what to keep, it must store it in a form that makes later retrieval fast and accurate. The dominant technique converts each memory into an embedding, a numeric vector that captures the meaning of the text, and stores that vector alongside the original text and metadata such as a timestamp, a source, and the user it belongs to. The embedding is what enables semantic search, letting the system later find memories by meaning rather than exact wording, a mechanism detailed in embedding models for agent memory.
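As a rough sketch of what one stored entry might look like, the record below keeps the text, its vector, and the metadata together. The MemoryRecord class and its field names are hypothetical, and toy_embed is a deliberately crude stand-in (a hashed bag of words) for a real embedding model: it reflects word overlap only, not meaning.

```python
import hashlib
import math
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

DIM = 64  # toy dimensionality; real embedding models use far more


def toy_embed(text: str) -> List[float]:
    """Stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class MemoryRecord:
    text: str      # the original statement, kept for injection later
    user_id: str   # whose memory this is
    source: str    # where the fact came from
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    embedding: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Embed at write time so retrieval never has to re-embed stored text.
        if not self.embedding:
            self.embedding = toy_embed(self.text)


record = MemoryRecord(
    text="User prefers email over phone contact",
    user_id="user-123",
    source="chat",
)
print(len(record.embedding), record.user_id, record.created_at.isoformat())
```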
Most serious systems store memory in more than one representation, because no single structure serves every kind of recall. A vector store handles semantic similarity, a structured database handles exact lookups and filtering by fields like user or date, and a knowledge graph handles relationships between entities. The metadata stored with each memory is as important as the memory itself, since it is what lets the system later filter to the right user, prefer recent entries, or trace where a fact came from. Good storage is not just dumping text into a database; it is encoding each memory so that the retrieval stage can find it precisely when it is needed.
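To make the idea of several representations concrete, here is a small in-memory sketch of the structured and graph views sitting side by side (the vector view is left out to keep it short). The MultiStore class, its method names, and the sample dates are all invented for illustration; a production system would typically back each view with a dedicated database.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Memory:
    memory_id: str
    user_id: str
    text: str
    created_on: str  # ISO date string, so plain string comparison orders by date


class MultiStore:
    def __init__(self) -> None:
        self.by_id: Dict[str, Memory] = {}
        # Structured view: exact lookup of a user's memories.
        self.by_user: Dict[str, List[str]] = defaultdict(list)
        # Graph view: (subject, relation, object) triples between entities.
        self.edges: Set[Tuple[str, str, str]] = set()

    def add(self, memory: Memory,
            triples: Iterable[Tuple[str, str, str]] = ()) -> None:
        self.by_id[memory.memory_id] = memory
        self.by_user[memory.user_id].append(memory.memory_id)
        self.edges.update(triples)

    def for_user(self, user_id: str, since: str = "") -> List[Memory]:
        """Filter by user and, optionally, by a minimum date."""
        ids = self.by_user.get(user_id, [])
        return [self.by_id[i] for i in ids if self.by_id[i].created_on >= since]

    def related(self, subject: str, relation: str) -> List[str]:
        """Follow one relationship type outward from an entity."""
        return sorted(o for s, r, o in self.edges
                      if s == subject and r == relation)


store = MultiStore()
store.add(Memory("m1", "user-1", "User is vegetarian", "2024-01-15"),
          triples=[("user-1", "follows_diet", "vegetarian")])
store.add(Memory("m2", "user-1", "User prefers email over phone contact",
                 "2024-03-02"))
recent = store.for_user("user-1", since="2024-02-01")
diets = store.related("user-1", "follows_diet")
print([m.text for m in recent], diets)
```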
Stage Three: Retrieving the Right Memories
Retrieval is where the loop earns its value and also where most of the difficulty lives. When a new task arrives, the system must search a store that may hold thousands or millions of entries and return the handful that will genuinely help with the task at hand. It typically does this by embedding the incoming query and finding the stored vectors closest to it, frequently combined with keyword matching for exact terms and metadata filters to restrict the search to the right user or time range.
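The retrieval step can be sketched as a metadata filter followed by a blended score. In the example below, the three-dimensional vectors are hand-written stand-ins for real embeddings, and the retrieve function, the keyword_weight parameter, and the entry layout are assumptions for illustration rather than any particular library's API.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def keyword_overlap(query: str, text: str) -> float:
    """Fraction of query words that appear verbatim in the memory text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def retrieve(query_text: str, query_vec: Sequence[float],
             entries: List[Dict], user_id: str,
             k: int = 2, keyword_weight: float = 0.3) -> List[str]:
    scored = []
    for entry in entries:
        if entry["user_id"] != user_id:  # metadata filter comes first
            continue
        score = ((1 - keyword_weight) * cosine(query_vec, entry["vec"])
                 + keyword_weight * keyword_overlap(query_text, entry["text"]))
        scored.append((score, entry["text"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hand-made vectors; the first axis loosely stands for "food".
entries = [
    {"user_id": "u1", "text": "User is vegetarian", "vec": [0.9, 0.1, 0.0]},
    {"user_id": "u1", "text": "User prefers email over phone contact",
     "vec": [0.0, 0.2, 0.9]},
    {"user_id": "u1", "text": "User lives in Lisbon", "vec": [0.1, 0.9, 0.1]},
    {"user_id": "u2", "text": "User loves steak dinner", "vec": [1.0, 0.0, 0.0]},
]

top = retrieve("suggest a dinner recipe", [1.0, 0.1, 0.0], entries,
               user_id="u1", k=2)
print(top)
```

Note that the other user's entry is never scored, even though its vector and wording are the closest match to the query: the filter on user_id runs before any similarity is computed.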
The goal of retrieval is to maximize relevance while respecting a strict budget on how much can be pulled back. Return too little and the agent misses knowledge it actually has; return too much and the useful memories drown among the marginal ones while cost and latency rise. Many systems add a reranking step, scoring the initial candidates with a more capable model and keeping only the best, which sharply improves precision. The full range of approaches, from pure keyword search to dense vector search to hybrid combinations, is compared in memory retrieval strategies, and the dense-search foundation is covered in vector search.
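A reranking pass under a fixed budget might look like the sketch below. The rerank function and its keep and min_score parameters are hypothetical names, and toy_scorer is only a word-overlap stand-in for the more capable model mentioned above (in practice this is often a cross-encoder or an LLM-based scorer).

```python
from typing import Callable, List


def rerank(query: str, candidates: List[str],
           scorer: Callable[[str, str], float],
           keep: int = 3, min_score: float = 0.5) -> List[str]:
    """Rescore first-pass candidates and keep only the best few."""
    scored = [(scorer(query, c), c) for c in candidates]
    # Drop marginal candidates outright instead of padding the budget.
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:keep]]


def toy_scorer(query: str, candidate: str) -> float:
    """Stand-in relevance model: fraction of query words found in the candidate."""
    q = set(query.lower().split())
    c = set(candidate.lower().split())
    return len(q & c) / len(q) if q else 0.0


candidates = [
    "User is vegetarian",
    "User prefers email over phone contact",
    "User asked for a dinner recipe last week",
]
best = rerank("vegetarian dinner recipe", candidates, toy_scorer,
              keep=2, min_score=0.3)
print(best)
```

The threshold matters as much as the cutoff: returning fewer than keep results is preferable to filling the remaining slots with weak matches.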
Stage Four: Injecting Memory into Context
The final stage takes the retrieved memories and places them into the model's context window before it generates a response. This is the moment the loop closes: information written in a past session is now back in working memory, available to shape the current answer. Injection usually means formatting the retrieved memories into a clear block of text, often under a heading like known facts or relevant history, and adding it to the prompt alongside the system instructions and the current conversation.
Injection is constrained by the size of the context window, so the system must be disciplined about how much it adds. Every token of memory injected is a token unavailable for instructions, conversation, or the model's own reasoning, and beyond a certain point, adding more memory degrades quality by burying the important details among the marginal ones. The best systems inject a tight, well-ordered set of the most relevant memories rather than everything retrieved, and they format it so the model can tell what is established fact, what is recent history, and what is the current task. How much to inject, and how it relates to overall context limits, is explored in how much memory agents need.
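One way to keep injection disciplined is to spend a fixed token budget on memories in priority order and label each section of the prompt. The sketch below assumes the inputs are already sorted most relevant first; the function names, the section headings, and the four-characters-per-token estimate are illustrative assumptions, and a real system would count tokens with the model's own tokenizer.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Very rough heuristic (about 4 characters per token); an assumption."""
    return max(1, len(text) // 4)


def select_within_budget(memories: List[str], budget_tokens: int) -> List[str]:
    """Keep memories in order until the next one would exceed the budget."""
    kept, used = [], 0
    for memory in memories:
        cost = estimate_tokens(memory)
        if used + cost > budget_tokens:
            break
        kept.append(memory)
        used += cost
    return kept


def build_prompt(system: str, facts: List[str], history: List[str],
                 task: str, budget_tokens: int = 40) -> str:
    # Established facts get first claim on the budget, then recent history.
    facts = select_within_budget(facts, budget_tokens)
    remaining = budget_tokens - sum(estimate_tokens(f) for f in facts)
    history = select_within_budget(history, remaining)

    sections = [system]
    if facts:
        sections.append("Known facts:\n" + "\n".join(f"- {f}" for f in facts))
    if history:
        sections.append("Relevant history:\n"
                        + "\n".join(f"- {h}" for h in history))
    sections.append("Current task:\n" + task)
    return "\n\n".join(sections)


tight = build_prompt(
    system="You are a helpful assistant.",
    facts=["User is vegetarian"],
    history=["In March the user asked for a quick pasta recipe"],
    task="Suggest a dinner recipe",
    budget_tokens=10,
)
print(tight)
```

With a budget of 10 estimated tokens, the fact fits and the history entry is dropped; raising the budget lets both through. Separate headings let the model distinguish established facts from history and from the current task.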
Putting the Loop Together: A Worked Example
Consider a user who tells a personal assistant agent, in January, that they are vegetarian. During the write stage, the system extracts the durable fact that the user is vegetarian rather than storing the whole chat. During storage, it embeds that fact, tags it with the user's identity and the date, and saves it to the vector store. Months pass and the original conversation is long gone from any context window.
In June, the same user asks the agent to suggest a dinner recipe. Before answering, the retrieval stage embeds the request, searches the user's memories, and surfaces the stored fact that they are vegetarian, because it is semantically related to a request about food. The injection stage adds that fact to the prompt, and the model, now aware of the preference, recommends a vegetarian recipe without the user having to repeat themselves. From the user's perspective the agent simply remembered, but underneath, all four stages ran exactly as designed. This same loop, scaled up with better extraction, multiple stores, and smarter retrieval, is what powers every agent that feels like it knows you, and it is closely related to the broader pattern of retrieval-augmented generation.
An agent memory system is a four-stage loop: write the information worth keeping, store it in a retrieval-ready form, retrieve the most relevant pieces when a new task arrives, and inject them into the context window. Writing and storing happen after an interaction; retrieving and injecting happen before and during the next one. The gap between those phases is exactly what memory bridges, turning a stateless model into an agent that appears to remember.