How to Configure Memory for Self-Hosted Agents

Updated May 2026

Memory gives AI agents the ability to retain information across conversations, reference your documents and data, and build context over time. Without memory, every agent interaction starts from scratch. Configuring effective memory for self-hosted agents involves choosing the right memory types, deploying a vector database, setting up embedding models, and building a document ingestion pipeline.

Agent memory is not a single system but a combination of approaches, each serving different purposes. Understanding the types of memory available helps you build an architecture that matches your agents' needs.

Step 1: Choose Your Memory Architecture

Conversation buffer memory stores the raw conversation history and passes it to the model with each new message. This is the simplest approach and works well for short conversations. The limitation is context window size: as conversations grow, they eventually exceed the model's maximum context length. An 8K-token context window holds roughly 6,000 words of English text in total, and that budget is shared with the system prompt, any retrieved context, and the model's reply, so the space left for conversation history is smaller.
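The buffer approach can be sketched in a few lines of Python. This is a minimal illustration, not any platform's implementation: the BufferMemory class and the estimate_tokens helper are hypothetical names, and the four-tokens-per-three-words ratio is a rough assumption that a real tokenizer should replace.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate, assuming about 4 tokens for every 3 words."""
    words = len(text.split())
    return (words * 4 + 2) // 3


class BufferMemory:
    """Keeps the most recent messages that fit inside a token budget."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.messages: list[tuple[str, str]] = []

    def total_tokens(self) -> int:
        return sum(estimate_tokens(content) for _, content in self.messages)

    def add(self, role: str, content: str) -> None:
        self.messages.append((role, content))
        # Drop the oldest messages until the history fits the budget again.
        # The newest message is always kept, even if it is over budget alone.
        while len(self.messages) > 1 and self.total_tokens() > self.max_tokens:
            self.messages.pop(0)

    def as_prompt(self) -> str:
        return "\n".join(f"{role}: {content}" for role, content in self.messages)
```

In practice the budget would be set well below the model's context length to leave room for the system prompt and the reply.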

Conversation summary memory periodically summarizes older conversation segments and stores the summaries instead of raw messages. This compresses conversation history, allowing agents to maintain context over much longer interactions. The tradeoff is that details from earlier in the conversation may be lost in summarization.
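A minimal sketch of summary memory follows. The SummaryMemory class is hypothetical, and the summarize argument is an assumed hook that would wrap a call to your own model; nothing here reflects a specific framework's API.

```python
from typing import Callable


class SummaryMemory:
    """Keeps recent messages verbatim and folds older ones into a summary."""

    def __init__(self, summarize: Callable[[str], str], keep_recent: int = 4) -> None:
        if keep_recent < 1:
            raise ValueError("keep_recent must be at least 1")
        self.summarize = summarize  # hypothetical hook around your LLM call
        self.keep_recent = keep_recent
        self.summary = ""
        self.recent: list[str] = []

    def add(self, message: str) -> None:
        self.recent.append(message)
        if len(self.recent) > self.keep_recent:
            # Everything except the newest messages is folded into the summary.
            overflow = self.recent[:-self.keep_recent]
            self.recent = self.recent[-self.keep_recent:]
            to_compress = "\n".join([self.summary, *overflow]).strip()
            self.summary = self.summarize(to_compress)

    def context(self) -> str:
        parts = []
        if self.summary:
            parts.append(f"Summary of earlier conversation: {self.summary}")
        return "\n".join(parts + self.recent)
```

Because each pass re-summarizes the previous summary, detail loss compounds over long conversations, which is the tradeoff described above.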

Vector-based RAG memory stores information as vector embeddings in a database and retrieves relevant context through semantic search. When the agent receives a query, it searches the vector database for relevant documents, past conversations, or knowledge entries, and includes the retrieved context in the prompt. This approach scales to millions of documents and enables agents to access knowledge far beyond what fits in a context window.
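The semantic search step can be illustrated with plain Python and toy two-dimensional vectors. This sketch does an exact scan with cosine similarity; a vector database performs the same ranking at scale using indexes. The function names and the in-memory store are assumptions made for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query_vec: list[float],
    store: list[tuple[str, list[float]]],
    top_k: int = 3,
) -> list[tuple[float, str]]:
    """Return the top_k (score, text) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy store: in a real system the vectors come from an embedding model.
store = [
    ("cats", [1.0, 0.0]),
    ("dogs", [0.0, 1.0]),
    ("kittens", [0.9, 0.1]),
]
results = search([1.0, 0.0], store, top_k=2)
```

The retrieved texts would then be placed in the prompt ahead of the user's question.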

Most production agent systems combine all three: buffer memory for the current conversation, summary memory for older conversation context, and vector RAG for document retrieval and long-term knowledge. Start with one approach and add others as your needs evolve.

Step 2: Set Up a Vector Database

pgvector is the recommended starting point. It adds vector operations to PostgreSQL, which means you can store vectors alongside regular relational data in a database you may already be running. pgvector supports exact and approximate nearest neighbor search, handles millions of vectors, and requires no additional infrastructure if you already use PostgreSQL.

To deploy pgvector, use a PostgreSQL Docker image with pgvector pre-installed (such as pgvector/pgvector:pg16). Enable the extension in your database by running CREATE EXTENSION vector. Then create tables with vector columns to store your embeddings.
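As a sketch of what the setup might look like, the Python helpers below generate the SQL for a chunk table and a similarity query. The table name chunks, the column names, and the helper functions are assumptions chosen for illustration; the HNSW index assumes pgvector 0.5.0 or later.

```python
def schema_statements(table: str = "chunks", dimensions: int = 768) -> list[str]:
    """SQL to enable pgvector and create a table with a vector column."""
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "id bigserial PRIMARY KEY, content text NOT NULL, "
        f"embedding vector({dimensions}));",
        # Approximate nearest neighbor index using cosine distance.
        f"CREATE INDEX IF NOT EXISTS {table}_embedding_idx ON {table} "
        "USING hnsw (embedding vector_cosine_ops);",
    ]


def to_vector_literal(values: list[float]) -> str:
    """Format a Python list as pgvector text input, for example '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def search_query(table: str = "chunks") -> str:
    # <=> is pgvector's cosine distance operator; smaller means more similar.
    return (
        f"SELECT content, embedding <=> %s::vector AS distance "
        f"FROM {table} ORDER BY distance LIMIT %s;"
    )
```

With a driver such as psycopg, the query could be run as cur.execute(search_query(), (to_vector_literal(query_vector), 5)). The vector dimension in the table must match the embedding model you choose in Step 3.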

Qdrant is an alternative for teams that want a dedicated vector database with more advanced features. Qdrant provides filtering, payload storage, and optimized indexing algorithms out of the box. It deploys as a single Docker container and exposes a REST API. Qdrant handles larger datasets and more complex queries than pgvector but adds another service to your stack.

If you use Dify, it includes built-in vector storage and document management. You may not need to deploy a separate vector database unless you want more control over the storage layer or have requirements beyond what Dify's built-in system provides.

Step 3: Configure Embedding Models

Embedding models convert text into numerical vectors that capture semantic meaning. When you store a document chunk, the embedding model converts it to a vector. When the agent queries memory, the query is also converted to a vector, and the database finds stored vectors with similar meaning.

For self-hosted deployments, run your embedding model locally rather than calling a cloud API. This keeps your document content private and eliminates per-request costs. Ollama can serve embedding models alongside language models. Popular choices include nomic-embed-text (768 dimensions, excellent quality for its size), mxbai-embed-large (1024 dimensions, higher quality), and bge-large (1024 dimensions; BAAI publishes separate English and Chinese variants, so check which one your tag provides, and consider bge-m3 if you need multilingual retrieval).
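A minimal sketch of requesting an embedding from a local Ollama server is shown below. It assumes Ollama is running on its default port 11434 with nomic-embed-text already pulled, and it uses the /api/embeddings endpoint, which takes a single prompt; recent Ollama versions also provide a batch-capable /api/embed endpoint. The function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local Ollama port


def build_embed_request(text: str, model: str = "nomic-embed-text") -> urllib.request.Request:
    """Build the HTTP request without sending it."""
    payload = {"model": model, "prompt": text}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Send the request and return the embedding vector."""
    request = build_embed_request(text, model)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["embedding"]
```

Use the same model for documents and queries; vectors from different embedding models are not comparable.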

Embedding model dimensions affect storage size and search speed. Stored as 32-bit floats, a 768-dimension vector uses about 3 KB and a 1024-dimension vector uses about 4 KB. For a corpus of 100,000 document chunks, the difference is roughly 300 MB versus 400 MB of raw vector storage before index overhead, which is negligible on modern hardware. Choose the model with the best retrieval quality for your use case rather than optimizing for storage size.
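The storage figures can be checked with a few lines of arithmetic. This sketch assumes each dimension is stored as a 4-byte float and ignores index and row overhead; the function name is illustrative.

```python
BYTES_PER_FLOAT32 = 4


def vector_storage_bytes(dimensions: int, num_vectors: int = 1) -> int:
    """Raw storage for the vectors alone, excluding indexes and metadata."""
    return dimensions * BYTES_PER_FLOAT32 * num_vectors


for dims in (768, 1024):
    per_vector = vector_storage_bytes(dims)
    total_mb = vector_storage_bytes(dims, 100_000) / 1e6
    print(f"{dims} dims: {per_vector} bytes per vector, {total_mb:.1f} MB for 100,000 chunks")
```

This prints 3072 bytes and 307.2 MB for 768 dimensions, and 4096 bytes and 409.6 MB for 1024 dimensions.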

Step 4: Build Your Document Pipeline

The document pipeline transforms raw documents into searchable vector entries. This involves three sub-steps: loading, chunking, and embedding.

Loading extracts text from your source documents. Different document types need different loaders: PDF parsers for PDFs, HTML parsers for web pages, plain text readers for text files, and office document parsers for Word and Excel files. Dify and LangChain include loaders for common formats. For specialized formats, you may need custom loaders.

Chunking splits documents into segments suitable for embedding and retrieval. The optimal chunk size depends on your content type and retrieval needs. For general documents, chunks of 512 to 1024 tokens work well. For technical documentation with distinct sections, align chunks with document structure (headers, paragraphs) rather than arbitrary token counts. Include overlap between chunks (50 to 100 tokens) so that information at chunk boundaries is not lost.
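Fixed-size chunking with overlap can be sketched as a sliding window. The example below treats a pre-split list of tokens as input; splitting on whitespace is only a stand-in for a real tokenizer, and the function name is hypothetical.

```python
def chunk_tokens(
    tokens: list[str],
    chunk_size: int = 512,
    overlap: int = 50,
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share overlap tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


text = "a b c d e f g h i j"
chunks = chunk_tokens(text.split(), chunk_size=4, overlap=1)
# chunks: [a b c d], [d e f g], [g h i j]; each shares one token with the next
```

For structured documentation, a splitter that breaks on headers and paragraphs first, and falls back to a window like this only for oversized sections, usually preserves meaning better.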

The embedding and indexing step passes each chunk through the embedding model and stores the resulting vector along with the original text in the vector database. This step can be batch-processed for large document collections. As a rough guide, embedding 10,000 document chunks takes 5 to 15 minutes on a mid-range GPU, though throughput varies with the model, chunk length, and batch size.
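Batch processing can be sketched as follows. The embed_batch argument is an assumed callable that takes a list of texts and returns one vector per text; how it talks to your embedding server is left open, and the batch size of 64 is an arbitrary starting point.

```python
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_corpus(
    chunks: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,
) -> list[tuple[str, list[float]]]:
    """Return (text, vector) records ready to insert into the vector database."""
    records: list[tuple[str, list[float]]] = []
    for batch in batched(chunks, batch_size):
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise ValueError("embedding backend returned the wrong number of vectors")
        records.extend(zip(batch, vectors))
    return records
```

Keeping the text and vector together in each record makes it straightforward to store both in the same row or point.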

Step 5: Connect Memory to Your Agent

Configure your orchestration platform to use the vector database as a knowledge source. In Dify, this is done through the Knowledge section where you associate knowledge bases with agent applications. In LangGraph or CrewAI, you integrate retrieval tools that query the vector database and return relevant chunks.

Set retrieval parameters: the number of chunks to retrieve per query (3 to 5 is typical), the similarity threshold below which results are filtered out, and whether to use reranking to improve result relevance. Start with conservative settings and adjust based on testing.
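The first two parameters can be expressed as a small filter over scored results. The sketch below is illustrative only: RetrievalConfig and select_chunks are hypothetical names, the default threshold of 0.5 is an assumption to tune against your own data, and reranking is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalConfig:
    top_k: int = 4               # chunks to include per query
    min_similarity: float = 0.5  # assumed starting threshold; tune by testing


def select_chunks(scored: list[tuple[float, str]], config: RetrievalConfig) -> list[str]:
    """Keep results at or above the threshold, best first, capped at top_k."""
    kept = [(score, text) for score, text in scored if score >= config.min_similarity]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept[:config.top_k]]
```

Returning an empty list when nothing clears the threshold lets the agent answer without retrieved context rather than being steered by irrelevant chunks.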

For conversation memory, configure the memory backend in your orchestration platform. Most platforms store conversation history in their application database by default. To confirm that history persists across sessions, start a conversation, close the browser, and reopen the application to check that the earlier messages are still there.

Test the complete memory pipeline end to end. Upload a document, wait for processing to complete, then ask the agent a question that requires information from that document. The agent should retrieve relevant context and incorporate it into its response. If responses lack document context, check the embedding pipeline logs, verify the vector database contains the expected entries, and test retrieval queries directly against the database.

Key Takeaway

Effective agent memory combines conversation history for session context with vector-based RAG for document retrieval. Start with pgvector for storage and a local embedding model like nomic-embed-text, then expand to more specialized solutions as your document corpus and agent complexity grow.

