How to Set Up a Vector Database for RAG

Updated May 2026
The vector database is the storage and retrieval engine at the center of every RAG system. It holds your document embeddings, performs similarity search against incoming queries, and returns the most relevant chunks for the language model to use as context. Setting it up correctly affects retrieval speed, accuracy, and the overall quality of generated responses.

This guide covers the practical setup process for the most widely used vector databases in RAG applications. Whether you choose a managed service or a self-hosted solution, the core steps are the same: create a collection configured for your embedding model, load your vectors with metadata, and configure search to return the best results.

Step 1: Choose a Vector Database

Your choice depends on three factors: scale, operational preference, and existing infrastructure. For teams that want zero infrastructure management, Pinecone provides a fully managed service with automatic scaling, backups, and monitoring. You create an index through the web console or API and start inserting vectors immediately. The tradeoff is cost at scale and less control over the underlying infrastructure.

For teams comfortable with self-hosting, Qdrant and Weaviate are the two leading open-source options. Qdrant offers excellent performance with a Rust-based engine, rich filtering capabilities, and straightforward Docker deployment. Weaviate provides a broader feature set including built-in vectorization modules, multi-tenancy support, and a GraphQL query interface. Both run well on a single server for smaller deployments and support distributed clustering for larger ones.

If you already run PostgreSQL, pgvector adds vector search as an extension without introducing a new database. This approach minimizes operational complexity since your vectors live alongside your relational data, but it has performance limitations at very high scale compared to purpose-built vector databases. For RAG knowledge bases under 1 million vectors, pgvector performs well and simplifies the architecture considerably.

Chroma is the simplest option for prototyping. It runs in-process as a Python library with no server needed, keeps data in memory or on local disk, and requires almost no configuration. Use Chroma for development and testing, then migrate to a production database when you need stronger durability guarantees, concurrent access, or scale beyond a single machine.

Step 2: Install and Configure the Database

Pinecone (managed): Create an account, generate an API key, and install the client library. No server setup is required. Pass the API key when you initialize the client; older client versions also expected an environment name, so check the documentation for the version you install. Pinecone handles all infrastructure, replication, and scaling automatically.

Qdrant (self-hosted): Deploy using Docker with a single command. The default configuration works for development. For production, configure the storage path to a persistent volume, set an API key for authentication, enable TLS for encrypted connections, and adjust the WAL (write-ahead log) settings based on your write patterns. Qdrant also offers a managed cloud service if you prefer not to operate the infrastructure yourself.

Weaviate (self-hosted): Deploy using Docker Compose with the provided configuration templates. Enable the modules you need: text2vec for built-in vectorization, or configure external vectorization if you generate embeddings separately. Set authentication, configure persistence, and adjust memory limits based on your expected index size.

pgvector: Install the extension on your existing PostgreSQL instance. Create the extension in your database, then create a table with a vector column. The setup is minimal since it extends infrastructure you already operate, and it uses the same connection, authentication, and backup systems as the rest of your PostgreSQL data.

Step 3: Create a Collection with the Right Schema

Most vector databases require you to define the vector dimensions when creating a collection (or index), and the others infer them from the first vector you insert. Either way, the dimension must match your embedding model output exactly. OpenAI text-embedding-3-small produces 1536-dimensional vectors by default. BGE-M3 produces 1024 dimensions. Using the wrong dimension count causes insertion errors or silently degrades search quality.
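
A cheap guard against a dimension mismatch is to check every vector before it reaches the database. The sketch below is illustrative only: EXPECTED_DIM and validate_embedding are hypothetical names, and 1536 is assumed from the text-embedding-3-small example above.

```python
EXPECTED_DIM = 1536  # assumed: must equal the dimension the collection was created with


def validate_embedding(vector, expected_dim=EXPECTED_DIM):
    """Return the vector unchanged, or raise ValueError if its length is wrong."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, "
            f"collection expects {expected_dim}"
        )
    return vector
```

Calling this in the ingestion path turns a possible silent mismatch into an immediate, readable error.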

Choose the distance metric that matches your embedding model. Most modern embedding models are trained with cosine similarity, which measures the angle between vectors regardless of magnitude. Some models use dot product (inner product) similarity, which also considers vector magnitude. Check your embedding model documentation for the recommended metric. Cosine similarity is the safe default if the documentation does not specify.
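
The difference between the two metrics is easy to see on a toy example. This is a minimal pure-Python sketch, not how a database computes similarity internally; the helper names and the two sample vectors are made up for illustration.

```python
import math


def dot(a, b):
    """Dot (inner) product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

print(cosine_similarity(a, b))           # depends on direction only
print(dot(a, b))                         # grows with vector length
print(dot(normalize(a), normalize(b)))   # agrees with cosine once both are unit length
```

If your embedding model already returns unit-length vectors, the two metrics rank results identically, so the choice matters most for models that do not normalize their output.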

Define metadata fields for the attributes you want to filter on during search. Common metadata fields for RAG include source document path, document title, section heading, page number, chunk index, creation date, and any domain-specific tags. In Qdrant and Weaviate, you can define typed payload fields that support efficient filtering. In pgvector, metadata lives in regular PostgreSQL columns alongside the vector column, giving you full SQL filtering capabilities.

If your database supports multiple index types, choose based on your accuracy and speed requirements. HNSW (Hierarchical Navigable Small World) is the most common index type for vector search, offering a good balance of speed and recall. Configure the HNSW parameters: ef_construction controls index build quality (higher values produce better indexes but take longer to build), and m controls the number of connections per node (higher values improve recall but increase memory usage). Start with the defaults and tune after measuring baseline performance.

Step 4: Index Your Embeddings

Generate embeddings for all your document chunks using your chosen embedding model, then insert them into the vector database. Each record consists of a unique ID, the embedding vector, and the associated metadata. Include the original text content in the metadata so the retriever can pass it directly to the generator without a separate lookup.
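
As a rough picture of what one record looks like before it is sent to the database, consider the sketch below. ChunkRecord and its field names are hypothetical, the sample values are invented, and each database client has its own record type (Qdrant, for instance, calls the metadata a payload).

```python
from dataclasses import dataclass, field


@dataclass
class ChunkRecord:
    id: str                      # unique per chunk
    vector: list[float]          # embedding; length must match the collection
    metadata: dict = field(default_factory=dict)


record = ChunkRecord(
    id="handbook.pdf#12",
    vector=[0.12, -0.03, 0.41],  # shortened for readability (illustrative values)
    metadata={
        "text": "Refunds are processed within 14 days.",  # original chunk text
        "source": "handbook.pdf",
        "page": 7,
        "chunk_index": 12,
    },
)
```

Keeping the chunk text under a predictable metadata key lets the retriever hand results to the generator without a second lookup.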

Use batch upsert operations rather than inserting vectors one at a time. Batching reduces network overhead and allows the database to optimize its internal indexing operations. Most databases support batches of 100 to 1000 vectors per request. Start with batches of 100 and increase if throughput is a bottleneck.
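
Batching can be sketched independently of any particular client. In the example below, upsert_batch is a placeholder for whatever bulk-insert call your database client provides; batched and upsert_all are hypothetical helper names.

```python
from itertools import islice


def batched(items, batch_size=100):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def upsert_all(records, upsert_batch, batch_size=100):
    """Send records in batches and return how many were sent.

    upsert_batch is a placeholder for your client's bulk upsert call.
    """
    sent = 0
    for batch in batched(records, batch_size):
        upsert_batch(batch)
        sent += len(batch)
    return sent
```

With 250 records and a batch size of 100, this issues three requests (100, 100, and 50 records) instead of 250.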

For large knowledge bases (over 100,000 chunks), parallelize the embedding and insertion process. Generate embeddings in parallel batches using multiple API calls or GPU workers, then insert the results using multiple concurrent database connections. Monitor the insertion rate and adjust concurrency to avoid overwhelming the database or hitting API rate limits.
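
One way to parallelize the pipeline is a bounded thread pool, sketched below under the assumption that both embedding and insertion are network-bound calls. The embed and upsert arguments are placeholders for your own functions, and max_workers is the knob to lower if you hit rate limits or overload the database.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(batch, embed, upsert):
    """Embed one batch of chunk texts and insert the results."""
    vectors = embed(batch)              # one embedding request per batch
    upsert(list(zip(batch, vectors)))   # one insert request per batch
    return len(batch)


def index_in_parallel(batches, embed, upsert, max_workers=4):
    """Process batches concurrently; returns the number of chunks indexed."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_batch, b, embed, upsert) for b in batches]
        # result() re-raises any exception from the worker thread
        return sum(f.result() for f in futures)
```

Threads suit this workload because the time is spent waiting on network responses; CPU-bound local embedding on a GPU would need a different arrangement.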

After indexing is complete, verify the collection by running a few test queries with known relevant documents. Check that the expected documents appear in the results and that similarity scores are reasonable. This basic sanity check catches common issues like mismatched dimensions, incorrect metadata mapping, or embedding model configuration errors before they affect the full pipeline.
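
The sanity check can be automated with a handful of query and expected-document pairs. In this sketch, search is a placeholder for a function that wraps your database query and returns a list of IDs; find_missing is a hypothetical helper name.

```python
def find_missing(search, known_pairs, k=5):
    """Return the (query, expected_id) pairs whose document is not in the top k.

    search(query, k) is a placeholder that should return a list of result IDs.
    """
    misses = []
    for query, expected_id in known_pairs:
        if expected_id not in search(query, k):
            misses.append((query, expected_id))
    return misses
```

An empty return value means every known document was retrieved; anything else points at the queries worth investigating first.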

Step 5: Configure Search Parameters

The most important search parameter is top-k, which controls how many results the retriever returns. Start with k=5 for most RAG applications. Too few results may miss relevant information. Too many results dilute the context with irrelevant chunks and waste context window space. Tune top-k based on your evaluation metrics: increase if recall is low, decrease if precision is low.
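
Precision and recall at k are straightforward to compute once you have a labeled set of relevant chunks per query. The function below is a minimal sketch; precision_recall_at_k is a hypothetical name and the inputs are assumed to be plain lists and sets of chunk IDs.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Compute precision@k and recall@k for one query.

    retrieved_ids: ranked list of IDs returned by the retriever.
    relevant_ids: set of IDs judged relevant for the query.
    """
    top = retrieved_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall
```

Averaging these two numbers over an evaluation set at several values of k shows where additional results stop adding relevant chunks and start adding noise.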

Add metadata filters to narrow search scope when applicable. If your knowledge base covers multiple products, filter by product name to avoid returning chunks from unrelated products. If your documents have temporal relevance, filter by date to prefer recent content. When the database applies the filter before or during the vector search, it reduces the search space and improves both speed and relevance. Some engines instead filter after the approximate search, which can return fewer than k results, so check how your database handles filtered queries.
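
The idea of filtering first and ranking second can be shown with a brute-force sketch. Real databases use indexes rather than a linear scan; filtered_search, the record layout, and the exact-match filter semantics are all assumptions made for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def filtered_search(query, records, filters, k=5):
    """Keep records whose metadata matches every filter, then rank by dot product.

    records: list of dicts with 'id', 'vector', and 'metadata' keys.
    filters: dict of metadata key -> required value (exact match).
    """
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in filters.items())
    ]
    candidates.sort(key=lambda r: dot(query, r["vector"]), reverse=True)
    return [r["id"] for r in candidates[:k]]
```

Because non-matching records are discarded before scoring, a close vector from the wrong product can never displace a relevant one.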

For hybrid search, configure both vector search and keyword search (BM25) on the same collection. Qdrant and Weaviate both support hybrid search natively. For pgvector, combine the vector similarity search with PostgreSQL full-text search using tsvector columns. Merge the two ranked lists using reciprocal rank fusion, which scores each document by its rank in each list and therefore favors documents that rank well in both. Hybrid search significantly improves retrieval for queries containing specific identifiers, product names, or technical terms that embedding models may not handle precisely.
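
Reciprocal rank fusion is short enough to implement yourself when the database does not do it for you. The sketch below assumes each input is a ranked list of document IDs; the constant k=60 is the value commonly used for RRF, and reciprocal_rank_fusion is a hypothetical function name.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked ID lists: each appearance adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

For example, fusing a vector result list ["a", "b", "c"] with a keyword result list ["b", "d"] puts "b" first, because it is the only document that appears in both lists.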

Set a minimum similarity threshold to filter out results that are not relevant enough. If the highest-scoring result has a cosine similarity of 0.85 and the fifth result has a similarity of 0.4, the fifth result is likely not useful. A threshold of 0.5 to 0.6 works as a starting point for most embedding models, though the optimal value depends on your specific model and content domain.
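
If your database does not accept a score threshold directly, the cutoff can be applied to the returned results. This sketch assumes results arrive as (id, score) pairs where a higher score means more similar; apply_threshold is a hypothetical helper and 0.5 is only the starting value suggested above.

```python
def apply_threshold(results, min_score=0.5):
    """Drop results whose similarity score falls below min_score."""
    return [(doc_id, score) for doc_id, score in results if score >= min_score]
```

Note that a strict threshold can return an empty list, so the generator prompt should handle the case where no context was retrieved.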

Step 6: Optimize for Production

Performance tuning: Adjust the HNSW ef parameter (search-time exploration factor) to balance speed and recall. Higher ef values find more accurate results but take longer. For most RAG applications, ef values between 64 and 256 provide good results. Profile your query latency at different ef values against your evaluation set to find the right balance for your latency requirements.
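
A simple sweep makes the tradeoff measurable. In the sketch below, search(query, ef) is a placeholder for a wrapper around your client's query call, ground_truth holds the IDs an exact (non-approximate) search returns for each query, and sweep_ef is a hypothetical name.

```python
import time


def sweep_ef(search, queries, ground_truth, ef_values=(64, 128, 256)):
    """Measure mean recall and mean latency for each ef value.

    search(query, ef) -> list of IDs (placeholder for your client call).
    ground_truth[i] is the non-empty set of correct IDs for queries[i].
    """
    report = []
    for ef in ef_values:
        latencies, recalls = [], []
        for query, expected in zip(queries, ground_truth):
            start = time.perf_counter()
            found = search(query, ef)
            latencies.append(time.perf_counter() - start)
            recalls.append(len(set(found) & expected) / len(expected))
        report.append({
            "ef": ef,
            "mean_recall": sum(recalls) / len(recalls),
            "mean_latency_s": sum(latencies) / len(latencies),
        })
    return report
```

Pick the smallest ef whose recall meets your target; values beyond that point add latency without improving results.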

Memory management: Vector databases keep indexes in memory for fast search. Estimate the raw vector size as number of vectors multiplied by dimensions multiplied by 4 bytes (for float32), then budget roughly 1.5 to 2 times that figure in total to cover the HNSW graph structure and other overhead. For a collection of 1 million 1536-dimensional vectors, the raw vectors take about 6.1 GB, so expect approximately 9 to 12 GB of memory in total. Plan capacity accordingly and configure memory limits to prevent out-of-memory failures.
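
The arithmetic can be captured in a small helper. This is a rough sketch: estimate_memory_gb and overhead_factor are hypothetical names, the 1.5 to 2 multiplier is assumed to apply to the total footprint, and GB here means 10^9 bytes.

```python
def estimate_memory_gb(num_vectors, dimensions, overhead_factor=1.5, bytes_per_value=4):
    """Rough memory estimate: raw float32 vectors times an overhead multiplier."""
    raw_bytes = num_vectors * dimensions * bytes_per_value
    return raw_bytes * overhead_factor / 1e9


low = estimate_memory_gb(1_000_000, 1536, overhead_factor=1.5)
high = estimate_memory_gb(1_000_000, 1536, overhead_factor=2.0)
print(f"plan for roughly {low:.1f} to {high:.1f} GB")
```

Treat the result as a planning figure only; actual usage depends on the database, index parameters, and how much metadata is held in memory.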

Monitoring: Track query latency (p50, p95, p99), query throughput, index size, and memory usage. Set alerts for latency spikes and memory pressure. Most vector databases expose metrics through Prometheus endpoints or built-in dashboards. Connect these to your existing monitoring infrastructure for unified visibility.
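
If you collect raw query timings yourself, the percentiles can be computed with the standard library. This sketch assumes a list of at least two latency samples in milliseconds; latency_percentiles is a hypothetical helper, and a metrics system such as Prometheus would normally do this for you.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return p50, p95, and p99 from a list of latency samples."""
    # n=100 yields 99 cut points; index i-1 is the i-th percentile
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Watching p95 and p99 alongside p50 matters because tail latency is usually what users notice first when an index outgrows available memory.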

Backup and recovery: Configure regular snapshots of your vector database. For Qdrant, use the snapshot API to create point-in-time backups. For Weaviate, configure the backup module to write to S3 or local storage. For pgvector, standard PostgreSQL backup tools (pg_dump, WAL archiving) handle vector data alongside relational data. Test your restore process before you need it in a real failure scenario.

Scaling: As your knowledge base grows, consider sharding your collection across multiple nodes. Qdrant and Weaviate both support distributed deployments with automatic sharding and replication. For pgvector, use PostgreSQL partitioning to split large tables. Plan your sharding strategy before you hit performance limits, since migrating a live collection is more complex than setting up sharding from the start.

Database-Specific Considerations

Each vector database has unique features that affect how you design your RAG retrieval layer. Pinecone supports namespaces that logically partition vectors within a single index, which is useful for multi-tenant applications where each customer has their own knowledge base. Qdrant supports multiple named vectors per point, allowing you to store embeddings from different models (for example, a general-purpose model and a domain-specific model) and query against either one. Weaviate supports multi-modal search with both text and image vectors in the same collection, which is valuable for knowledge bases containing visual content.

pgvector has a unique advantage in transactional consistency. Because it runs inside PostgreSQL, your vector inserts participate in database transactions alongside metadata updates. This means you can atomically insert a vector and its associated metadata in a single transaction, eliminating the consistency issues that can arise when a vector database and a metadata store are separate systems. For applications where data consistency matters more than maximum search throughput, pgvector provides guarantees that standalone vector databases do not.
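
The all-or-nothing behavior can be demonstrated with any transactional SQL database. The sketch below uses Python's built-in sqlite3 module purely as a stand-in so that it runs without a server: it is not pgvector, the embedding is stored as JSON text rather than a vector column, and the table and function names are hypothetical. With PostgreSQL the same pattern applies through your usual driver.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, title TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE chunks ("
    "id TEXT PRIMARY KEY, document_id TEXT NOT NULL, embedding TEXT NOT NULL)"
)
conn.commit()


def insert_document(conn, doc_id, title, chunks):
    """Insert a document and its chunk vectors atomically.

    chunks: list of (chunk_id, vector). Either every row is stored or none is.
    """
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("INSERT INTO documents VALUES (?, ?)", (doc_id, title))
        conn.executemany(
            "INSERT INTO chunks VALUES (?, ?, ?)",
            [(cid, doc_id, json.dumps(vec)) for cid, vec in chunks],
        )
```

If any chunk insert fails, for example on a duplicate ID, the document row is rolled back with it, so the metadata and the vectors never drift apart.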

Common Setup Mistakes

The most frequent setup mistake is mismatching vector dimensions between the embedding model and the collection configuration. This usually causes immediate errors, but some databases silently pad or truncate vectors, leading to degraded search quality without obvious errors. Always verify dimensions match exactly.

Another common mistake is not including the original text in the vector metadata. Without it, the retriever has to perform a separate lookup to get the text content for each retrieved chunk, adding latency and complexity. Store the chunk text alongside its vector so the retrieval response contains everything the generator needs.

Skipping metadata indexing is also problematic. If you plan to filter by document type or date range during search, make sure those metadata fields are indexed in the database. Without an index, a filtered query can fall back to checking the metadata of every vector in the collection, turning a millisecond query into a multi-second one at scale.

Key Takeaway

Start with the simplest database that meets your requirements (Chroma for prototyping, pgvector if you already run PostgreSQL, Pinecone for managed simplicity, Qdrant or Weaviate for self-hosted production). Configure dimensions and distance metric to match your embedding model exactly, batch your insertions, enable hybrid search for better retrieval, and plan memory and scaling before you hit limits.