How to Build a Self-Hosted AI Stack from Scratch
Before You Start
You need a Linux, macOS, or Windows machine with at least 16 GB of RAM and 50 GB of free disk space. A dedicated GPU with 6 GB or more of VRAM is recommended for responsive inference but not required. These instructions assume you are comfortable with the command line and have administrative access to your machine. All components run in Docker containers, so the host operating system matters little as long as Docker is supported. The one exception is GPU acceleration: NVIDIA GPU passthrough works on Linux and on Windows through WSL 2, while Docker on macOS cannot use the Mac's GPU, so containers there run inference on the CPU.
The complete stack (Ollama, Open WebUI, Qdrant, n8n) uses approximately 2 GB of RAM beyond the model inference requirements. With a 7B model loaded, total RAM usage is about 6 to 8 GB. With a 13B model, it rises to 10 to 14 GB. Plan your hardware accordingly and leave headroom for the operating system and other applications.
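The figures above can be turned into a quick headroom check. The sketch below is a rough estimate only: the per-model totals are the upper ends of the ranges quoted above, and the 4 GB reserve for the operating system is an assumed value, not a measurement.

```python
# Upper-end totals from the text: stack services plus a loaded model, in GB.
TOTAL_WITH_MODEL_GB = {"7b": 8, "13b": 14}

# Assumed reserve for the operating system and other applications (hypothetical value).
OS_RESERVE_GB = 4


def headroom_gb(model_size: str, installed_ram_gb: int) -> int:
    """GB left after the stack, the model, and the OS reserve (negative if short)."""
    return installed_ram_gb - TOTAL_WITH_MODEL_GB[model_size] - OS_RESERVE_GB


def fits_in_ram(model_size: str, installed_ram_gb: int) -> bool:
    """Return True if the stack, the model, and the OS reserve fit in installed RAM."""
    return headroom_gb(model_size, installed_ram_gb) >= 0
```

On a 16 GB machine this estimate leaves room for a 7B model but not a 13B model, which matches the advice to plan hardware around the model you intend to run.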
Step 1: Install Docker and Docker Compose
Docker provides the container runtime that isolates each stack component. Install Docker Engine following the official documentation for your operating system. On Ubuntu and Debian, this involves adding Docker's package repository and installing the docker-ce package. On macOS, install Docker Desktop. On Windows, install Docker Desktop with the WSL 2 backend for best performance with GPU passthrough.
Verify the installation by running docker --version and docker compose version from the command line. Both commands should return version numbers without errors. If you plan to use GPU acceleration, install the NVIDIA Container Toolkit (the nvidia-container-toolkit package, which replaces the deprecated nvidia-docker2), which enables Docker containers to access your NVIDIA GPU. Verify GPU access with docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi, which should display your GPU information. CUDA image tags include the full version number and a base OS suffix, so substitute a tag that matches your driver if needed.
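If you want to script this check, a minimal sketch follows. It runs the two version commands named above and extracts the version numbers. The function names are illustrative, and the parser simply assumes the banner contains a major.minor.patch triple.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

Version = Tuple[int, int, int]


def parse_version(output: str) -> Optional[Version]:
    """Extract the first major.minor.patch triple from a version banner."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return None
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def check_docker() -> dict:
    """Run both version commands; map each name to a parsed version or None."""
    commands = {
        "docker": ["docker", "--version"],
        "compose": ["docker", "compose", "version"],
    }
    if shutil.which("docker") is None:
        # The docker binary is not on PATH, so neither command can succeed.
        return {name: None for name in commands}
    results = {}
    for name, cmd in commands.items():
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[name] = parse_version(proc.stdout) if proc.returncode == 0 else None
    return results
```

Calling check_docker() returns a dictionary with one entry per command. A None value means the command was missing or failed.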
Step 2: Set Up Ollama for Model Inference
Ollama can run directly on the host or inside a Docker container. For a Docker-based stack, run Ollama as a container with GPU access. Create a directory for your stack configuration and add Ollama to your docker-compose.yml file. The Ollama container mounts a local directory for model storage (so models persist across container restarts) and exposes port 11434 for API access.
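The service described above can be sketched as data. The example builds the Ollama service definition as a Python dictionary and serializes it to JSON, which is also valid YAML and should therefore be readable by Docker Compose (worth confirming with your Compose version). The host directory ./ollama-models and the restart policy are assumptions. The container path /root/.ollama is where the official image keeps its models.

```python
import json


def ollama_service(models_dir: str = "./ollama-models", use_gpu: bool = True) -> dict:
    """Build a Compose service definition for Ollama as a plain dict."""
    service = {
        "image": "ollama/ollama",
        "ports": ["11434:11434"],  # API port from the text
        "volumes": [f"{models_dir}:/root/.ollama"],  # models persist on the host
        "restart": "unless-stopped",  # assumed policy, adjust to taste
    }
    if use_gpu:
        # Compose device reservation for NVIDIA GPUs; needs the NVIDIA Container Toolkit.
        service["deploy"] = {
            "resources": {
                "reservations": {
                    "devices": [
                        {"driver": "nvidia", "count": "all", "capabilities": ["gpu"]}
                    ]
                }
            }
        }
    return service


def compose_document(services: dict) -> str:
    """Serialize a services mapping as JSON, which is also valid YAML."""
    return json.dumps({"services": services}, indent=2)
```

Printing compose_document({"ollama": ollama_service()}) gives a document you can save as your Compose file, or translate by hand into the more familiar YAML layout. Pass use_gpu=False on machines without an NVIDIA GPU.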
Start the Ollama container and pull your first model. A good starting choice is llama3.1:8b for a balance of quality and speed, or qwen2.5:7b for strong multilingual and reasoning capabilities. The download takes a few minutes depending on your connection speed (models of 7B to 8B parameters at 4-bit quantization are typically 4 to 5 GB). Verify that inference works by sending a test prompt through the API, or by using docker exec to run ollama run llama3.1:8b inside the container.
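Test the API endpoint directly: send a POST request to http://localhost:11434/api/generate with a JSON body containing the model name and your prompt. By default Ollama streams the answer as newline-delimited JSON objects, each carrying a fragment of generated text, with the final object marked as done. Set stream to false in the request body if you prefer a single JSON response. If you receive generated text, Ollama is working correctly. The same Ollama API on port 11434 is what all other stack components will use to access model inference.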
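A minimal sketch of that test using only the Python standard library is shown below. It assumes Ollama is reachable on localhost:11434 and that llama3.1:8b has already been pulled. The function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build the POST request for Ollama's generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def join_stream(lines) -> str:
    """Concatenate the 'response' fragments of a newline-delimited JSON stream."""
    parts = []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk of the stream is marked done
    return "".join(parts)


def generate(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send the prompt to Ollama and return the full generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return join_stream(resp)  # an HTTP response iterates line by line
```

Calling generate("Say hello in one sentence.") should return the model's reply as a single string once the stream completes.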
Step 3: Add Open WebUI for a Chat Interface
Add the Open WebUI container to your docker-compose.yml file. Configure it to connect to Ollama's API endpoint (http://ollama:11434 when both containers share a Docker network and the Ollama service is named ollama). Open WebUI stores its data (conversations, user accounts, settings) in a mounted volume. The application listens on port 8080 inside the container, so map host port 3000 (or your preferred port) to container port 8080 for web access.
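As an illustration of this configuration, the sketch below builds the Open WebUI service definition as a Python dictionary and prints it as JSON, which is also valid YAML. It assumes the Ollama service is named ollama, that the image listens on container port 8080, and that ./open-webui-data is an acceptable host directory. Adjust these to your setup.

```python
import json


def open_webui_service(host_port: int = 3000, ollama_service_name: str = "ollama") -> dict:
    """Build a Compose service definition for Open WebUI as a plain dict."""
    return {
        "image": "ghcr.io/open-webui/open-webui:main",
        # Host port on the left; 8080 is the port the app listens on in the container.
        "ports": [f"{host_port}:8080"],
        # Points Open WebUI at Ollama over the shared Docker network.
        "environment": {"OLLAMA_BASE_URL": f"http://{ollama_service_name}:11434"},
        # Conversations, accounts, and settings persist in this host directory (assumed path).
        "volumes": ["./open-webui-data:/app/backend/data"],
        "depends_on": [ollama_service_name],
    }


def as_compose_fragment(name: str, service: dict) -> str:
    """Serialize one service as a JSON fragment for the services section."""
    return json.dumps({name: service}, indent=2)
```

Printing as_compose_fragment("open-webui", open_webui_service()) gives a fragment to merge under the services key of your Compose file.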
Start the container and open http://localhost:3000 in your browser. Create your admin account on first access. Open WebUI automatically detects available models from Ollama and displays them in the model selector. Send a test message to verify that the full pipeline works: your browser sends the message to Open WebUI, which forwards it to Ollama, which generates a response that flows back through the same chain.
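If the model selector is empty, it can help to ask Ollama directly which models it has. The sketch below queries Ollama's /api/tags endpoint, assuming Ollama is published on localhost:11434. The function names are illustrative.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list:
    """Return the sorted model names from an /api/tags response body."""
    return sorted(model["name"] for model in tags_payload.get("models", []))


def list_models(base_url: str = "http://localhost:11434") -> list:
    """Fetch the list of locally available models from Ollama."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return model_names(json.load(resp))
```

An empty list from list_models() means no model has been pulled yet. A connection error points to the Ollama container or its port mapping rather than to Open WebUI.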
Configure Open WebUI's settings to your preferences: enable or disable user registration, set default model parameters (temperature, context length), configure web search integration if desired, and set up document upload handling for basic RAG. Each setting is accessible through the admin panel without editing configuration files.
Step 4: Add Qdrant for Vector Search
Add the Qdrant container to your docker-compose.yml file. Qdrant stores vector data in a mounted volume for persistence and exposes its REST API on port 6333 and gRPC on port 6334. The default configuration works well for most setups. Start the container and verify access by opening http://localhost:6333/dashboard in your browser, which shows Qdrant's built-in management interface.
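Beyond the dashboard, you can exercise the REST API itself. The sketch below builds a request that creates a collection and parses the response of the collection listing. The collection name docs_demo is hypothetical, and the vector size of 768 is an assumption that must match the embedding model you use.

```python
import json
import urllib.request

QDRANT_URL = "http://localhost:6333"


def create_collection_request(
    name: str, vector_size: int = 768, distance: str = "Cosine"
) -> urllib.request.Request:
    """Build the PUT request that creates a Qdrant collection."""
    body = json.dumps({"vectors": {"size": vector_size, "distance": distance}})
    return urllib.request.Request(
        f"{QDRANT_URL}/collections/{name}",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def collection_names(payload: dict) -> list:
    """Extract collection names from a GET /collections response body."""
    return [item["name"] for item in payload.get("result", {}).get("collections", [])]


def list_collections() -> list:
    """Ask the running Qdrant instance which collections exist."""
    with urllib.request.urlopen(f"{QDRANT_URL}/collections", timeout=10) as resp:
        return collection_names(json.load(resp))
```

Sending create_collection_request("docs_demo") with urllib.request.urlopen and then calling list_collections() should show the new collection, which confirms that the REST port is reachable and writable.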
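To use Qdrant with Open WebUI's RAG pipeline, set Qdrant as Open WebUI's vector store backend. In current Open WebUI releases this is done with environment variables on the Open WebUI container (VECTOR_DB=qdrant and QDRANT_URI=http://qdrant:6333, assuming the service is named qdrant) rather than in the admin panel; check the documentation for your version. This tells Open WebUI to embed uploaded documents and store the vectors in Qdrant rather than in its default embedded vector store (ChromaDB). The embedding model handles the text-to-vector conversion. To use one served by Ollama, typically nomic-embed-text, pull it through Ollama and select Ollama as the embedding engine in the document settings.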
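To see what embedding and storing involve, the two steps can be reproduced by hand. This is an illustrative sketch, not Open WebUI's own code: it embeds a text through Ollama's /api/embeddings endpoint and upserts the vector into Qdrant. The collection name you pass in, the integer point ID, and the payload layout are assumptions.

```python
import json
import urllib.request


def json_request(url: str, body: dict, method: str = "POST") -> urllib.request.Request:
    """Build a request with a JSON body."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


def embedding_request(text, model="nomic-embed-text", ollama_url="http://localhost:11434"):
    """Request an embedding vector for the text from Ollama."""
    return json_request(f"{ollama_url}/api/embeddings", {"model": model, "prompt": text})


def upsert_request(collection, point_id, vector, text, qdrant_url="http://localhost:6333"):
    """Store one vector in Qdrant, keeping the source text as payload."""
    body = {"points": [{"id": point_id, "vector": vector, "payload": {"text": text}}]}
    return json_request(f"{qdrant_url}/collections/{collection}/points", body, method="PUT")


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)


def embed_and_store(collection: str, point_id: int, text: str) -> dict:
    """Embed the text with Ollama, then upsert the vector into Qdrant."""
    vector = send(embedding_request(text))["embedding"]
    return send(upsert_request(collection, point_id, vector, text))
```

The collection must already exist with a vector size that matches the embedding model's output, otherwise Qdrant rejects the upsert.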
Test the RAG pipeline by uploading a document through Open WebUI, waiting for embedding to complete, and then asking a question about the document's content. The response should include information from the document that the base model would not know, confirming that retrieval is working correctly.
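The retrieval step being tested here can be sketched in a few lines. The example shows the shape of a Qdrant search request body and how retrieved chunks might be folded into a prompt. The prompt wording and the payload key text are assumptions, not what Open WebUI uses internally.

```python
def search_body(query_vector: list, limit: int = 3) -> dict:
    """Body for POST /collections/<name>/points/search in Qdrant's REST API."""
    return {"vector": query_vector, "limit": limit, "with_payload": True}


def build_rag_prompt(question: str, hits: list) -> str:
    """Combine retrieved chunks and the user's question into one prompt."""
    chunks = [
        hit["payload"]["text"]
        for hit in hits
        if "text" in hit.get("payload", {})  # skip hits without stored text
    ]
    context = "\n\n".join(chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

If the model answers correctly only when the context is present, retrieval is doing its job. If it answers the same way with an empty hit list, the answer is coming from the base model.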
Step 5: Add n8n for Workflow Orchestration
Add the n8n container to your docker-compose.yml file. n8n stores workflow definitions and execution history in a mounted volume. Expose port 5678 for the web interface. Configure n8n to access Ollama and Qdrant through the Docker network using their service names as hostnames. Start the container and create your admin account at http://localhost:5678.
Create your first AI workflow: add a Manual Trigger node, connect it to an AI Agent node configured with your Ollama model, and add an output node. Execute the workflow and verify that n8n successfully communicates with Ollama and returns a generated response. This confirms that the orchestration layer can access the inference layer through the Docker network.
Expand the workflow by adding tool nodes to the AI Agent. Connect a Qdrant Vector Store node to enable knowledge retrieval. Add an HTTP Request tool to let the agent fetch web content. Add a Code tool to let the agent run custom JavaScript or Python. Each tool expands the agent's capabilities while keeping the workflow visible and debuggable on the n8n canvas.
Step 6: Connect and Verify
With all components running, verify the full integration. Send a chat message through Open WebUI that requires knowledge from an uploaded document (testing Ollama plus Qdrant through the UI). Create an n8n workflow that processes a webhook, queries Qdrant for context, generates a response through Ollama, and sends the result to a test endpoint (testing n8n plus Qdrant plus Ollama programmatically). Both paths should produce accurate, context-aware responses.
Set up persistence and backup. Verify that all mounted volumes point to directories on your host that are included in your backup strategy. Restart the entire stack (docker compose down followed by docker compose up -d, without the -v flag, which would delete named volumes) and confirm that all data (conversations, models, vectors, workflows) survives the restart. This check confirms that persistence is configured correctly, which is a prerequisite for relying on the stack day to day.
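One way to make the restart check concrete is to record what exists before the restart and compare afterwards. The helper below is a minimal sketch. The category names and the way you collect the lists (by hand, or from the Ollama and Qdrant APIs) are up to you.

```python
def missing_after_restart(before: dict, after: dict) -> dict:
    """Return, per category, the items present before the restart but absent after."""
    missing = {}
    for category, items in before.items():
        lost = sorted(set(items) - set(after.get(category, [])))
        if lost:
            missing[category] = lost
    return missing
```

An empty result means every recorded item survived. Any entry names the category whose volume mapping deserves a second look.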
Finally, review resource usage. Run docker stats to see CPU, memory, and network usage for each container. Identify any container using more resources than expected and adjust Docker resource limits in your compose file if needed. A well-configured stack should idle at low resource usage and spike only during active inference.
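The review can be scripted with the JSON output of docker stats. The sketch below takes a single snapshot and flags containers above a memory threshold. The 50 percent threshold is an arbitrary example value.

```python
import json
import subprocess


def parse_stats(output: str) -> list:
    """Parse the line-per-container JSON that docker stats emits."""
    rows = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        rows.append(
            {
                "name": row["Name"],
                "cpu": float(row["CPUPerc"].rstrip("%")),
                "mem": float(row["MemPerc"].rstrip("%")),
            }
        )
    return rows


def over_limit(rows: list, mem_limit: float = 50.0) -> list:
    """Names of containers whose memory share exceeds the limit (in percent)."""
    return [row["name"] for row in rows if row["mem"] > mem_limit]


def read_stats() -> list:
    """Take one snapshot of all running containers."""
    proc = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{json .}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_stats(proc.stdout)
```

Running over_limit(read_stats()) while the stack is idle and again during inference shows which containers are responsible for the spikes.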
Build your stack one layer at a time, verifying each component works before adding the next. This incremental approach makes troubleshooting straightforward: if something breaks after adding a new component, the problem is in the new component or its connection to the existing ones.