How to Set Up Self-Hosted AI Agents with Docker
Docker Compose is the recommended approach for single-machine deployments. It lets you define your entire stack in a single YAML file, start everything with one command, and manage all services as a unified application. For multi-machine or high-availability deployments, the same container images work with Kubernetes or Docker Swarm, but Docker Compose is the right starting point for most teams.
Step 1: Install Docker and NVIDIA Container Toolkit
Start with Docker Engine, not Docker Desktop. Docker Engine is free, lighter weight, and better suited for server deployments. On Ubuntu, add the official Docker repository and install docker-ce, docker-ce-cli, containerd.io, and docker-compose-plugin. Then add your user to the docker group so you can run Docker commands without sudo; the change takes effect after you log out and back in. Keep in mind that membership in the docker group is effectively root-level access to the host.
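The installation described above can be sketched roughly as follows. This follows the apt-based procedure from Docker's Ubuntu instructions as I understand it; the function name install_docker_engine is hypothetical, and the commands are wrapped in a function so that nothing touches the host until you call it.

```shell
# Sketch: install Docker Engine on Ubuntu from Docker's official apt repository.
# install_docker_engine is a hypothetical helper name; nothing runs until it is called.
install_docker_engine() {
  # Prerequisites and Docker's signing key.
  sudo apt-get update &&
  sudo apt-get install -y ca-certificates curl &&
  sudo install -m 0755 -d /etc/apt/keyrings &&
  sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
    -o /etc/apt/keyrings/docker.asc &&
  sudo chmod a+r /etc/apt/keyrings/docker.asc || return 1

  # Register the repository for this machine's architecture and Ubuntu release.
  echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" |
    sudo tee /etc/apt/sources.list.d/docker.list > /dev/null || return 1

  # Install the engine, CLI, containerd, and the Compose plugin.
  sudo apt-get update &&
  sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin || return 1

  # Allow the current user to run docker without sudo (applies at next login).
  sudo usermod -aG docker "$(id -un)"
}
```

Call install_docker_engine once on a fresh host, then log out and back in before running docker without sudo.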
Next, install the NVIDIA Container Toolkit. This package lets Docker containers access your NVIDIA GPU. Add the NVIDIA Container Toolkit repository to your package manager, install nvidia-container-toolkit, register the NVIDIA runtime with Docker by running sudo nvidia-ctk runtime configure --runtime=docker, then restart the Docker daemon. Verify the installation by running the nvidia/cuda base image with the --gpus all flag and executing nvidia-smi inside it. You should see your GPU listed in the output.
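A minimal sketch of the toolkit installation and the verification run, assuming an apt-based host. Both function names are hypothetical, and the CUDA image tag is only an assumed default; override CUDA_IMAGE with a tag that matches your driver.

```shell
# Sketch: install the NVIDIA Container Toolkit and register it with Docker.
# Function names are hypothetical; nothing runs until you call them.
install_nvidia_container_toolkit() {
  # Add NVIDIA's signing key and apt repository.
  curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey |
    sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg || return 1
  curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list |
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' |
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list > /dev/null || return 1

  # Install, write the runtime entry into Docker's daemon config, restart Docker.
  sudo apt-get update &&
  sudo apt-get install -y nvidia-container-toolkit &&
  sudo nvidia-ctk runtime configure --runtime=docker &&
  sudo systemctl restart docker
}

verify_gpu_in_container() {
  # The default tag below is an assumption; pick one compatible with your driver.
  docker run --rm --gpus all "${CUDA_IMAGE:-nvidia/cuda:12.4.1-base-ubuntu22.04}" nvidia-smi
}
```

Run install_nvidia_container_toolkit first, then verify_gpu_in_container; the second call should print the same GPU table that nvidia-smi prints on the host.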
If the GPU does not appear, verify that your NVIDIA drivers are installed correctly on the host (nvidia-smi should work outside Docker), and that the container toolkit's daemon configuration is correct. The most common issue is forgetting to restart the Docker daemon after installing the toolkit.
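The two checks above can be scripted in order, so the first failing layer is reported. This is a sketch with a hypothetical function name; it only inspects the host driver and the runtimes that docker info reports.

```shell
# Sketch: narrow down where GPU passthrough is failing.
# diagnose_gpu_passthrough is a hypothetical helper name.
diagnose_gpu_passthrough() {
  # 1. The driver must work on the host before Docker can use it.
  if ! nvidia-smi > /dev/null 2>&1; then
    echo "FAIL: nvidia-smi does not work on the host; fix the driver install first"
    return 1
  fi

  # 2. The nvidia runtime must be registered with the Docker daemon.
  if ! docker info 2> /dev/null | grep -i "runtimes" | grep -qi "nvidia"; then
    echo "FAIL: nvidia runtime not registered; run nvidia-ctk runtime configure and restart Docker"
    return 1
  fi

  echo "OK: host driver works and Docker lists the nvidia runtime"
}
```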
Step 2: Configure GPU Passthrough
Docker Compose uses the deploy section to allocate GPU resources to containers. In your docker-compose.yml, services that need GPU access include a deploy.resources.reservations.devices block that specifies GPU capabilities. This tells Docker to pass the NVIDIA GPU to that specific container.
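As an illustration, the snippet below writes a compose fragment containing such a block. The file name docker-compose.gpu.yml and the OLLAMA_TAG variable are assumptions made for this sketch; in practice you would fold the deploy section into the inference service of your own docker-compose.yml.

```shell
# Sketch: a compose fragment that reserves one NVIDIA GPU for the inference service.
# The quoted heredoc keeps ${...} literal so Compose, not the shell, expands it.
cat > docker-compose.gpu.yml <<'EOF'
services:
  ollama:
    # OLLAMA_TAG is a placeholder variable; Compose refuses to start if it is unset.
    image: ollama/ollama:${OLLAMA_TAG:?pin a specific version}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1              # or "all" to pass every GPU
              capabilities: [gpu]
EOF
```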
For a single-GPU system, allocate the GPU to your inference server container. Other containers (orchestration platform, databases, monitoring) run on CPU and do not need GPU access. If you have multiple GPUs, you can allocate specific GPUs to different services using device IDs, or pass all GPUs to the inference server for model parallelism.
Set the NVIDIA_VISIBLE_DEVICES environment variable in your inference container to control which GPUs are accessible. Set it to "all" to expose all GPUs, or to a comma-separated list of device indices (for example 0 or 0,1) or GPU UUIDs to expose individual GPUs.
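For a two-GPU host, the two approaches might look like the fragment below. The service names, file name, and INFERENCE_IMAGE variable are hypothetical. The first service pins a GPU with device_ids (which replaces count; the two are not used together), while the second assumes the nvidia runtime is registered with Docker and selects its GPU through the environment variable.

```shell
# Sketch: split two GPUs between two services (names and image variable are placeholders).
cat > docker-compose.multi-gpu.yml <<'EOF'
services:
  chat-model:
    image: ${INFERENCE_IMAGE:?set to a pinned inference image}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]     # first GPU only
              capabilities: [gpu]

  embedding-model:
    image: ${INFERENCE_IMAGE:?set to a pinned inference image}
    runtime: nvidia                 # requires the nvidia runtime in Docker's daemon config
    environment:
      NVIDIA_VISIBLE_DEVICES: "1"   # second GPU only
EOF
```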
Step 3: Create Your Docker Compose Stack
A typical AI agent stack in Docker Compose includes three to five services. The inference server (Ollama or vLLM) handles model loading and text generation. The orchestration platform (Dify, Flowise, or n8n) manages agent workflows. A database (PostgreSQL) stores application data, conversation history, and optionally vector embeddings via the pgvector extension. Redis provides caching and session management. Optionally, a monitoring service (Langfuse or Grafana) tracks performance and usage.
Each service should specify: the container image and version tag (avoid using "latest" in production, pin specific versions), environment variables for configuration, volume mounts for persistent data, port mappings for services that need external access, health check definitions, and restart policies.
For the inference server, mount a volume for model storage so downloaded models persist across container restarts. For databases, mount a volume for data directories. For the orchestration platform, mount volumes for any configuration files or uploaded documents.
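Putting these pieces together, a stack of this shape could be sketched as below. Treat it as a starting point under stated assumptions rather than a reference configuration: the orchestrator image, its port 3000, its /data path, and its environment variable names are all hypothetical placeholders, and the Ollama health check assumes the ollama CLI is usable inside the container.

```shell
# Sketch: a four-service stack. ${...} values are expanded by Compose from .env.
cat > docker-compose.yml <<'EOF'
services:
  ollama:
    image: ollama/ollama:${OLLAMA_TAG:?pin a specific version}
    restart: unless-stopped
    volumes:
      - ollama_models:/root/.ollama          # downloaded models persist here
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "ollama", "list"]        # assumption: CLI can reach the local server
      interval: 30s
      timeout: 10s
      retries: 5

  postgres:
    image: pgvector/pgvector:pg16            # PostgreSQL with the pgvector extension
    restart: unless-stopped
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:?set in .env}
    volumes:
      - pg_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    restart: unless-stopped
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

  orchestrator:
    # Placeholder: set to a pinned Dify, Flowise, or n8n image.
    image: ${ORCHESTRATOR_IMAGE:?set to a pinned orchestration image}
    restart: unless-stopped
    ports:
      - "3000:3000"                          # hypothetical web UI port
    environment:                             # hypothetical variable names
      OLLAMA_BASE_URL: http://ollama:11434
      DATABASE_URL: postgresql://postgres:${POSTGRES_PASSWORD:?set in .env}@postgres:5432/postgres
      REDIS_URL: redis://redis:6379
    volumes:
      - orchestrator_data:/data              # hypothetical data path
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
      ollama:
        condition: service_healthy

volumes:
  ollama_models:
  pg_data:
  redis_data:
  orchestrator_data:
EOF
```

Real orchestration platforms usually ship their own compose file with more services and different variable names, so check the platform's documentation before adapting this layout.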
Step 4: Configure Volumes and Networking
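Define named volumes for each service's persistent data. Named volumes are managed by Docker and survive container rebuilds, upgrades, and restarts. Prefer them over bind mounts for database data in production: they avoid host-path ownership and permission problems and keep the data isolated from the rest of the host filesystem. On Docker Desktop (macOS and Windows) they are also noticeably faster than bind mounts; on a Linux host the performance is comparable.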
Docker Compose automatically creates a network for your stack. Services within the same compose file reach each other by service name. For example, your inference server might be reachable at http://ollama:11434 from other containers, and your database at postgresql://postgres:5432. Configure your orchestration platform's environment variables with these internal service URLs.
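One way to confirm that service names resolve is to run a lookup from inside a running container. The helper below is a hypothetical sketch; it assumes the source container's image ships getent, which Debian-based images such as the official PostgreSQL image generally do.

```shell
# Sketch: resolve one service's name from inside another service's container.
# check_service_dns is a hypothetical helper name.
check_service_dns() {
  from_service="${1:-}"   # container to run the lookup in
  target="${2:-}"         # service name to resolve
  if [ -z "$from_service" ] || [ -z "$target" ]; then
    echo "usage: check_service_dns FROM_SERVICE TARGET_SERVICE" >&2
    return 2
  fi
  docker compose exec "$from_service" getent hosts "$target"
}
```

For example, check_service_dns postgres ollama should print the internal IP address that the name ollama resolves to on the compose network.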
For external access, expose only the ports you need. Typically, you expose the orchestration platform's web interface (port 80 or 443) and possibly the inference server's API (port 11434 or 8000) if you call it from outside the Docker network. Do not expose database ports to the internet.
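In compose terms this comes down to which services carry a ports entry. The fragment below is a sketch: the file name, the orchestrator service name, and its container port 3000 are assumptions. Binding the inference API to 127.0.0.1 keeps it reachable from the host itself without publishing it on external interfaces.

```shell
# Sketch: publish only what needs to be reachable from outside the compose network.
cat > docker-compose.ports.yml <<'EOF'
services:
  orchestrator:
    ports:
      - "80:3000"                   # web UI; container port 3000 is a placeholder

  ollama:
    ports:
      - "127.0.0.1:11434:11434"     # host-only access to the inference API

  # postgres and redis get no "ports:" entry at all. Other services still reach
  # them over the compose network at postgres:5432 and redis:6379.
EOF
```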
For production, place a reverse proxy (Nginx, Caddy, or Traefik) in front of your services. The reverse proxy handles TLS termination, authentication, and request routing. Traefik integrates natively with Docker and can automatically obtain Let's Encrypt certificates.
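As one possible shape for this, the sketch below uses Caddy, which obtains certificates automatically for publicly resolvable hostnames. The domain agents.example.com, the upstream orchestrator:3000, and the file names are placeholders for illustration.

```shell
# Sketch: Caddy as the TLS-terminating reverse proxy in front of the web UI.
cat > Caddyfile <<'EOF'
agents.example.com {
    reverse_proxy orchestrator:3000
}
EOF

cat > docker-compose.proxy.yml <<'EOF'
services:
  caddy:
    image: caddy:2
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - caddy_data:/data            # certificates and ACME account state

volumes:
  caddy_data:
EOF
```

With the proxy in place, the orchestration platform no longer needs its own published port; only Caddy's ports 80 and 443 face the network.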
Step 5: Deploy and Verify
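Start your stack with docker compose up -d (the -d flag runs in detached mode). Docker pulls any missing images, creates volumes and networks, and starts services in the order implied by their depends_on entries. By default depends_on waits only for a dependency's container to start, not for the service inside it to be ready; pair it with condition: service_healthy and a health check when a service must wait for a ready database. Monitor the startup with docker compose logs -f to watch for errors.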
Verify each component independently. Test the inference server by sending a direct API request using curl. Test the orchestration platform by accessing its web interface. Test the database by connecting with a database client. If any service fails to start, check its logs with docker compose logs [service_name] for error details.
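The deployment and per-component checks could be strung together as in the sketch below. It assumes services named postgres and ollama, Ollama's API published on localhost:11434, and a web UI on localhost:3000; the function name and the OLLAMA_URL and UI_URL override variables are hypothetical.

```shell
# Sketch: start the stack, then probe each component on its own.
# deploy_and_verify is a hypothetical helper name.
deploy_and_verify() {
  docker compose up -d || return 1
  docker compose ps

  # Inference server: Ollama lists installed models at /api/tags.
  curl -fsS "${OLLAMA_URL:-http://localhost:11434}/api/tags" > /dev/null &&
    echo "inference server: ok" || echo "inference server: NOT reachable"

  # Database: pg_isready ships with the PostgreSQL image.
  docker compose exec -T postgres pg_isready -U postgres ||
    echo "database: NOT ready (see: docker compose logs postgres)"

  # Orchestration platform: any successful HTTP response counts.
  curl -fsS -o /dev/null "${UI_URL:-http://localhost:3000/}" &&
    echo "orchestration UI: ok" || echo "orchestration UI: NOT reachable"
}
```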
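Verify GPU access by checking the inference server logs for GPU detection messages. Ollama's startup logs report the compute backend it detected; look for a CUDA entry that names your GPU. Metal acceleration applies only to Ollama running natively on macOS, because containers on macOS run inside a Linux VM and cannot use the Apple GPU. vLLM logs show the detected GPU model and available VRAM. If logs show CPU-only operation, revisit the GPU passthrough configuration.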
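A quick way to run this check from the command line is sketched below. The function name is hypothetical, the default service name ollama is an assumption, and the grep pattern is only a loose filter, since the exact log wording varies between inference servers and versions.

```shell
# Sketch: confirm the inference container can see the GPU and mentions it in its logs.
# check_inference_gpu is a hypothetical helper name.
check_inference_gpu() {
  svc="${1:-ollama}"

  # The container toolkit injects nvidia-smi into GPU-enabled containers.
  docker compose exec "$svc" nvidia-smi \
    --query-gpu=name,memory.used,memory.total --format=csv || return 1

  # Show recent log lines that mention the GPU (loose, case-insensitive match).
  docker compose logs "$svc" 2>&1 | grep -i -E "cuda|gpu|vram" | tail -n 20
}
```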
Run a complete end-to-end test by creating a simple agent in your orchestration platform and sending it a test query. Once the model is loaded, the response should come back within a few seconds; the first request can take noticeably longer while the inference server loads the model into VRAM. A successful reply confirms that the inference server, orchestration platform, and inter-service communication are all working correctly.
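Creating the agent itself depends on the platform, but the inference half of the path can be timed directly. The sketch below assumes Ollama's generate endpoint is reachable on localhost:11434; the function name and the OLLAMA_URL override are hypothetical, and the model argument must be a model you have already pulled.

```shell
# Sketch: send one non-streaming prompt to Ollama and report the total request time.
# smoke_test_inference is a hypothetical helper name.
smoke_test_inference() {
  model="${1:-}"
  prompt="${2:-Reply with the single word: ready}"
  if [ -z "$model" ]; then
    echo "usage: smoke_test_inference MODEL [PROMPT]" >&2
    return 2
  fi

  # Build the JSON body; keep the prompt free of double quotes in this simple sketch.
  body=$(printf '{"model": "%s", "prompt": "%s", "stream": false}' "$model" "$prompt")

  curl -fsS "${OLLAMA_URL:-http://localhost:11434}/api/generate" \
    -d "$body" \
    -w '\ntotal seconds: %{time_total}\n'
}
```

Running it twice shows the difference between a cold start, which includes loading the model, and a warm request.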
Production Considerations
Restart policies: Set restart: unless-stopped or restart: always for all services so they automatically recover from crashes or system reboots. Recovery after a reboot also requires the Docker service itself to be enabled at boot.
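Resource limits: Define memory limits for non-GPU services to prevent any single container from consuming all system RAM. Container memory limits apply to system RAM only and do not restrict GPU VRAM. The inference server still needs plenty of host memory for loading model weights and for any layers offloaded to the CPU, so leave it unconstrained or set its limit generously.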
Logging: Configure log rotation to prevent container logs from consuming all disk space. Docker's json-file logging driver supports max-size and max-file options that automatically rotate and clean up old logs.
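The restart, memory, and logging settings above might be combined as in this fragment. The file name and the specific sizes are illustrative assumptions, not recommendations; the YAML anchor simply avoids repeating the logging block for every service.

```shell
# Sketch: restart policies, memory limits, and log rotation in one compose fragment.
cat > docker-compose.prod.yml <<'EOF'
x-logging: &default-logging
  driver: json-file
  options:
    max-size: "10m"     # rotate each log file at 10 MB (illustrative value)
    max-file: "3"       # keep at most three rotated files per container

services:
  postgres:
    restart: unless-stopped
    mem_limit: 2g       # illustrative; size to your workload
    logging: *default-logging

  redis:
    restart: unless-stopped
    mem_limit: 512m     # illustrative
    logging: *default-logging

  ollama:
    restart: unless-stopped
    logging: *default-logging   # no memory limit on the inference server
EOF
```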
Backups: Back up Docker volumes regularly. For databases, use the database's native backup tools (pg_dump for PostgreSQL) rather than copying volume files directly, as file-level copies may be inconsistent if the database is running during the copy.
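A minimal backup sketch along these lines, assuming a service named postgres with the default postgres user. Both function names are hypothetical, and the volume archive helper is meant for non-database volumes such as model storage, ideally while the owning service is stopped.

```shell
# Sketch: logical database backup plus a file-level archive for other volumes.
# Function names are hypothetical.
backup_postgres() {
  db="${1:-postgres}"
  out="${db}-$(date +%Y%m%d-%H%M%S).sql"

  # -T disables TTY allocation so the dump can be redirected cleanly.
  docker compose exec -T postgres pg_dump -U postgres "$db" > "$out" ||
    { rm -f "$out"; return 1; }

  gzip -f "$out" && echo "${out}.gz"
}

backup_volume() {
  vol="${1:-}"
  if [ -z "$vol" ]; then
    echo "usage: backup_volume VOLUME_NAME" >&2
    return 2
  fi
  # Mount the volume read-only and archive it into the current directory.
  docker run --rm -v "${vol}:/data:ro" -v "$(pwd):/backup" alpine:3.20 \
    tar czf "/backup/${vol}.tar.gz" -C /data .
}
```

backup_postgres prints the path of the compressed dump it wrote, which makes it easy to hand the file to an off-host copy step.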
Updates: To update a service, pull the new image (docker compose pull [service_name]), then recreate the container (docker compose up -d [service_name]). Docker Compose handles stopping the old container and starting the new one with the same volumes and configuration. Always back up data before updating.
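The update sequence can be wrapped in a small helper so the two commands always run together. The function name is hypothetical; it assumes you have already changed the pinned image tag in your compose file and taken a backup.

```shell
# Sketch: pull the new image for one service, recreate it, and show its status.
# update_service is a hypothetical helper name.
update_service() {
  svc="${1:-}"
  if [ -z "$svc" ]; then
    echo "usage: update_service SERVICE_NAME" >&2
    return 2
  fi
  docker compose pull "$svc" &&
  docker compose up -d "$svc" &&
  docker compose ps "$svc"
}
```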
Docker Compose packages your entire AI agent stack into a reproducible, portable configuration. Install Docker and the NVIDIA Container Toolkit, define your services in a compose file with GPU passthrough for the inference server, configure persistent volumes, and deploy with a single command.