Running Ollama in Docker

Updated May 2026
Running Ollama in Docker provides containerized isolation, reproducible deployments, and straightforward GPU passthrough for team and production environments. The official ollama/ollama Docker image supports NVIDIA GPU acceleration through the NVIDIA Container Toolkit, named volumes for persistent model storage, and Docker Compose orchestration with companion services like Open WebUI.

Prerequisites for GPU Access

Docker containers do not have GPU access by default. To run Ollama with GPU acceleration in Docker, install the NVIDIA Container Toolkit on the host. The toolkit provides the container runtime that exposes NVIDIA GPUs to Docker containers. Install it through your distribution's package manager by following NVIDIA's installation guide for your operating system, then register it with Docker by running sudo nvidia-ctk runtime configure --runtime=docker and restart the Docker daemon.

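Verify the toolkit is working by running docker run --rm --gpus all ubuntu nvidia-smi. The toolkit mounts nvidia-smi from the host driver into the container, so a plain ubuntu image is enough. If you prefer to test with a CUDA image, use a fully qualified tag such as nvidia/cuda:12.0.0-base-ubuntu22.04, since NVIDIA publishes its CUDA images under tags that include the patch version and base distribution. If you see your GPU information, the toolkit is configured correctly. If not, check that the NVIDIA drivers are installed on the host, that the container toolkit package is installed, and that the Docker daemon has been restarted after installation.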
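The troubleshooting order described above can be sketched as a small shell function. This is a minimal sketch rather than an official diagnostic: the function name check_gpu_passthrough and its messages are made up for illustration, and it assumes the driver's nvidia-smi and the toolkit's nvidia-ctk are on the host PATH when installed.

```shell
# Hypothetical helper: report the first missing prerequisite for GPU passthrough.
check_gpu_passthrough() {
  # The host driver ships nvidia-smi, so its absence suggests no driver.
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "missing: NVIDIA driver (nvidia-smi not found on host)"
    return 1
  fi
  # nvidia-ctk is the toolkit's command-line tool.
  if ! command -v nvidia-ctk >/dev/null 2>&1; then
    echo "missing: NVIDIA Container Toolkit (nvidia-ctk not found)"
    return 1
  fi
  # A throwaway container must be able to run nvidia-smi.
  if ! docker run --rm --gpus all ubuntu nvidia-smi >/dev/null 2>&1; then
    echo "failed: container cannot see the GPU; re-run 'nvidia-ctk runtime configure --runtime=docker' and restart Docker"
    return 1
  fi
  echo "ok: GPU is visible inside containers"
}
```

Calling check_gpu_passthrough on the host prints a single line naming the first check that failed, or the ok line when all three pass.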

For AMD GPUs, Ollama publishes a ROCm build of the image under the ollama/ollama:rocm tag. The setup differs from NVIDIA: instead of --gpus all, you pass the /dev/kfd and /dev/dri devices into the container with --device, and the NVIDIA Container Toolkit is not involved. Docker Desktop on Apple Silicon Macs does not support GPU passthrough to containers, so Ollama in Docker on macOS runs in CPU-only mode. For GPU acceleration on macOS, install Ollama natively instead of using Docker.
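For the AMD case, a minimal sketch of the run command looks like this. It assumes a Linux host with the ROCm kernel driver loaded so that /dev/kfd and /dev/dri exist; the function name start_ollama_rocm is only for illustration.

```shell
# Start the ROCm build of Ollama, passing the AMD kernel devices through.
start_ollama_rocm() {
  docker run -d \
    --device /dev/kfd \
    --device /dev/dri \
    -v ollama:/root/.ollama \
    -p 11434:11434 \
    --name ollama ollama/ollama:rocm
}
```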

Basic Docker Setup

The simplest way to run Ollama in Docker is a single command: docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama. This starts the Ollama server in the background with GPU access, maps the model storage directory to a named volume for persistence, and exposes the API on the standard port 11434.

The -v ollama:/root/.ollama flag creates a named Docker volume that stores model data outside the container's writable layer. Without it, models survive a plain restart but are lost whenever the container is removed and recreated, for example when upgrading to a newer image, and must then be downloaded again. Named volumes survive container deletion and can be backed up, migrated, or shared between containers.

Once the container is running, interact with it exactly as you would with a native Ollama installation. Use docker exec -it ollama ollama run llama4 to start an interactive chat session, or send API requests to http://localhost:11434 from the host machine. The containerized Ollama API is identical to the native version.

To pull models into the container, run docker exec -it ollama ollama pull qwen3:14b. The model downloads into the named volume and persists across container restarts. You can pre-pull multiple models as part of your deployment process to ensure they are available when the service starts receiving requests.
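The start, wait, and pull steps can be combined into a deployment script. The sketch below is one way to do it, assuming curl is available on the host; the function names and the OLLAMA_URL variable are illustrative, not part of Ollama. It drops the -it flags because deployment scripts usually run without a terminal attached.

```shell
# Base URL of the published API; override it if you map a different port.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Start the server container with GPU access and a named volume for models.
start_ollama() {
  docker run -d --gpus all \
    -v ollama:/root/.ollama \
    -p 11434:11434 \
    --name ollama ollama/ollama
}

# Poll the API root once per second until it answers or attempts run out.
wait_for_ollama() {
  attempts="${1:-30}"
  while [ "$attempts" -gt 0 ]; do
    if curl -fsS "$OLLAMA_URL/" >/dev/null 2>&1; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 1
  done
  echo "Ollama did not answer at $OLLAMA_URL" >&2
  return 1
}

# Pull every model named on the command line into the volume.
pull_models() {
  for model in "$@"; do
    docker exec ollama ollama pull "$model" || return 1
  done
}
```

A deployment would then run start_ollama && wait_for_ollama && pull_models qwen3:14b, stopping at the first step that fails.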

Docker Compose Configuration

Docker Compose simplifies multi-container setups that combine Ollama with other services. A typical composition pairs Ollama with Open WebUI to provide a ChatGPT-style web interface backed by local models. The Compose file defines both services, their networking, and their storage volumes in a single configuration.

A production-ready Docker Compose configuration includes the Ollama service with GPU reservation, a named volume for model storage, environment variables for tuning (like OLLAMA_NUM_PARALLEL and OLLAMA_KEEP_ALIVE), health checks to verify the service is responsive, and restart policies to recover from failures. The Open WebUI service connects to Ollama through Docker's internal network, with its own volume for conversation history and user data.

For GPU allocation in Compose, use the deploy.resources.reservations.devices block to specify GPU requirements. Set count to all or to a number of GPUs, or list specific GPUs under device_ids (count and device_ids cannot be combined). This gives you control over hardware allocation in multi-GPU systems where different services need different GPUs.
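A Compose file along these lines might look like the sketch below, written out with a shell heredoc so it can be dropped into a setup script. The tuning values, the host port 3000, and the health check command are illustrative choices rather than recommendations from Ollama or Open WebUI. The health check uses the ollama binary because the image may not include curl.

```shell
# Write a Compose file into the current directory.
cat > compose.yaml <<'EOF'
services:
  ollama:
    image: ollama/ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=4   # example value
      - OLLAMA_KEEP_ALIVE=30m   # example value
    healthcheck:
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 5
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all        # or replace with device_ids: ["0"]
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    restart: unless-stopped
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui:/app/backend/data
    depends_on:
      ollama:
        condition: service_healthy

volumes:
  ollama:
  open-webui:
EOF
```

Running docker compose up -d in the same directory starts both services; Open WebUI waits until the Ollama health check passes.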

Environment Variable Configuration

Docker is particularly well-suited for configuring Ollama through environment variables, as the -e flag or Compose environment section makes configuration explicit and version-controlled. Key variables include OLLAMA_HOST set to 0.0.0.0:11434 to accept connections from outside the container, OLLAMA_NUM_PARALLEL for concurrent request handling, OLLAMA_MAX_LOADED_MODELS to control memory usage, and OLLAMA_KEEP_ALIVE to manage model loading behavior.

The OLLAMA_HOST binding matters in Docker because Ollama's native default of 127.0.0.1 would only accept connections from inside the container. Binding to 0.0.0.0 allows connections from the host machine through the published port and from other containers on the Docker network. The official ollama/ollama image already sets OLLAMA_HOST to 0.0.0.0, so you only need to set it yourself if you build a custom image or have overridden the value.

For production deployments, consider setting OLLAMA_KEEP_ALIVE=-1 to keep models loaded indefinitely, eliminating the model loading delay on requests that arrive after the default 5-minute idle timeout. This trades memory usage for consistent response latency, which is typically the right trade-off for production services where memory is dedicated to model serving.
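One way to keep these settings explicit and version-controlled with plain docker run is an env file. The sketch below assumes the file lives next to the script; the numeric values are examples, not tuned recommendations, and start_ollama_with_env is an illustrative name.

```shell
# Keep tuning values in a file that can be committed with the deployment.
cat > ollama.env <<'EOF'
# Listen on all interfaces inside the container.
OLLAMA_HOST=0.0.0.0:11434
# Concurrent requests per model (example value).
OLLAMA_NUM_PARALLEL=4
# Maximum number of models held in memory at once (example value).
OLLAMA_MAX_LOADED_MODELS=2
# Never unload a model because it is idle.
OLLAMA_KEEP_ALIVE=-1
EOF

# Start Ollama with every variable from the file applied.
start_ollama_with_env() {
  docker run -d --gpus all \
    --env-file ollama.env \
    -v ollama:/root/.ollama \
    -p 11434:11434 \
    --name ollama ollama/ollama
}
```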

Networking and Security

Publishing port 11434 with -p 11434:11434 binds it on all host interfaces, and the Ollama API has no built-in authentication, so anyone who can reach the host machine over the network can use it. For team and production deployments, place a reverse proxy like nginx or Traefik in front of Ollama to add authentication, TLS encryption, and access control. The reverse proxy can run as another service in your Docker Compose configuration, creating a secure, self-contained deployment stack.

Docker's internal networking provides isolation between the Ollama service and external access. Services within the same Docker Compose network (like Open WebUI) can reach Ollama through the service name, while external access goes through published ports only. This separation lets you expose the web interface on port 443 with TLS while keeping the raw Ollama API internal, provided you do not publish Ollama's own port.
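The same separation can be sketched with plain Docker commands. This example assumes a user-defined network named llm (a made-up name) and publishes Open WebUI only on the host's loopback address, on the assumption that a reverse proxy on the host terminates TLS in front of it.

```shell
# Create the network only if it does not exist yet.
ensure_network() {
  docker network inspect "$1" >/dev/null 2>&1 || docker network create "$1"
}

start_private_stack() {
  ensure_network llm || return 1
  # No -p flag: Ollama is reachable only from containers on the llm network.
  docker run -d --gpus all \
    --network llm \
    -v ollama:/root/.ollama \
    --name ollama ollama/ollama || return 1
  # Open WebUI reaches Ollama by container name; published on loopback only.
  docker run -d \
    --network llm \
    -e OLLAMA_BASE_URL=http://ollama:11434 \
    -v open-webui:/app/backend/data \
    -p 127.0.0.1:3000:8080 \
    --name open-webui ghcr.io/open-webui/open-webui:main
}
```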

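For multi-tenant environments, run separate Ollama containers for different teams or projects, each with its own model storage volume and port mapping. This isolates each team's models and configuration. GPU isolation is a separate matter: the NVIDIA Container Toolkit only exposes devices to containers and does not schedule or partition them, so containers given the same GPU compete for its memory and compute. On multi-GPU hosts, assign each container its own device with --gpus device=N to keep one team's usage from affecting another's.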
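A per-team container can be sketched as a function that takes the team name, host port, and GPU index. The naming scheme (ollama-NAME for both container and volume) and the function name start_tenant are assumptions made for this example.

```shell
# start_tenant NAME HOST_PORT GPU_ID
# Runs one Ollama container per team with its own volume, port, and GPU.
start_tenant() {
  name="$1"
  port="$2"
  gpu="$3"
  docker run -d \
    --gpus "device=$gpu" \
    -v "ollama-$name:/root/.ollama" \
    -p "$port:11434" \
    --name "ollama-$name" ollama/ollama
}
```

For instance, start_tenant research 11435 0 followed by start_tenant support 11436 1 gives two teams separate endpoints on separate GPUs.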

Model Storage and Persistence

Model files in Docker should always use named volumes or bind mounts rather than the container's writable layer. Named volumes (-v ollama:/root/.ollama) are managed by Docker and provide the best balance of portability and performance. Bind mounts (-v /host/path:/root/.ollama) map to a specific host directory, which is useful when you want direct filesystem access to model files or when sharing models between Docker and a native Ollama installation.

Model files can be large: a single 70B model takes roughly 40GB of disk space at 4-bit quantization and more at higher precision. Plan your Docker volume storage accordingly, especially on systems where the Docker root directory has limited space. Moving Docker's storage location or using a separate disk for volumes can help manage disk usage on systems with smaller root partitions.

The Docker CLI has no single command for exporting a volume, but backing one up is still straightforward: mount the volume into a temporary container, write its contents to a tar archive, copy the archive to another machine, and unpack it into a fresh volume there. This migrates your entire model library between systems without re-downloading, which is particularly useful for deploying to environments without internet access.
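A common way to archive a volume is a throwaway container that mounts it next to a host directory. The sketch below uses the alpine image for its tar binary; the function names are illustrative, and the archive directory must be an absolute host path for the bind mount to work. Compression is skipped because quantized model weights tend not to compress well.

```shell
# backup_volume VOLUME ABSOLUTE_DIR
# Writes ABSOLUTE_DIR/VOLUME.tar from the volume's contents.
backup_volume() {
  volume="$1"
  archive_dir="$2"
  docker run --rm \
    -v "$volume":/data:ro \
    -v "$archive_dir":/backup \
    alpine tar cf "/backup/$volume.tar" -C /data .
}

# restore_volume VOLUME ABSOLUTE_DIR
# Unpacks ABSOLUTE_DIR/VOLUME.tar into the volume, creating it if needed.
restore_volume() {
  volume="$1"
  archive_dir="$2"
  docker run --rm \
    -v "$volume":/data \
    -v "$archive_dir":/backup:ro \
    alpine tar xf "/backup/$volume.tar" -C /data
}
```

Stopping the Ollama container before a backup avoids archiving a download that is still in progress.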

Production Deployment Patterns

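For production use, combine Ollama in Docker with health monitoring, automatic restarts, and log management. Docker's built-in health checks and restart policies handle basic availability, with one caveat: a restart policy reacts to the container exiting, not to a failing health check. At the time of writing, Ollama does not document a dedicated health or metrics endpoint, so probes typically request the API root or /api/version, and /api/ps reports which models are currently loaded. To integrate with Prometheus, Grafana, or your existing monitoring stack, poll those endpoints with an exporter or script and collect GPU metrics from the host.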
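With plain docker run, health checks and restart policies are set through flags. The following is a sketch with illustrative intervals; it uses ollama list as the probe on the assumption that the ollama binary is the most reliable tool available inside the image.

```shell
# Start Ollama with a restart policy and a health check.
start_ollama_supervised() {
  docker run -d --gpus all \
    --restart unless-stopped \
    --health-cmd "ollama list" \
    --health-interval 30s \
    --health-timeout 10s \
    --health-retries 3 \
    -v ollama:/root/.ollama \
    -p 11434:11434 \
    --name ollama ollama/ollama
}

# Print the health state Docker has recorded for the container.
ollama_health() {
  docker inspect --format '{{.State.Health.Status}}' ollama
}
```

The state reported by ollama_health is one of starting, healthy, or unhealthy, which a monitoring script can poll and alert on.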

Pre-warming models at startup ensures consistent response latency from the first request. Add an initialization step to your deployment that sends a simple request to each model you expect to serve, forcing Ollama to load them into GPU memory before user traffic arrives. This works best as a one-off step that runs once the container reports healthy; a Docker health check can then confirm that the models remain loaded, for example by inspecting the output of ollama ps.
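The initialization step can be sketched as a loop over model names. It relies on the documented behavior that a generate request without a prompt loads the model into memory, and it sets keep_alive to -1 so the model stays loaded; the function name and the OLLAMA_URL variable are assumptions for this example.

```shell
# Base URL of the Ollama API; override it for remote hosts.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# prewarm_models MODEL...
# Loads each model and returns non-zero if any of them failed to load.
prewarm_models() {
  failed=0
  for model in "$@"; do
    # A generate request with no prompt loads the model without producing text.
    if curl -fsS "$OLLAMA_URL/api/generate" \
         -d "{\"model\": \"$model\", \"keep_alive\": -1}" >/dev/null; then
      echo "loaded: $model"
    else
      echo "failed: $model" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

A deployment script would call prewarm_models qwen3:14b after the pull step and treat a non-zero exit status as a failed rollout.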

For horizontal scaling beyond a single machine, deploy Ollama containers across multiple GPU-equipped hosts and use a load balancer to distribute requests. Each host runs its own Ollama instance with its own model storage, and the load balancer distributes requests according to current load. A generic load balancer does not know which models each host holds, so either pull the same models on every host or add routing by model name at the proxy layer. This pattern works well for organizations with multiple GPU servers that want to provide a unified API endpoint for local model inference.
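As one possible load balancer setup, the sketch below generates an nginx configuration that spreads requests across hosts using the least_conn method. The host names passed in, the listen port 8080, and the timeout are placeholders; it assumes every host serves the same set of models and that the output is included inside nginx's http block.

```shell
# render_upstream HOST...
# Prints an nginx config that balances across the given Ollama hosts.
render_upstream() {
  echo "upstream ollama_backends {"
  echo "    least_conn;"
  for host in "$@"; do
    echo "    server $host:11434;"
  done
  echo "}"
  echo "server {"
  echo "    listen 8080;"
  echo "    location / {"
  echo "        proxy_pass http://ollama_backends;"
  # Long generations need a generous timeout; streaming needs buffering off.
  echo "        proxy_read_timeout 300s;"
  echo "        proxy_buffering off;"
  echo "    }"
  echo "}"
}
```

For example, render_upstream gpu-host-1 gpu-host-2 > ollama.conf writes a file that can be placed in nginx's conf.d directory.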

Key Takeaway

Docker is the recommended deployment method for team and production Ollama setups. Use named volumes for model persistence, the NVIDIA Container Toolkit for GPU access, and Docker Compose to orchestrate Ollama with companion services like Open WebUI and reverse proxies.