Ollama Deep Dive: Architecture and Features
Architecture Overview
Ollama is built as a Go application that manages and coordinates llama.cpp, the C++ inference engine that handles the actual model execution. When you run ollama serve, Ollama starts a background daemon that listens on port 11434 for incoming API requests. When a request arrives, the daemon loads the requested model (if not already in memory), passes the prompt to the llama.cpp engine, and streams tokens back to the client.
The separation between Ollama (the orchestrator) and llama.cpp (the engine) is key to understanding its capabilities and limitations. Ollama handles model downloading, storage, configuration, and API serving. llama.cpp handles tokenization, model loading, GPU memory management, quantized matrix multiplication, attention computation, and token sampling. Ollama adds convenience and usability on top of raw inference performance.
Models are stored in a Docker-like layered format. When you run ollama pull llama3.2, Ollama downloads the model as a series of blobs (binary large objects) and stores them in ~/.ollama/models/. A manifest file maps model names and tags to specific blob hashes. This layered approach means that models sharing the same base weights (such as different quantizations of the same model) share common layers on disk.
The Model Library
Ollama maintains a curated model library at ollama.com/library, offering pre-configured models from all major model families. Each model comes in multiple sizes and quantization levels. Running ollama pull llama3.1:70b-instruct-q4_K_M downloads the 70B parameter Llama 3.1 model with Q4_K_M quantization specifically for instruction following. If you just run ollama pull llama3.1, you get the default configuration (typically the instruct variant at a balanced quantization level).
The library covers Llama, Mistral, Gemma, Phi, CodeGemma, DeepSeek, Qwen, and dozens of other model families. Community members can publish their own models and fine-tuned variants. The Modelfile system (similar to a Dockerfile) lets you create custom model configurations with specific system prompts, temperature settings, and parameter overrides.
API and Integration
Ollama exposes two API formats. The native Ollama API uses endpoints like /api/generate and /api/chat, while the OpenAI-compatible API uses /v1/chat/completions. The OpenAI compatibility layer means most client libraries and applications written for OpenAI can connect to Ollama by changing the base URL to http://localhost:11434/v1.
Both APIs support streaming responses (server-sent events), system prompts, multi-turn conversations, and JSON mode for structured output. Version 0.17.5 (March 2026) added streaming tool calls, allowing models to call external functions mid-generation, and thinking model support for chain-of-thought reasoning models.
For embedding generation, Ollama provides a /api/embed endpoint that returns vector embeddings from models like nomic-embed-text, useful for building local RAG (Retrieval Augmented Generation) pipelines without any cloud dependency.
GPU and Memory Management
Ollama automatically detects available GPUs (NVIDIA via CUDA, AMD via ROCm, Apple Silicon via Metal) and configures llama.cpp accordingly. On systems with an NVIDIA GPU, Ollama offloads as many model layers as will fit in VRAM, falling back to CPU for the rest. This hybrid CPU/GPU inference means you can run models larger than your VRAM, though with reduced speed for the CPU-processed layers.
Memory management follows a keep-alive pattern. By default, Ollama keeps a loaded model in memory for 5 minutes after the last request. This means the first request to a model takes longer (model loading), but subsequent requests respond immediately. You can configure the keep-alive duration or set it to keep models loaded indefinitely.
When multiple models are requested simultaneously, Ollama manages memory by loading and unloading models as needed. On systems with sufficient memory, it can keep multiple models loaded concurrently. On constrained systems, it evicts the least recently used model to make room for the requested one.
Multimodal Capabilities
Ollama supports vision models that can process images alongside text. Models like LLaVA, Llama 4 Scout, and Mistral Small 4 accept image inputs through the API. You pass images as base64-encoded data in the prompt, and the model processes both the text and visual content together. This enables local image analysis, document understanding, and visual question answering without any cloud service.
Practical Limitations
Ollama is optimized for simplicity and single-user workloads. Its primary limitation is concurrency. Under a single user or a small team (up to 3-5 concurrent users), Ollama performs well. Beyond that, request queuing and latency degradation become significant. The root cause is that llama.cpp processes requests sequentially (or with limited parallelism), unlike production servers like vLLM that use continuous batching to handle dozens or hundreds of concurrent requests efficiently.
Another limitation is the lack of advanced serving features: no built-in load balancing across multiple GPUs on different machines, no automatic scaling, no request-level priority queuing, and limited observability tooling. These features exist in production inference servers but are outside the scope of what Ollama aims to provide.
Ollama is the best tool for getting started with local LLMs and for single-user or small-team workloads. Its strength is simplicity. When you outgrow it, the migration path to vLLM or other production servers is straightforward because both use OpenAI-compatible APIs.