What Is Ollama and How Does It Work
Ollama in Plain Terms
Think of Ollama as Docker for AI models. Just as Docker lets you pull and run containerized applications with a single command, Ollama lets you pull and run language models the same way. You type ollama run llama4 and the tool downloads the model from the Ollama library if it is not already on disk, loads it into memory, and opens an interactive chat session. In most setups there are no configuration files to write and no dependencies to manage, and a supported GPU is used automatically as long as its drivers are installed.
Ollama was created to solve a specific problem: running open source language models locally was unnecessarily complicated. Before Ollama, setting up a local model required downloading model weights manually, converting them to the right format, configuring inference parameters, setting up GPU offloading, and writing code to interact with the model. Ollama reduces all of that to a single command and a background API server.
The project is open source under the MIT license, meaning anyone can use, modify, and distribute it freely. It supports macOS, Linux, and Windows, and runs on hardware ranging from a laptop CPU to a multi-GPU server, provided the machine has enough memory for the chosen model. The Ollama team maintains the core application and the model library, while a large community contributes models, integrations, and client libraries for various programming languages.
Core Architecture
Ollama consists of three main components: a model manager, an inference engine, and an API server. The model manager handles downloading, storing, and organizing models on your local disk. It uses a layered storage system similar to Docker images, where models that share a common base can share layers and reduce total disk usage. Models are stored in the GGUF format, a compact binary format designed for quantized model weights.
The inference engine is built on llama.cpp, a high-performance C++ library for running transformer-based language models. llama.cpp supports a wide range of hardware accelerators, including NVIDIA GPUs through CUDA, AMD GPUs through ROCm, Apple Silicon through Metal, and Intel GPUs through SYCL. Ollama automatically detects the hardware it supports on your machine and configures the inference engine to use the fastest available option, falling back to the CPU when no supported GPU is found.
The API server runs as a background process on port 11434 by default. It provides REST endpoints for generating text, conducting conversations, creating embeddings, and managing models. The server handles model loading and unloading, keeping recently used models in memory for fast response times and automatically unloading them after a configurable idle period to free resources.
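As a minimal sketch of talking to that server, the Python below builds and sends a non-streaming request to the /api/generate endpoint using only the standard library. It assumes the server is reachable at the default port mentioned above; the function names are illustrative and not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default port named in the text


def build_generate_request(model: str, prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build a POST request for /api/generate (nothing is sent yet)."""
    payload = {"model": model, "prompt": prompt, "stream": stream}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send a non-streaming request and return the generated text."""
    request = build_generate_request(model, prompt, stream=False)
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With the server running and the model already pulled, generate("llama4", "Explain GGUF in one sentence.") would return the model's reply as a string.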
Model Format and Quantization
Ollama uses the GGUF file format (commonly expanded as GPT-Generated Unified Format) for all its models. GGUF was designed for efficient local inference and typically stores model weights in a quantized form that dramatically reduces memory requirements compared to the original training format. An 8B-parameter model that needs about 32GB at 32-bit precision, or 16GB at 16-bit, might need only 5 to 8GB after quantization.
Quantization works by reducing the numerical precision of model parameters. Instead of storing each weight as a 16-bit or 32-bit floating point number, quantization converts them to lower-precision representations like 4-bit or 8-bit integers. The most commonly used quantization level is Q4_K_M, which provides good output quality at a little under 5 bits per weight on average. That cuts memory requirements by roughly 70% compared to 16-bit weights and by about 85% compared to 32-bit weights. Higher-precision quantization levels like Q5_K_M and Q8_0 preserve more quality at the cost of larger file sizes and higher memory usage.
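To make the idea concrete, here is a deliberately simplified sketch of block quantization: one floating point scale is shared by a block of weights, and each weight is stored as a small signed integer. This is not the actual Q4_K_M scheme, which is more elaborate; the function names are hypothetical.

```python
def quantize_block(weights: list[float], bits: int = 4) -> tuple[float, list[int]]:
    """Quantize one block of weights to signed integers sharing a single scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed values
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0
    # Round each weight to the nearest integer step and clamp to the valid range.
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, quantized


def dequantize_block(scale: float, quantized: list[int]) -> list[float]:
    """Recover approximate weights from the stored scale and integers."""
    return [q * scale for q in quantized]
```

Each recovered weight differs from the original by at most half a quantization step, which is the precision traded away for the smaller storage footprint.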
Each model in the Ollama library typically comes in multiple quantization variants, letting you choose the right balance for your hardware. If you have 8GB of VRAM, you can run a Q4_K_M quantized 8B model comfortably. With 24GB, you can run a Q4_K_M quantized 14B model with room to spare, or step up to Q8_0 for the same model if quality is more important than available headroom.
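The sizing examples above follow from simple arithmetic: parameter count times bits per weight. The sketch below is a rough estimator only. The bits-per-weight figures are approximate averages, and the overhead_gb allowance for the context cache and runtime buffers is an assumed placeholder, not a value taken from Ollama.

```python
# Approximate average bits per weight for each format (assumed, rounded values).
BITS_PER_WEIGHT = {"F32": 32.0, "F16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85}


def estimate_weight_gb(params_billion: float, quant: str) -> float:
    """Estimate weight storage in GB: billions of parameters * bits / 8."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits_in_vram(params_billion: float, quant: str, vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """Check whether weights plus an assumed fixed overhead fit in the given VRAM."""
    return estimate_weight_gb(params_billion, quant) + overhead_gb <= vram_gb
```

Under these assumptions an 8B model at Q4_K_M comes to about 4.85GB of weights, which fits an 8GB card, while a 14B model at Q8_0 comes to roughly 15GB and fits in 24GB.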
The Modelfile System
Modelfiles are Ollama's equivalent of Dockerfiles. They let you create custom model configurations by specifying a base model, setting inference parameters, and defining system prompts. A Modelfile is a plain text file with a simple declarative syntax where each line sets a different aspect of the model's behavior.
A typical Modelfile starts with a FROM directive that specifies the base model, followed by PARAMETER directives for settings like temperature, context window size, and repetition penalty, and a SYSTEM directive for the system prompt. Running ollama create mymodel -f Modelfile builds a new named model that you can use like any other, with your custom parameters baked in.
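As an illustration, the Python below renders a small Modelfile as text from the three directive types just described. The helper name, the system prompt, and the specific parameter values are made up for the example; temperature, num_ctx, and repeat_penalty are the Modelfile parameter names for the settings mentioned above.

```python
def render_modelfile(base_model: str, system_prompt: str, parameters: dict) -> str:
    """Return Modelfile text: a FROM line, PARAMETER lines, then a SYSTEM line."""
    lines = [f"FROM {base_model}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    lines.append(f'SYSTEM """{system_prompt}"""')
    return "\n".join(lines) + "\n"


# Example values only: a low temperature suits a code-focused assistant.
coding_assistant = render_modelfile(
    base_model="llama4",
    system_prompt="You are a careful code reviewer. Prefer small, readable changes.",
    parameters={"temperature": 0.2, "num_ctx": 4096, "repeat_penalty": 1.1},
)
```

Saving the resulting text to a file named Modelfile and running ollama create mymodel -f Modelfile would then build the custom model.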
This system is particularly useful for creating task-specific model variants. You might create one Modelfile for a coding assistant with a lower temperature and a system prompt focused on code quality, another for a creative writing assistant with a higher temperature, and a third for a customer support bot with specific domain knowledge in the system prompt. Each configuration can use the same underlying base model but behave differently based on the Modelfile settings.
How Requests Flow Through Ollama
When you send a request to the Ollama API, several things happen in sequence. First, the server checks whether the requested model is currently loaded in memory. If not, it loads the model from disk, placing as many layers as fit into GPU memory and keeping any overflow in system RAM. This loading step typically takes a few seconds on the first request, longer for large models or slow disks, but does not repeat for subsequent requests while the model stays loaded.
Once the model is ready, Ollama tokenizes your input text into a sequence of token IDs using the model's tokenizer. These tokens are processed through the model's transformer layers, generating new tokens one at a time in an autoregressive loop. Each generated token feeds back into the model as input for generating the next token, continuing until the model produces a stop token or reaches the maximum output length.
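The loop described here can be sketched in a few lines. In this toy version the transformer is replaced by a stand-in function that maps the current token sequence to the next token ID; STOP_TOKEN and countdown_model are hypothetical names used only for illustration.

```python
from typing import Callable

STOP_TOKEN = 0  # hypothetical ID standing in for the model's stop token


def generate_tokens(
    prompt_ids: list[int],
    next_token: Callable[[list[int]], int],
    max_new_tokens: int,
) -> list[int]:
    """Autoregressive loop: append each new token and feed the sequence back in."""
    sequence = list(prompt_ids)
    generated = []
    for _ in range(max_new_tokens):
        token = next_token(sequence)
        if token == STOP_TOKEN:
            break  # the model signalled the end of its response
        generated.append(token)
        sequence.append(token)  # the new token becomes part of the next input
    return generated


def countdown_model(sequence: list[int]) -> int:
    """Toy stand-in for a transformer: emits the last token minus one."""
    return max(sequence[-1] - 1, STOP_TOKEN)
```

Both exit conditions from the text appear here: the loop ends when the stand-in model emits the stop token, or when max_new_tokens is reached first.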
For streaming requests, each token is sent back to the client as soon as it is generated. For non-streaming requests, all tokens accumulate in a buffer and the complete response is returned as a single JSON object once generation finishes. The streaming mode provides faster perceived response times since users see the output building incrementally rather than waiting for the entire response.
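A client consuming the streaming mode might look like the sketch below. It assumes the stream arrives as newline-delimited JSON objects, each carrying a "response" text fragment and a "done" flag on the last one, which is how the /api/generate endpoint formats its streamed output; the function name is illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON stream until done."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break


# Simulated stream, standing in for the lines of a real HTTP response body.
sample_stream = [
    b'{"response": "Hel", "done": false}\n',
    b'{"response": "lo", "done": false}\n',
    b'{"response": "", "done": true}\n',
]
full_text = "".join(iter_stream_text(sample_stream))
```

Because a urllib HTTP response can be iterated line by line, the same function could be fed a live response object, printing each fragment as it arrives instead of joining them at the end.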
After the request completes, the model stays loaded in memory for a configurable duration (defaulting to 5 minutes) so that subsequent requests can be served immediately without the loading delay. If no new requests arrive within that window, the model is unloaded to free memory for other models or system processes.
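The keep-alive behavior can be modeled with a small cache that tracks when each model was last used. This is a toy simulation under simplifying assumptions, not Ollama's implementation: timestamps are passed in explicitly so the logic is easy to follow, and the class and method names are hypothetical.

```python
class ModelCache:
    """Toy model cache that unloads entries idle longer than keep_alive_seconds."""

    def __init__(self, keep_alive_seconds: float = 300.0):  # 5 minutes, as in the text
        self.keep_alive_seconds = keep_alive_seconds
        self.last_used = {}  # model name -> time of its most recent request

    def evict_idle(self, now: float) -> None:
        """Drop every model whose last request is older than the keep-alive window."""
        expired = [name for name, used in self.last_used.items()
                   if now - used > self.keep_alive_seconds]
        for name in expired:
            del self.last_used[name]

    def request(self, model: str, now: float) -> bool:
        """Record a request; return True if the model had to be loaded first."""
        self.evict_idle(now)
        needed_load = model not in self.last_used
        self.last_used[model] = now
        return needed_load
```

A request at time 0 loads the model, a second request at 200 seconds reuses it, and a request at 501 seconds finds it already unloaded because more than 300 idle seconds have passed.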
GPU Acceleration and Memory Management
Ollama's performance depends heavily on whether the model fits in GPU memory. When a model fits entirely in VRAM, all computation happens on the GPU, and generation can reach roughly 40 to 80 tokens per second on modern consumer GPUs, depending on model size and quantization. When a model exceeds available VRAM, Ollama splits it between GPU and CPU, with the GPU handling as many layers as possible and the CPU handling the rest. This split mode works, but overall generation slows noticeably because every token must also pass through the CPU-resident layers.
The tool automatically determines the split based on your available VRAM and the model's memory requirements. You do not need to manually configure layer counts or memory limits, though advanced users can override the defaults, for example with the num_gpu option that sets how many layers go to the GPU, or through environment variables. On Apple Silicon Macs, the unified memory architecture lets the GPU address most of the system RAM directly, with a portion held back for the operating system, making these machines particularly capable for running larger models.
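The split decision can be approximated with simple arithmetic. The sketch below assumes every layer is the same size and reserves a fixed amount of VRAM for the context cache and buffers. Both are simplifications, reserved_gb is an assumed placeholder, and this is not how Ollama itself computes the split.

```python
def plan_gpu_layers(
    total_layers: int,
    layer_size_gb: float,
    free_vram_gb: float,
    reserved_gb: float = 1.0,
) -> tuple[int, int]:
    """Return (gpu_layers, cpu_layers) for a model with equally sized layers."""
    usable = max(free_vram_gb - reserved_gb, 0.0)
    # Place as many whole layers on the GPU as the usable VRAM allows.
    gpu_layers = min(total_layers, int(usable // layer_size_gb))
    return gpu_layers, total_layers - gpu_layers
```

For a hypothetical 48-layer model with 0.5GB layers and 12GB of free VRAM, this places 22 layers on the GPU and leaves 26 for the CPU.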
For multi-GPU setups, Ollama can distribute model layers across multiple cards. This lets you run models that would not fit on a single GPU by combining the VRAM of two or more cards. The overhead of cross-GPU communication adds some latency, but for models that would otherwise require CPU offloading, multi-GPU distribution provides substantially better performance.
Why Developers Choose Ollama
The primary appeal of Ollama is its combination of simplicity and capability. It provides the easiest on-ramp to local model inference while still offering enough flexibility for serious development work. A developer can go from zero to a working local AI assistant in under five minutes, and from there can build production integrations using the REST API or one of the many client libraries available for Python, JavaScript, Go, Rust, and other languages.
Privacy is another strong motivator. When you run a model locally with Ollama, inference happens entirely on your machine: every prompt, every response, and every embedding is processed locally, and the network is needed only to download models. This makes Ollama a natural choice for applications that handle sensitive data in regulated industries, proprietary codebases, personal information, or any scenario where sending data to a third-party server is unacceptable.
The elimination of API costs also matters, particularly for development workflows that involve many model calls. Prototyping, testing, debugging, and iterating on prompts can generate thousands of requests per day. With cloud APIs, this creates real expense. With Ollama, the only cost is the electricity to run your hardware, and there are no rate limits to slow you down.
Ollama simplifies local AI model inference by wrapping the llama.cpp engine with an intuitive CLI and REST API, making it possible to run powerful language models on your own hardware with a single command and zero configuration.