vLLM: High-Throughput Local Model Serving
Core Innovation: PagedAttention
Traditional LLM inference engines allocate GPU memory for the KV (key-value) cache in large contiguous blocks. Since the final sequence length is unknown at the start of generation, engines typically allocate for the maximum possible context length. This wastes enormous amounts of memory on sequences that end up being much shorter than the maximum.
PagedAttention, the core technology behind vLLM, borrows the concept of virtual memory paging from operating systems. Instead of allocating one large contiguous block per sequence, it divides the KV cache into small fixed-size pages (typically 16 tokens each) and allocates pages on demand as the sequence grows. Pages can be stored non-contiguously in GPU memory, just like virtual memory pages can be stored anywhere in physical RAM.
This approach reduces memory waste by 50% or more compared to traditional allocation. That recovered memory translates directly into capacity: the same GPU can serve more concurrent requests, hold a larger model, or maintain longer context windows. On benchmarks, PagedAttention enables 2-4x higher throughput compared to naive memory management at the same hardware cost.
Continuous Batching
Ollama and simple inference servers process one request at a time (or a small fixed batch). If 10 requests arrive simultaneously, most wait in a queue. vLLM uses continuous batching (also called iteration-level scheduling) to interleave multiple requests within a single GPU computation step.
When a request finishes generating, its GPU resources are immediately freed and assigned to the next waiting request, without waiting for the entire batch to complete. This means the GPU is almost never idle, maximizing utilization. Under high concurrency, this is the primary reason vLLM achieves 10-16x higher throughput than single-request servers.
Tensor Parallelism
For models that exceed a single GPU memory capacity, vLLM supports tensor parallelism across multiple GPUs. The model layers are split across GPUs, with each GPU computing its portion and communicating intermediate results via NVLink or PCIe. Starting a 70B model across two GPUs requires simply setting the tensor parallelism degree to 2 in the launch command.
Pipeline parallelism, where different model layers run on different GPUs, is also supported for cases where tensor parallelism alone is not sufficient. This is relevant for the largest models (200B+ parameters) that may span 4-8 GPUs.
OpenAI-Compatible API
vLLM serves an OpenAI-compatible API at the /v1/chat/completions and /v1/completions endpoints. Any client library written for OpenAI, and most LLM application frameworks, can connect to vLLM by changing the base URL. This makes migration from Ollama (which also supports this API) or from cloud APIs straightforward.
The API supports streaming, function calling, JSON mode, and most parameters from the OpenAI specification. vLLM also provides a native API with additional features like guided decoding (constraining output to match a JSON schema or regular expression), logprob access, and detailed generation statistics.
Performance Benchmarks
On NVIDIA Blackwell B200 GPUs running Llama 3.1 70B with NVFP4 quantization, vLLM achieves approximately 8,033 tokens per second with a time-to-first-token of 10.7ms. By comparison, Ollama on the same hardware achieves 484 tokens per second with 65ms TTFT. This 16.6x throughput advantage reflects the combined impact of PagedAttention, continuous batching, and CUDA-optimized kernels.
The advantage grows with concurrency. At 1 concurrent request, vLLM and Ollama perform similarly. At 10 concurrent requests, vLLM is roughly 5x faster. At 50+ concurrent requests, the gap widens to 10-20x because Ollama queues requests while vLLM batches them.
When to Choose vLLM Over Ollama
Choose vLLM when: you serve more than 5 concurrent users, you need to maximize throughput for batch processing, you require tensor parallelism across multiple GPUs, or you need production features like guided decoding and detailed metrics. Choose Ollama when: you are a single developer or small team, you value simplicity over throughput, you run on Apple Silicon (vLLM requires NVIDIA GPUs), or you want an all-in-one tool that handles model downloading and management.
vLLM is the production standard for self-hosted LLM serving. PagedAttention and continuous batching deliver order-of-magnitude throughput improvements over simpler servers. Use it whenever you outgrow Ollama single-user performance limits.