How to Serve Local Models via API

Updated May 2026
Serving a local LLM via API means running an inference server that exposes HTTP endpoints compatible with the OpenAI API format. This lets any application, framework, or tool that supports OpenAI connect to your local model by changing a single URL. Both Ollama and vLLM provide this capability with different tradeoffs.

Step 1: Choose Your Server

Ollama is the right choice for development, personal use, and small teams (up to 3-5 concurrent users). It handles model downloading, GPU detection, and memory management automatically. Setup takes one command. The tradeoff is limited concurrency and throughput.

vLLM is the right choice for production deployments serving many concurrent users. It provides 10-16x higher throughput through PagedAttention and continuous batching. The tradeoff is more complex setup, NVIDIA GPU requirement, and no built-in model management (you download models separately from Hugging Face).

Both servers speak the OpenAI API protocol, so your application code is the same regardless of which server you use. You can develop with Ollama and deploy with vLLM without changing your application.

Step 2: Install and Start

Ollama Setup

Install Ollama from ollama.com (available for macOS, Linux, and Windows). Once installed, pull a model: ollama pull llama3.1:8b. The server starts automatically and listens on port 11434. The OpenAI-compatible endpoint is at http://localhost:11434/v1/chat/completions.

To run Ollama as a persistent service on Linux, use systemd. The Ollama installer typically creates a systemd service automatically. On macOS, Ollama runs as a menu bar application that starts at login.

vLLM Setup

Install vLLM via pip: pip install vllm. Start the server with: vllm serve meta-llama/Llama-3.1-8B-Instruct. vLLM downloads the model from Hugging Face on first run (requires a Hugging Face account and model access approval for gated models like Llama). The OpenAI-compatible endpoint runs on port 8000 by default.

For multi-GPU setups, add the tensor parallelism flag: vllm serve model-name --tensor-parallel-size 2. This distributes the model across two GPUs automatically.

Step 3: Test the API

Test with a curl command or any HTTP client. Send a POST request to the chat completions endpoint with a messages array containing the conversation. The response follows the same format as the OpenAI API, with choices containing the generated message and usage statistics.

Verify streaming works by adding "stream": true to the request body. The server should return server-sent events with incremental token chunks, identical to OpenAI streaming behavior.

Step 4: Connect Your Application

Most LLM client libraries support custom base URLs. In the OpenAI Python library, set base_url="http://localhost:11434/v1" for Ollama or base_url="http://localhost:8000/v1" for vLLM. The api_key parameter is required by the client library but can be set to any string since local servers do not authenticate.

Frameworks like LangChain, LlamaIndex, AutoGen, and CrewAI all support custom OpenAI endpoints. Typically, you set an environment variable (OPENAI_API_BASE) or pass the base URL in the model configuration. No framework-level code changes are needed.

Step 5: Production Configuration

Process management: Run the inference server under a process manager (systemd, supervisord, or Docker) that automatically restarts it after crashes or reboots. Monitor memory usage and GPU utilization to detect issues early.

Reverse proxy: Place nginx or Caddy in front of the inference server to handle TLS termination, request logging, and basic rate limiting. This also lets you expose the API on standard ports (443) while the inference server runs on its internal port.

Security: If the API is accessible beyond localhost, add authentication. The simplest approach is an API key checked by the reverse proxy. Do not expose an unauthenticated inference endpoint to the internet, as anyone could use your GPU resources.

Monitoring: Track request latency, tokens per second, GPU memory usage, and error rates. vLLM provides Prometheus metrics at the /metrics endpoint. For Ollama, monitor system-level GPU metrics using nvidia-smi or similar tools.

Key Takeaway

Setting up a local LLM API endpoint takes minutes with Ollama or slightly longer with vLLM. Both produce OpenAI-compatible endpoints, meaning your application code works identically whether targeting a local model or a cloud API. Develop locally, deploy to production by switching the URL.