Ollama API: Endpoints and Integration

Updated May 2026

The Ollama REST API runs on http://localhost:11434 and provides endpoints for text generation, chat conversations, embedding creation, and model management. It also exposes an OpenAI-compatible endpoint at /v1 that works with most libraries and tools designed for the OpenAI API, making integration with existing codebases straightforward.

Generation Endpoints

The POST /api/generate endpoint handles single-turn text completion. You send a JSON body with a model field specifying which model to use and a prompt field containing your input text. The response includes the generated text in the response field, along with metadata like total token counts and generation duration. By default, responses stream token by token as newline-delimited JSON objects. Set "stream": false to receive the complete response as a single JSON object.

The POST /api/chat endpoint handles multi-turn conversations with message history. Instead of a prompt field, you provide a messages array containing objects with role (system, user, or assistant) and content fields. This format matches the OpenAI chat completions API, making it familiar to developers who have worked with cloud APIs. The endpoint maintains no server-side conversation state; you must send the full message history with each request.

Both endpoints accept optional parameters that control generation behavior. temperature (default 0.8) controls randomness, with lower values producing more deterministic output. top_p and top_k provide alternative sampling strategies. num_predict sets the maximum number of tokens to generate. stop accepts an array of strings that trigger generation to halt when encountered in the output. seed enables reproducible generation when combined with a fixed temperature.

The options object within the request body lets you override model-level parameters for individual requests. This includes num_ctx for context window size, repeat_penalty for reducing repetitive output, num_gpu for controlling GPU layer allocation, and many other parameters that correspond to the underlying llama.cpp configuration.

Embedding Endpoint

The POST /api/embed endpoint generates vector embeddings from text input. Send a JSON body with the model field and an input field containing either a single string or an array of strings for batch processing. The response includes an embeddings array with one vector per input string, where each vector is an array of floating point numbers.

Batch processing is the recommended approach for embedding large document collections. Instead of making one API call per document, group multiple texts into a single request. The model processes them together more efficiently than handling individual requests sequentially, significantly reducing total processing time for large datasets.

The older POST /api/embeddings endpoint (note the plural) still works but is considered deprecated. It accepts a single prompt string and returns a single embedding vector. New integrations should use the /api/embed endpoint for its batching support and improved feature set.

Common embedding models on Ollama include nomic-embed-text (768 dimensions), mxbai-embed-large (1024 dimensions), and snowflake-arctic-embed (various dimensions). Choose based on your retrieval accuracy requirements and vector storage constraints. Higher dimensions generally provide better semantic discrimination at the cost of larger storage requirements.

Model Management Endpoints

GET /api/tags returns a list of all locally installed models with their names, sizes, modification dates, and digest hashes. This is the programmatic equivalent of ollama list and is useful for building model selection interfaces or monitoring which models are available on a system.

POST /api/pull downloads a model from the Ollama library. Send {"name": "qwen3:14b"} to download a specific model variant. The endpoint streams progress updates during download, reporting bytes transferred and total size. Set "stream": false for a single response when the download completes.

DELETE /api/delete removes a model from local storage. Send {"name": "model-name"} to delete a specific model and free its disk space. This is the programmatic equivalent of ollama rm.

POST /api/show returns detailed metadata about a model, including its Modelfile definition, parameter settings, template format, and license text. This endpoint is useful for inspecting model configurations and verifying parameter settings without examining the Modelfile directly.

POST /api/create builds a new model from a Modelfile specification. Send the Modelfile content in the modelfile field and a name for the new model in the name field. This is the programmatic equivalent of ollama create.

POST /api/copy duplicates an existing model under a new name. GET /api/ps shows currently loaded models with their memory usage and processing state. These endpoints round out the model lifecycle management capabilities of the API.

OpenAI-Compatible Endpoint

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. This endpoint implements the same request and response format as the OpenAI Chat Completions API, making it possible to use Ollama with any library, framework, or tool that supports the OpenAI API by simply changing the base URL.

The compatible endpoint supports /v1/chat/completions for chat, /v1/completions for text completion, /v1/models for listing available models, and /v1/embeddings for generating embeddings. Request parameters like temperature, max_tokens, top_p, stop sequences, and stream are mapped to their Ollama equivalents automatically.

To use this with the OpenAI Python library, set the base URL to http://localhost:11434/v1 and the API key to any non-empty string (Ollama ignores the key but the library requires one). Then use the standard OpenAI client methods with Ollama model names. This pattern works with LangChain, LlamaIndex, AutoGen, and most other frameworks that integrate with the OpenAI API.

Not every OpenAI API feature is supported through the compatible endpoint. Function calling works with models that support it, but advanced features like logprobs, response format enforcement, and some fine-tuning capabilities are not available. For features specific to Ollama, like Modelfile management and the embed endpoint, use the native Ollama API directly.

Streaming and Error Handling

Streaming responses are the default for generation endpoints. Each token is sent as a separate JSON object followed by a newline, enabling real-time display of generation output. The final object in the stream includes "done": true along with summary statistics like total duration, token counts, and tokens per second.

Error responses use standard HTTP status codes. A 404 indicates the requested model is not installed. A 400 indicates a malformed request. A 500 indicates an internal error, typically related to model loading or GPU memory issues. Error responses include a JSON body with an error field containing a human-readable description.

For robust integrations, implement retry logic with exponential backoff for transient errors (500 status codes) and handle model-not-found errors (404) by triggering automatic model pulls. This pattern is especially useful for applications that use multiple models and may encounter situations where a model has been removed or not yet downloaded.

Server Configuration

The Ollama API server is configured primarily through environment variables. OLLAMA_HOST controls the listen address and port (default 127.0.0.1:11434). Set this to 0.0.0.0:11434 to accept connections from other machines on your network, or to a different port if 11434 conflicts with another service.

OLLAMA_ORIGINS controls CORS (Cross-Origin Resource Sharing) for browser-based applications. By default, Ollama allows requests from localhost origins. Set this to a comma-separated list of allowed origins, or to * to allow requests from any origin. This is necessary when building web applications that call the Ollama API from JavaScript running in a browser.

OLLAMA_MODELS sets the directory where model files are stored (default is ~/.ollama/models). Change this to a different directory or disk if your default storage location has limited space. The directory must exist and be writable by the Ollama process.

Key Takeaway

The Ollama API provides both native endpoints and OpenAI-compatible endpoints for maximum integration flexibility. Use the native endpoints for full feature access, and the /v1 compatible endpoint to integrate with existing tools and libraries designed for the OpenAI API.

Generation Endpoints

Embedding Endpoint

Model Management Endpoints

OpenAI-Compatible Endpoint

Streaming and Error Handling

Server Configuration

Related Articles

How to Use Ollama with Python

How to Run Embeddings with Ollama

Using Ollama with AI Agent Systems

Running Ollama in Docker

AI Tool Calling