How to Use Ollama with Python

Updated May 2026
The official Ollama Python client library provides a clean, Pythonic interface for generating text, managing chat conversations, creating embeddings, and controlling models programmatically. It handles connection management, streaming, and response parsing so you can focus on building your application logic rather than managing HTTP requests and JSON formatting.

Python is the most popular language for AI development, and Ollama's Python library makes local model integration feel native. The library mirrors the REST API closely but adds conveniences like automatic streaming iteration, typed response objects, and a consistent interface across all endpoints. Whether you are building a chatbot, a content pipeline, or a code analysis tool, the Python client gives you full access to every Ollama capability.

Install the Ollama Python Library

Install the official package from PyPI with pip install ollama. The package has minimal dependencies and works with Python 3.8 and later. For projects using virtual environments, activate your environment first, then run the install command. The package name is simply ollama, matching the tool itself.

After installation, verify the setup by opening a Python shell and running import ollama followed by ollama.list(). This should return a list of models installed on your system, confirming that the library can communicate with the Ollama server. Make sure the Ollama server is running before executing any library calls, as the client connects to http://localhost:11434 by default.

For projects that need to connect to a remote Ollama server or a non-default port, create a custom client instance with ollama.Client(host='http://your-server:11434'). The custom client provides the same methods as the module-level functions but directs requests to the specified host. This is useful for development teams running a shared Ollama server or for applications that separate the model server from the application server.

Generate Text with ollama.generate()

The ollama.generate() function handles single-turn text generation. Pass a model name and a prompt string, and it returns a response object containing the generated text, token counts, and timing information. For example, response = ollama.generate(model='qwen3:14b', prompt='Explain recursion in Python') sends the prompt to Qwen3 14B and stores the complete response.

Access the generated text through response['response']. The response dictionary also includes total_duration (total processing time in nanoseconds), eval_count (number of tokens generated), eval_duration (time spent generating tokens), and prompt_eval_count (number of tokens in the prompt). These metrics are useful for monitoring performance and estimating costs in production applications.

Control generation behavior by passing additional parameters. Set options={'temperature': 0.2, 'num_ctx': 8192} for deterministic output with a larger context window. Use options={'num_predict': 500} to limit the response length. The options dictionary accepts any parameter that the underlying model supports, giving you fine-grained control over generation without creating custom Modelfiles.

Manage Conversations with ollama.chat()

The ollama.chat() function handles multi-turn conversations using a messages array. Each message is a dictionary with role (system, user, or assistant) and content fields. This format matches the OpenAI chat convention, making it familiar to developers who have worked with cloud APIs. Pass the full conversation history with each call, as the server does not maintain session state.

Build a conversation by maintaining a messages list in your application. Start with an optional system message that sets the model's behavior, add the user's input as a user message, call ollama.chat(), then append the assistant's response to the list. For the next turn, add the new user message and call chat again with the full history. This pattern gives you complete control over conversation management and context.

A basic interactive chatbot requires fewer than 20 lines of Python. Initialize the messages list, loop on user input, append each message, call ollama.chat(), print the response, and append the assistant reply. Add a system message at the start to customize the assistant's personality. Add conversation length management by trimming older messages when the list grows beyond a token budget, keeping the system message and most recent exchanges.

Handle Streaming Responses

For real-time output, pass stream=True to either generate() or chat(). This returns an iterator that yields response chunks as the model generates them. Each chunk contains a small portion of text (usually one token) in the response or message.content field, depending on which function you called. Print each chunk immediately to display text as it appears, creating a responsive user experience.

Streaming is important for user-facing applications because it eliminates the wait time between sending a request and seeing the first output. Without streaming, the user sees nothing until the entire response is generated, which can take several seconds for long responses. With streaming, the first token appears almost immediately, and subsequent tokens flow continuously.

Collect streamed chunks into a complete response by concatenating the text fragments. Initialize an empty string before the loop, append each chunk's text content during iteration, and use the complete string after the loop finishes. The final chunk in the stream includes the same metadata (token counts, duration) that non-streaming responses provide, so you can log performance metrics after streaming completes.

Creating Embeddings

Generate text embeddings with ollama.embed(model='nomic-embed-text', input='Your text here'). The function returns a dictionary with an embeddings field containing a list of vectors. For batch processing, pass a list of strings as the input parameter, and you receive one embedding vector per input string. Batch processing is significantly faster than making individual calls for each text.

Common embedding models available through Ollama include nomic-embed-text (768 dimensions, good general purpose), mxbai-embed-large (1024 dimensions, higher accuracy), and snowflake-arctic-embed (various sizes). Choose based on your accuracy requirements and storage constraints. Store the resulting vectors in a vector database like ChromaDB, Qdrant, or pgvector for similarity search and retrieval-augmented generation (RAG) applications.

A practical embedding workflow reads documents from a directory, splits them into chunks, generates embeddings for each chunk, and stores the vectors alongside the original text in a vector database. At query time, embed the user's question with the same model, search the vector database for the most similar chunks, and pass those chunks as context to a chat model for answer generation. This RAG pattern is one of the most common use cases for local embeddings.

Model Management from Python

The Python library provides full model lifecycle management. ollama.list() returns all installed models with their names, sizes, and metadata. ollama.pull('qwen3:14b') downloads a model from the library. ollama.delete('model-name') removes a model. ollama.show('model-name') returns the model's Modelfile, parameters, and template configuration.

Create custom models programmatically with ollama.create() by passing a model name and a Modelfile string. This is powerful for applications that generate custom model configurations dynamically, such as creating per-user model configurations with different system prompts or building a model management interface where users can customize parameters through a GUI.

Check which models are currently loaded in memory with ollama.ps(), which returns the model name, memory usage, and processor allocation for each active model. This is useful for server management scripts that monitor resource usage and unload idle models to free GPU memory for other tasks.

Async Support

The Ollama library includes an async client at ollama.AsyncClient for use with Python's asyncio. Create an instance with client = ollama.AsyncClient() and call methods with await. For example, response = await client.chat(model='qwen3:14b', messages=messages). Async streaming uses async for iteration over the response object.

Async support is essential for web applications built with FastAPI, aiohttp, or other async frameworks. Without async, a synchronous Ollama call blocks the entire event loop while waiting for generation to complete, preventing the server from handling other requests. The async client runs the model call in the background, freeing the event loop to process concurrent requests.

For applications that need to query multiple models or process multiple prompts simultaneously, async enables true parallelism. Use asyncio.gather() to send requests to multiple models at the same time and collect all responses when they complete. This pattern is useful for ensemble approaches, A/B testing different models, or processing batch jobs where multiple prompts need independent responses.

Error Handling and Best Practices

The library raises ollama.ResponseError for API errors, which includes the HTTP status code and error message. Wrap Ollama calls in try/except blocks and handle common error cases: model not found (pull it automatically or prompt the user), connection refused (Ollama server is not running), and out-of-memory errors (try a smaller model or reduce context size).

For production applications, implement connection health checks before making generation requests. A simple ollama.list() call at startup confirms the server is reachable. Log response metrics (token counts, generation duration) to monitor performance over time. Set reasonable timeouts for long-running generation requests to prevent your application from hanging if the model encounters an issue.

Keep the Ollama Python library updated with pip install --upgrade ollama to receive new features and bug fixes. The library follows semantic versioning, so minor version updates are backward compatible. Check the changelog before major version upgrades, as breaking changes may require code updates in your application.

Key Takeaway

The Ollama Python library provides a clean interface for text generation, chat, embeddings, and model management. Use ollama.chat() for conversations, ollama.generate() for single prompts, and ollama.embed() for vector embeddings. Add stream=True for real-time output in user-facing applications.