How to Install and Run Ollama

Updated May 2026
Installing Ollama takes under five minutes on any operating system. You download the installer from ollama.com, run it, then type a single command to download and start chatting with a local AI model. This guide walks through every step from installation to running your first model, verifying GPU acceleration, and managing your model library.

Ollama is the most popular tool for running AI models locally. It handles model downloading, memory management, GPU acceleration, and inference in a single package. Once installed, it runs as a background service that you interact with through the command line or through frontend applications like Open WebUI. The installation process is straightforward on macOS, Windows, and Linux.

Download and Install Ollama

Go to ollama.com and download the installer for your operating system. Ollama provides native installers for macOS, Windows, and Linux.

macOS: Download the .dmg file and drag Ollama to your Applications folder. When you launch it for the first time, Ollama installs a command-line tool and starts a background service. You will see an Ollama icon in your menu bar indicating the service is running.

Windows: Download the .exe installer and run it. The installer sets up Ollama as a Windows service that starts automatically. After installation, you can access Ollama from any command prompt or PowerShell window.

Linux: The fastest method is the install script: open a terminal and run curl -fsSL https://ollama.com/install.sh | sh. This downloads and installs Ollama, sets it up as a systemd service, and starts it automatically. For manual installation or distribution-specific packages, see the Ollama GitHub repository.

Verify the Installation

Open a terminal (Terminal on macOS, Command Prompt or PowerShell on Windows, any terminal on Linux) and run:

ollama --version

You should see a version number like ollama version 0.6.x. If you get a "command not found" error, the installation did not complete correctly or the command-line tool is not in your system PATH. On macOS, try launching the Ollama application from your Applications folder first, as this installs the command-line component. On Linux, log out and log back in to refresh your PATH.

You can also verify the Ollama service is running by visiting http://localhost:11434 in a browser. You should see a message confirming Ollama is running. This is the API endpoint that Ollama uses to receive requests.

Download and Run Your First Model

Run the following command to download and immediately start chatting with Qwen 3 8B, a strong general-purpose model:

ollama run qwen3:8b

The first time you run this command, Ollama downloads the model (approximately 4.9 GB for the Q4_K_M quantized version). The download typically completes in a few minutes depending on your internet speed. Once downloaded, the model loads into memory and you see a prompt where you can type messages and receive responses.

Type a question or prompt and press Enter. The model generates a response, streaming it word by word in the terminal. To exit the chat session, type /bye and press Enter.

If the model fails to load with an out-of-memory error, your system does not have enough RAM for this model size. Try a smaller model instead: ollama run phi4-mini (requires approximately 2.5 GB of RAM) or ollama run qwen3:1.7b (requires approximately 1.5 GB).

Verify GPU Acceleration

While a model is loaded (either in a chat session or within the default 5-minute keep-alive window after your last request), run this command in a separate terminal:

ollama ps

This shows which models are currently loaded, how much memory they are using, and whether they are running on GPU or CPU. The output includes a "Processor" column that indicates where the model is running. You want to see "GPU" or a percentage indicating partial GPU offloading.

If the model is running on CPU when you expect GPU acceleration, check that your GPU drivers are up to date. NVIDIA users need current CUDA drivers. AMD users need ROCm. Apple Silicon Macs use Metal automatically and should always show GPU acceleration. If GPU acceleration is not working, the model still runs correctly on CPU, just at lower speed.

Explore Additional Models

Ollama provides access to hundreds of models. Here are the essential commands for managing your model library:

ollama list shows all models currently downloaded on your system, their sizes, and when they were last modified.

ollama pull llama3.3:8b downloads a model without starting a chat session. Useful for pre-downloading models you plan to use later.

ollama rm modelname:tag deletes a model from your system and frees the disk space.

ollama show modelname:tag displays detailed information about a model, including its parameters, template, and system prompt.

Popular models to try include llama3.3:8b for general purpose use, qwen3-coder:8b for programming, and deepseek-r1:8b for reasoning tasks. Each model has different strengths, so experimenting with several helps you find the right fit for your workflow.

Basic Configuration and Troubleshooting

Ollama works with default settings for most users, but a few configuration options are worth knowing about.

Model storage location: Ollama stores downloaded models in your home directory by default (~/.ollama/models on macOS and Linux, C:\Users\username\.ollama\models on Windows). If your home drive is small, you can change this by setting the OLLAMA_MODELS environment variable to a path on a larger drive before starting Ollama.

Context window size: The default context window is 8192 tokens. To increase it for longer conversations, set the num_ctx parameter when running a model: ollama run qwen3:8b --num_ctx 16384. Larger context windows use more memory, so increase this only if you need longer conversation history.

Keep-alive time: Ollama keeps models loaded in memory for 5 minutes after the last request by default. You can change this with the OLLAMA_KEEP_ALIVE environment variable. Set it to 0 to unload models immediately, or -1 to keep them loaded indefinitely. To manually unload a model, run ollama stop modelname.

Network access: By default, Ollama only listens on localhost (127.0.0.1). If you want other devices on your network to access your Ollama server (for example, to use Open WebUI from a phone or tablet), set the OLLAMA_HOST environment variable to 0.0.0.0. Be aware that this exposes your Ollama instance to all devices on your local network.

Common issues: If Ollama fails to start, check that no other service is using port 11434. If model downloads fail or stall, check your internet connection and try again. Partial downloads resume automatically. If a model generates garbled output, the download may have been corrupted, so delete it with ollama rm and re-download it.

API usage: Once Ollama is running, it serves an API on port 11434 that other applications can use. This API is compatible with the OpenAI Chat Completions format, meaning tools and scripts built for the OpenAI API often work with Ollama by simply changing the base URL to http://localhost:11434. This compatibility is what allows Open WebUI, IDE extensions, and custom applications to work with your local models seamlessly.

Key Takeaway

Installing Ollama is three steps: download from ollama.com, install, then run ollama run qwen3:8b. The entire process from zero to chatting with a local AI model takes under five minutes on a decent internet connection.