Ollama GPU Setup and Configuration

Updated May 2026
Ollama automatically detects and uses compatible GPUs for accelerated inference, but getting the best performance requires proper driver installation and understanding how to configure GPU-related settings. This guide covers NVIDIA CUDA, AMD ROCm, and Apple Metal setup, along with multi-GPU configuration and troubleshooting for common GPU issues.

NVIDIA GPU Setup

NVIDIA GPUs are the most widely used and best-supported option for Ollama. The primary requirement is having the NVIDIA driver installed on your system. Ollama bundles its own CUDA libraries, so you do not need to install the CUDA Toolkit separately. The NVIDIA driver handles communication between the operating system and the GPU, while Ollama's bundled CUDA libraries handle the computation.

On Linux, install the NVIDIA driver through your distribution's package manager. Ubuntu users can run sudo apt install nvidia-driver-550 (or the latest available version). Verify the installation with nvidia-smi, which should display your GPU model, the driver version, and the highest CUDA version that driver supports. This guide's baseline is NVIDIA driver version 450 or newer, but the minimum has varied between Ollama releases, so check the Ollama GPU documentation for your version and prefer the latest available driver for the best performance and compatibility.
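The nvidia-smi check above can also be scripted. The following is a minimal sketch, assuming nvidia-smi is on PATH and that every queried field returns a number (some GPUs report [N/A] for memory fields, which this sketch does not handle). The function names are illustrative and are not part of Ollama or the NVIDIA tooling.

```python
import csv
import io
import shutil
import subprocess

# Standard nvidia-smi --query-gpu properties; memory values come back in MiB.
QUERY_FIELDS = ["name", "driver_version", "memory.total", "memory.used"]


def parse_gpu_rows(csv_text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        name, driver, total_mib, used_mib = (cell.strip() for cell in row)
        gpus.append({
            "name": name,
            "driver_version": driver,
            "memory_total_mib": int(total_mib),
            "memory_used_mib": int(used_mib),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi if available; return [] when it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        free = gpu["memory_total_mib"] - gpu["memory_used_mib"]
        print(f'{gpu["name"]}: driver {gpu["driver_version"]}, {free} MiB free')
```

An empty result means the driver is either not installed or not loaded, which is the same condition that prevents Ollama from detecting the GPU.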

On Windows, download the NVIDIA driver from the NVIDIA website or keep it updated with the NVIDIA app (the successor to GeForce Experience). The driver installs the components needed for CUDA acceleration. After installing both the driver and Ollama, run ollama run llama4 (or any model that fits your VRAM), then run ollama ps in a second terminal. Its PROCESSOR column reports whether the model is loaded on the GPU, on the CPU, or split between them. Generation speed is a secondary indicator, as GPU inference is typically 5 to 10 times faster than CPU-only.

If Ollama does not detect your NVIDIA GPU, check that the driver is installed correctly by running nvidia-smi. Common issues include outdated drivers, drivers not loaded after a kernel update (requiring a reboot), or display manager conflicts on Linux. Setting the environment variable OLLAMA_DEBUG=1 produces verbose logging that shows GPU detection details and can help identify the issue.

AMD GPU Setup

AMD GPU support in Ollama works through the ROCm (Radeon Open Compute) framework. ROCm support has improved significantly since 2024, with most modern AMD GPUs now working reliably with Ollama. The RX 7000 series (RDNA 3) and Radeon PRO W7000 series have the best support, while older architectures may have limited compatibility.

On Linux, install the ROCm stack through AMD's official packages. The installation includes the ROCm runtime, HIP libraries, and device drivers. After installation, verify the setup with rocm-smi, which should display your GPU information. Add your user to the render and video groups to enable GPU access without root privileges.

Set the HSA_OVERRIDE_GFX_VERSION environment variable if your specific GPU model is not recognized by default. This variable tells the ROCm runtime to treat your GPU as a compatible architecture, which can resolve detection issues with GPUs that are architecturally similar to officially supported models but not explicitly listed in the compatibility table. The value uses x.y.z syntax: for example, HSA_OVERRIDE_GFX_VERSION=10.3.0 makes a gfx1034 GPU run as the supported gfx1030 target.
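To illustrate how an override value relates to a GPU target name, here is a small sketch that converts a target such as gfx1030 into the x.y.z form. It assumes the last two digits of the target are the minor version and stepping and everything before them is the major version. Targets containing a hex letter (such as gfx90a) are rejected rather than guessed at. The helper name is hypothetical.

```python
import re


def gfx_to_override(target):
    """Convert a target like 'gfx1030' to the 'x.y.z' form ('10.3.0').

    Assumption: last two digits are minor and stepping, the rest is major.
    Only all-decimal targets are handled; others raise ValueError.
    """
    match = re.fullmatch(r"gfx(\d+)(\d)(\d)", target)
    if match is None:
        raise ValueError(f"unsupported target format: {target!r}")
    major, minor, stepping = match.groups()
    return f"{int(major)}.{minor}.{stepping}"


if __name__ == "__main__":
    # Pass the closest *supported* target, not the unrecognized GPU's own one.
    print("HSA_OVERRIDE_GFX_VERSION=" + gfx_to_override("gfx1030"))
```

The value to set is the one for the nearest supported architecture, so pick that target from the ROCm compatibility table first and convert it second.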

AMD GPU performance with Ollama is generally competitive with NVIDIA for single-user inference, though NVIDIA remains ahead in raw throughput for certain model architectures and quantization types. The RX 7900 XTX with 24GB of VRAM provides a strong AMD option for running 4-bit quantized 14B to 32B models entirely in VRAM, with no CPU offloading.

Apple Silicon Setup

Apple Silicon Macs require no GPU setup at all. Ollama uses the Metal framework automatically on M-series chips, and the unified memory architecture means there is no separate GPU memory to manage. The GPU draws from the same pool as system RAM, although macOS caps the share the GPU may use, so not quite all of it is available for model weights. This makes Apple Silicon one of the most straightforward platforms for running Ollama.

The key consideration on Apple Silicon is total system memory, since it determines the largest model you can run. Memory bandwidth, which varies between chip tiers, determines generation speed. The M2 Max at 400 GB/s, the M3 Max at up to 400 GB/s, and the Ultra-class chips (M2 Ultra, M3 Ultra) at roughly 800 GB/s represent the spectrum from capable to excellent for local inference.
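The link between memory bandwidth and generation speed can be made concrete with a common rule of thumb: generating one token reads roughly all of the model's weights once, so bandwidth divided by model size gives an approximate ceiling on tokens per second. The sketch below applies that rule. The 20 GB model size is an illustrative assumption, and real throughput will be lower than this bound.

```python
def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Rough upper bound: each generated token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    model_size_gb = 20  # hypothetical quantized model, for illustration only
    for bandwidth in (400, 800):
        ceiling = max_tokens_per_second(bandwidth, model_size_gb)
        print(f"{bandwidth} GB/s -> at most ~{ceiling:.0f} tokens/s")
```

Under this estimate, doubling bandwidth doubles the ceiling for the same model, which is why chip tier matters as much as total memory.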

Install Ollama on macOS by downloading the application from ollama.com or installing through Homebrew. The application runs as a menu bar item and starts the API server automatically. No driver installation, no framework configuration, and no environment variables are needed for GPU acceleration to work.

Multi-GPU Configuration

Ollama supports distributing a model across multiple GPUs when a single GPU does not have enough VRAM. This happens automatically when you have multiple GPUs and the model exceeds the VRAM of any single card. Ollama splits the model layers across available GPUs, with each GPU processing its assigned layers in sequence.

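To control which GPUs Ollama uses, set the CUDA_VISIBLE_DEVICES environment variable on NVIDIA systems, or HIP_VISIBLE_DEVICES (or the lower-level ROCR_VISIBLE_DEVICES) on AMD systems. The variable must be set in the environment of the Ollama server process, not just in the shell where you run the client. For example, CUDA_VISIBLE_DEVICES=0,1 makes only the first two GPUs visible to Ollama, which is useful when you want to reserve other GPUs for different applications.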
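As a sketch of how GPU selection might be applied from a launcher script, the helper below builds an environment that exposes only the chosen GPU indices. The function name and the vendor keys are hypothetical. The variable names are the ones given above, and the AMD entry in particular is worth checking against your ROCm and Ollama versions.

```python
import os

# Variable names as given in the text; treat the AMD entry as an assumption
# to verify against your ROCm and Ollama versions.
VISIBILITY_VARS = {
    "nvidia": "CUDA_VISIBLE_DEVICES",
    "amd": "HIP_VISIBLE_DEVICES",
}


def server_env(vendor, gpu_indices, base_env=None):
    """Return a copy of the environment that exposes only the given GPUs."""
    env = dict(os.environ if base_env is None else base_env)
    env[VISIBILITY_VARS[vendor]] = ",".join(str(i) for i in gpu_indices)
    return env


if __name__ == "__main__":
    env = server_env("nvidia", [0, 1])
    print(env["CUDA_VISIBLE_DEVICES"])  # prints 0,1
    # To apply it, start the server with this environment, for example:
    #   subprocess.Popen(["ollama", "serve"], env=env)
```

Copying the environment rather than mutating os.environ keeps the restriction scoped to the Ollama server process, so other programs started from the same script still see every GPU.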

Multi-GPU inference has overhead from communication between GPUs, so two 12GB GPUs will not perform as well as a single 24GB GPU for the same model. However, multi-GPU is still significantly faster than CPU offloading, making it a practical approach when you have multiple smaller GPUs and need to run a model that does not fit on any single card.

VRAM Allocation and Monitoring

Understanding VRAM usage helps you make informed decisions about model selection and configuration. Use nvidia-smi on NVIDIA or rocm-smi on AMD to see current VRAM consumption. On macOS, Activity Monitor's Memory tab shows memory usage and its GPU History window shows GPU utilization. The ollama ps command displays which models are loaded and how much memory they occupy.

When multiple applications compete for VRAM, Ollama may not be able to load a model that would otherwise fit. Close GPU-consuming applications like video editors, 3D renderers, or games to free VRAM for Ollama. Some desktop environments also consume VRAM for compositor effects, which can reduce available memory by 200 to 500MB.

The OLLAMA_GPU_OVERHEAD environment variable (where supported) reserves a fixed amount of VRAM on each GPU, specified in bytes, for other applications, preventing Ollama from consuming all available GPU memory. This is useful on machines where you need to run Ollama alongside other GPU-accelerated applications.

Troubleshooting Common Issues

If models run slower than expected, the most common cause is partial CPU offloading. Run ollama ps to see the CPU/GPU split for each loaded model, and check the model loading log for messages about layer placement, which indicate how many layers are on the GPU versus the CPU. If layers are on the CPU when you have sufficient VRAM, other applications may be consuming GPU memory, or the VRAM calculation may include overhead that pushes the model past your limit.
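Partial offloading can also be checked programmatically. The sketch below assumes a local Ollama server on its default address (http://localhost:11434) and uses the size and size_vram fields that the /api/ps endpoint reports for each loaded model. The helper names are illustrative.

```python
import json
import urllib.request

PS_URL = "http://localhost:11434/api/ps"  # default local Ollama address


def gpu_fraction(model):
    """Share of a loaded model's memory that resides in VRAM (0.0 to 1.0)."""
    size = model.get("size", 0)
    if size <= 0:
        return 0.0
    return model.get("size_vram", 0) / size


def running_models(url=PS_URL, timeout=5):
    """Fetch the list of loaded models from a running Ollama server."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response).get("models", [])


def report(models):
    """Return one status line per model, flagging partial CPU offload."""
    lines = []
    for model in models:
        fraction = gpu_fraction(model)
        status = "fully on GPU" if fraction >= 1.0 else "partly or fully on CPU"
        lines.append(f'{model.get("name", "?")}: {fraction:.0%} in VRAM ({status})')
    return lines
```

With the server running and a model loaded, print("\n".join(report(running_models()))) lists each model's VRAM share. Anything below 100% means some layers are being served from system RAM.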

Out-of-memory errors during model loading mean the model and KV cache together exceed available VRAM. Solutions include using a smaller quantization level (Q4_K_M instead of Q5_K_M), reducing the context window size with num_ctx, choosing a smaller model variant, or closing other GPU-consuming applications to free VRAM.
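To see why reducing num_ctx helps, it is useful to estimate the KV cache size. The sketch below uses the standard formula for a transformer with an unquantized 16-bit cache: one key and one value vector per layer, per KV head, per context position. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical example, and Ollama's real allocation also includes compute buffers that this estimate ignores.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, num_ctx, bytes_per_value=2):
    """Estimate KV cache size; the leading 2 covers keys plus values."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_value


def fits_in_vram(weights_bytes, cache_bytes, free_vram_bytes, overhead_bytes=0):
    """True if weights, cache, and reserved overhead fit in free VRAM."""
    return weights_bytes + cache_bytes + overhead_bytes <= free_vram_bytes


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    for num_ctx in (8192, 4096):
        cache = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                               num_ctx=num_ctx)
        print(f"num_ctx={num_ctx}: ~{cache / GIB:.2f} GiB of KV cache")
```

For this example shape, the cache scales linearly with context length, so halving num_ctx from 8192 to 4096 halves the cache from about 1 GiB to about 0.5 GiB.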

On Linux, if Ollama stops detecting the GPU after a system update, the kernel may have been updated without rebuilding the NVIDIA driver module. Running sudo dkms autoinstall or reinstalling the NVIDIA driver usually resolves this. A system reboot is required after driver reinstallation.

Key Takeaway

NVIDIA GPUs need only a driver install for Ollama GPU acceleration. AMD requires the ROCm stack. Apple Silicon works automatically with zero setup. For all platforms, the key optimization is ensuring your model fits entirely in GPU memory to avoid the dramatic performance penalty of CPU offloading.