Hardware Requirements for Ollama

Updated May 2026

Running Ollama effectively requires enough memory to hold your chosen model, and GPU VRAM is the most important factor for performance. At the common Q4_K_M quantization, an 8B model needs about 6GB of VRAM, a 14B model about 10GB, a 32B model 20 to 22GB, and a 70B model 43GB or more. This guide covers specific hardware recommendations for every budget and use case.

Memory Is the Bottleneck

Local model inference is fundamentally memory-bound. The model weights must stay in memory for the whole of inference, and the amount of available GPU VRAM determines which models you can run at full speed. CPU, disk speed, and other factors matter far less than whether the entire model fits in GPU-accessible memory.

At the standard Q4_K_M quantization level, memory requirements break down predictably by model size. A 1 to 3B model needs 1 to 3GB. A 7 to 8B model needs 5 to 6GB. A 14B model needs 9 to 10GB. A 32B model needs 20 to 22GB. A 70B model needs 42 to 45GB. These figures represent the model weights only and do not include the 1 to 4GB of additional memory needed for the KV cache during inference, which scales with context length.
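The weight figures above can be roughly reproduced from parameter count alone. The sketch below assumes Q4_K_M averages about 4.85 bits per weight; that number is an assumption for illustration, not taken from this guide, and the real average varies by model architecture. Expect the estimate to land at or slightly below the ranges quoted above.

```python
# Rough weight-size estimate for a quantized model.
# ASSUMPTION: Q4_K_M averages about 4.85 bits per weight. The true figure
# varies by architecture, so treat the result as a ballpark only.
Q4_K_M_BITS_PER_WEIGHT = 4.85


def estimate_weight_gb(params_billions: float,
                       bits_per_weight: float = Q4_K_M_BITS_PER_WEIGHT) -> float:
    """Return the approximate weight size in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for size in (8, 14, 32, 70):
        print(f"{size}B: ~{estimate_weight_gb(size):.1f} GB")
```

Passing a different `bits_per_weight` (for example 8.5 for an 8-bit quantization, again an approximate figure) shows how quickly higher-precision formats push a model into the next VRAM tier.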

The practical implication is straightforward: add 2 to 3GB to the model weight size as a typical KV cache allowance, and more if you run long contexts. If that total fits in your GPU VRAM, you will get full-speed inference. If it does not, Ollama offloads part of the model to the CPU and generation slows significantly.
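The rule of thumb above can be written as a small check. This is a minimal sketch using the upper end of each weight range from this guide and the low end of the KV cache allowance; the function and table names are hypothetical.

```python
# Weight sizes in GB at Q4_K_M, upper end of each range quoted in this guide.
WEIGHT_GB = {"8B": 6, "14B": 10, "32B": 22, "70B": 45}

# Low end of the 2 to 3 GB KV cache allowance; raise it for long contexts.
KV_OVERHEAD_GB = 2


def fits_in_vram(model: str, vram_gb: float,
                 kv_overhead_gb: float = KV_OVERHEAD_GB) -> bool:
    """True if weights plus the KV cache allowance fit in the given VRAM."""
    return WEIGHT_GB[model] + kv_overhead_gb <= vram_gb


if __name__ == "__main__":
    for model, vram in (("8B", 8), ("14B", 16), ("32B", 24), ("70B", 24)):
        verdict = "fits" if fits_in_vram(model, vram) else "needs CPU offload"
        print(f"{model} on {vram} GB: {verdict}")
```

With the default 2GB allowance a 32B model just fits on a 24GB card. Calling `fits_in_vram("32B", 24, kv_overhead_gb=3)` returns `False`, which shows how little margin that tier leaves for longer contexts.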

NVIDIA GPU Recommendations

For the 8B model tier, the RTX 3060 12GB and RTX 4060 8GB both handle 8B models well. The 12GB card leaves more room for the KV cache, making it the better choice if you plan to use contexts longer than 4096 tokens. It also remains widely available at reasonable prices on the used market, which suits budget-conscious buyers.
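The headroom matters because the KV cache grows linearly with context length. The sketch below uses the standard size formula for a transformer KV cache and assumes a Llama 3 8B style shape (32 layers, 8 KV heads, head dimension 128) with 16-bit cache entries; other 8B models and quantized caches will give different numbers.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the key/value cache for a single sequence.

    Every token stores one key vector and one value vector (the leading
    factor of 2) for each KV head in each layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# ASSUMPTION: Llama 3 8B style shape with a 16-bit (2-byte) cache.
LLAMA3_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (4096, 8192, 32768):
    gib = kv_cache_bytes(context_tokens=ctx, **LLAMA3_8B) / 2**30
    print(f"{ctx:>6} tokens: {gib:.1f} GiB")
```

Under these assumptions the cache takes 0.5 GiB at 4096 tokens, 1.0 GiB at 8192, and 4.0 GiB at 32768, so an 8GB card that is comfortable at short contexts can run out of room at long ones.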

For the 14B tier, the RTX 4060 Ti 16GB is the sweet spot, fitting 14B models comfortably with room for a generous KV cache. The RTX 4070 12GB works for 14B models at Q4_K_M but leaves less headroom. If you frequently need longer contexts, the 16GB card is worth the premium.

For the 32B tier, the RTX 4090 with 24GB is the consumer card of choice. It fits 32B models at Q4_K_M with 2 to 4GB to spare for the KV cache. The RTX A5000 with 24GB offers the same VRAM in a workstation form factor. For the 70B tier, single-GPU inference requires a workstation card like the RTX A6000 with 48GB or a data center card like the H100 with 80GB.

Multi-GPU configurations work with Ollama, distributing model layers across two or more cards. Two RTX 3090s with 24GB each provide 48GB of combined VRAM, enough for 70B models. However, the cross-GPU communication overhead reduces efficiency compared to a single card with equivalent VRAM, so a single 48GB card outperforms two 24GB cards in most scenarios.
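To see how much combined VRAM a multi-GPU machine offers, you can query `nvidia-smi` and add up the totals. The sketch below is one way to do that; the helper names are hypothetical, and the sample string stands in for real output so the parsing can be tried on a machine without an NVIDIA driver.

```python
import subprocess


def parse_vram_mib(csv_text: str) -> list[tuple[str, int]]:
    """Parse 'name, memory.total' rows from nvidia-smi CSV output."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so commas inside a GPU name are preserved.
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def query_gpus() -> list[tuple[str, int]]:
    """Ask nvidia-smi for each GPU's name and total memory in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_vram_mib(result.stdout)


# Illustrative sample for a dual RTX 3090 system (not captured from a real run).
SAMPLE = "NVIDIA GeForce RTX 3090, 24576\nNVIDIA GeForce RTX 3090, 24576\n"

gpus = parse_vram_mib(SAMPLE)
total_mib = sum(mem for _, mem in gpus)
print(f"{len(gpus)} GPUs, {total_mib} MiB total ({total_mib / 1024:.0f} GiB)")
```

On a real system, replace `parse_vram_mib(SAMPLE)` with `query_gpus()`. The combined total is only an upper bound on usable capacity, since each card also holds its own share of the KV cache and runtime buffers.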

Apple Silicon Recommendations

Apple Silicon machines offer a uniquely efficient platform for Ollama thanks to unified memory architecture. The GPU accesses the same physical memory as the CPU with no copy overhead, so most of system RAM doubles as GPU-accessible memory. This means a MacBook Pro with 32GB of unified memory can run models that would otherwise need a high-VRAM discrete GPU, although macOS keeps part of that memory for the system, so the amount usable by the GPU is somewhat less than the full 32GB.

The base M2 or M3 with 8GB handles 8B models but leaves minimal headroom. The M2 Pro with 16GB or M3 Pro with 18GB comfortably runs 8B models and can handle 14B models with careful context management. The M2 Max with 32GB or M3 Max with 36GB runs 14B models easily and handles 32B models. The M4 Max with 48GB or 64GB and the M2 Ultra with 128GB or 192GB reach into territory where large open-weight models such as 70B fit entirely in memory.

Once a model fits in memory, bandwidth rather than capacity limits generation speed on Apple Silicon. The M2 Max achieves about 400 GB/s, the M3 Max 300 to 400 GB/s depending on configuration, and the M2 Ultra about 800 GB/s. Higher bandwidth translates directly to faster token generation, so the Ultra chips deliver significantly better inference speed than the Max chips even when both have sufficient memory for the same model.
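A common back-of-the-envelope way to see why bandwidth matters: generating each token requires reading roughly all of the model weights once, so bandwidth divided by model size gives a ceiling on tokens per second. The sketch below applies that simplification, which ignores KV cache reads and compute time, so real throughput will be lower.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    return bandwidth_gb_s / model_gb


# A 32B model at Q4_K_M, taken here as 20 GB of weights.
MODEL_GB = 20

for bandwidth in (400, 800):
    ceiling = max_tokens_per_second(bandwidth, MODEL_GB)
    print(f"{bandwidth} GB/s: at most {ceiling:.0f} tokens/s")
```

Under this estimate, doubling bandwidth from 400 GB/s to 800 GB/s raises the ceiling from 20 to 40 tokens per second for the same model.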

AMD GPU Support

Ollama supports AMD GPUs through the ROCm framework. The RX 7900 XTX with 24GB and the Radeon PRO W7900 with 48GB are the most capable consumer and professional AMD options. ROCm support has improved substantially since 2024, and most popular models run correctly on AMD hardware, though NVIDIA GPUs still receive more testing and optimization attention from the Ollama team.

AMD users should verify ROCm compatibility for their specific GPU model before purchasing hardware for Ollama. The ROCm framework supports a narrower range of GPU architectures than CUDA, and some older or lower-end AMD GPUs lack support entirely. The RX 7000 series and newer professional cards generally have the best support.

CPU-Only Operation

Ollama runs on CPU alone for machines without supported GPUs, but performance is dramatically lower. A modern CPU like an Intel Core i9 or AMD Ryzen 9 with 64GB of system RAM can run a 32B model, but at 3 to 8 tokens per second compared to 20 to 40 tokens per second on a capable GPU. This is usable for testing, light experimentation, and non-interactive batch processing, but not practical for responsive interactive use.

System RAM requirements for CPU-only operation match the model memory needs, plus whatever your operating system and other applications require. Running a 14B model at Q4_K_M needs about 10GB for the model, plus 2 to 3GB for the KV cache, plus 4 to 8GB for your OS and applications, totaling roughly 16 to 21GB minimum. Having more RAM than the minimum allows the operating system to cache model files for faster subsequent loads.
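The arithmetic for the 14B example can be laid out explicitly. This short sketch sums the low and high ends of each component from the paragraph above; the helper name is hypothetical.

```python
def add_ranges(*ranges: tuple[float, float]) -> tuple[float, float]:
    """Sum (low, high) ranges component-wise."""
    low = sum(r[0] for r in ranges)
    high = sum(r[1] for r in ranges)
    return low, high


# Figures from this guide, in GB, for a 14B model at Q4_K_M.
weights = (10, 10)
kv_cache = (2, 3)
os_and_apps = (4, 8)

low, high = add_ranges(weights, kv_cache, os_and_apps)
print(f"Minimum system RAM: {low} to {high} GB")
```

The components sum to 16 to 21GB, so a 16GB machine sits right at the lower bound and a 32GB machine is the more comfortable choice.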

Building a Budget Inference Machine

For a dedicated local inference setup on a budget, the most cost-effective approach is a used or refurbished workstation with a consumer GPU. A system with a Ryzen 5 or Intel Core i5, 32GB of system RAM, and an RTX 3060 12GB provides a capable platform for 8B to 14B models at a total cost around $500 to $700. The GPU does the heavy lifting, so the CPU and other components do not need to be high-end.

Upgrading the GPU is the single most impactful improvement you can make. Moving from an 8GB card to a 12GB card opens the 14B tier. Moving to a 24GB card opens the 32B tier. Each tier jump provides a meaningful quality improvement in model output. Storage speed matters less than you might expect: a model loads into memory once and inference then runs entirely from memory, so the difference between a SATA SSD and an NVMe drive affects initial load time but is negligible for inference speed.

For Mac users, refurbished M2 Max MacBook Pros with 32GB offer strong local inference capability in a laptop form factor. The unified memory architecture makes them competitive with discrete GPU setups at the 14B to 32B model tier, and the portability means you can run models anywhere without depending on a desktop setup.

Key Takeaway

GPU VRAM is the most critical factor for Ollama performance. Choose hardware that fits your target model entirely in GPU memory: 8GB for 8B models, 16GB for 14B models, 24GB for 32B models, and 48GB+ for 70B models. Apple Silicon's unified memory architecture makes Macs uniquely efficient for local inference.