GPU vs CPU for Local AI: What You Need
Why GPUs Are Faster for AI Inference
Language model inference is fundamentally a matrix multiplication workload. Each token the model generates requires multiplying large matrices of numbers, specifically the model weights against the input data. CPUs process these operations with limited parallelism (8 to 24 cores on a modern desktop CPU). GPUs process them with massive parallelism (thousands of simpler cores built for exactly this kind of parallel arithmetic).
An NVIDIA RTX 3060 has 3,584 CUDA cores and can perform roughly 13 trillion single-precision floating-point operations per second at peak. A modern desktop CPU has 8 to 24 cores and typically peaks around an order of magnitude lower. This architectural difference translates directly into token generation speed: GPUs produce tokens 5 to 20 times faster than CPUs for the same model.
The critical requirement is that the model must fit in the GPU's VRAM to get full acceleration. If the model exceeds VRAM capacity, Ollama automatically splits it between GPU and CPU, which provides partial acceleration but not the full speed benefit: the layers in GPU memory run fast, while the layers in system memory become the bottleneck.
Memory bandwidth is another important factor that often gets overlooked. GPU memory has much higher bandwidth than system RAM, so the GPU can read the model weights faster during each inference step. An RTX 3060 provides 360 GB/s of bandwidth, while typical DDR4 system RAM provides 25 to 50 GB/s. Because generating each token requires reading all of the model weights once, this bandwidth difference directly limits how many tokens per second the hardware can produce.
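The bandwidth argument can be turned into a rough ceiling on generation speed. The sketch below assumes one full read of the weights per token and a 5 GB quantized model (an assumed figure for illustration); the function name is hypothetical, and the result is an upper bound, not a prediction.

```python
def bandwidth_bound_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/s if each token needs one full read of the weights."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Assumption: a Q4-quantized 8B model occupies about 5 GB.
MODEL_SIZE_GB = 5.0

# Bandwidth figures taken from the paragraph above.
MEMORY_SYSTEMS = [
    ("RTX 3060 VRAM", 360.0),
    ("DDR4 (low end)", 25.0),
    ("DDR4 (high end)", 50.0),
]

for name, bandwidth in MEMORY_SYSTEMS:
    ceiling = bandwidth_bound_tokens_per_second(bandwidth, MODEL_SIZE_GB)
    print(f"{name}: at most about {ceiling:.0f} tokens/s")
```

Measured throughput typically lands below these ceilings, because compute time, cache behavior, and software overhead also cost time per token.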
Real-World Performance Numbers
These benchmarks reflect typical performance for a quantized 7B to 8B parameter model (Q4_K_M format) in mid-2026.
CPU-only (modern desktop): 5 to 15 tokens per second. Usable for interactive chat, noticeably slower than cloud services. Responses for short queries appear in 2 to 5 seconds, while longer responses take proportionally more time. Adequate for occasional use and testing.
NVIDIA RTX 3060 (12 GB VRAM): 30 to 50 tokens per second. Feels responsive and fluid. Short responses appear nearly instantly, and long responses stream at a comfortable reading pace. This is the entry-level GPU experience that transforms local AI from functional to pleasant.
NVIDIA RTX 4090 (24 GB VRAM): 50 to 80 tokens per second for 8B models, 15 to 30 tokens per second for 30B models. The high-end consumer experience, where responses feel instantaneous for small models and very responsive for large ones.
Apple M4 (16 to 32 GB unified memory): 12 to 25 tokens per second for 8B models, using the unified memory architecture for GPU acceleration through the Metal framework. Slower than dedicated NVIDIA GPUs but faster than CPU-only, with the advantage that the GPU can address most of system memory rather than a fixed pool of VRAM.
For larger models (13B to 70B parameters), throughput falls roughly in proportion to model size. A 70B model runs at roughly 1 to 3 tokens per second on CPU only, which is usable for batch processing but frustrating for interactive chat. On a high-end GPU with sufficient VRAM, the same model generates 15 to 25 tokens per second.
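To see what these throughput ranges mean in waiting time, the sketch below converts tokens per second into seconds per response for the 8B figures in this section. The 300-token response length is an assumption chosen for illustration, and the calculation ignores prompt processing time before the first token appears.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady generation rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# (slow, fast) tokens/s for an 8B model, from the benchmarks in this section.
THROUGHPUT_RANGES = {
    "CPU-only": (5, 15),
    "RTX 3060": (30, 50),
    "RTX 4090": (50, 80),
    "Apple M4": (12, 25),
}

RESPONSE_TOKENS = 300  # assumption: a few paragraphs of output

for hardware, (slow, fast) in THROUGHPUT_RANGES.items():
    best = generation_seconds(RESPONSE_TOKENS, fast)
    worst = generation_seconds(RESPONSE_TOKENS, slow)
    print(f"{hardware}: {best:.1f} to {worst:.1f} s for {RESPONSE_TOKENS} tokens")
```

Under these assumptions a 300-token answer takes 20 to 60 seconds on CPU and 6 to 10 seconds on an RTX 3060.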
VRAM: The GPU Bottleneck That Matters Most
VRAM (video RAM) is the GPU's dedicated memory, and it is the single most important GPU specification for local AI. The model weights must fit in VRAM for full GPU acceleration. If the model exceeds VRAM capacity, performance drops significantly because data must shuttle between the fast GPU memory and the slower system RAM over the PCIe bus.
A quantized 8B parameter model (Q4_K_M format, the default in Ollama) uses approximately 5 to 6 GB of VRAM. A 6 GB GPU can run it, but 8 GB provides more headroom for context window memory. A 12 GB GPU handles 8B models comfortably with room for large context windows and even some 13B models at aggressive quantization levels.
At the 30B parameter tier, you need 18 to 22 GB of VRAM for full GPU offloading. This puts you in RTX 4090 (24 GB) territory at the consumer level. For 70B models, you need 40 to 48 GB, which requires professional cards like the A6000 or multiple consumer GPUs working together.
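The VRAM figures above follow from simple arithmetic on parameter count and bits per weight. The sketch below assumes about 4.8 bits per weight for Q4_K_M (an approximation; the exact value varies by model) and a flat 1 GB allowance for runtime overhead and a small context. Both numbers and the function names are illustrative assumptions.

```python
def estimate_weight_gb(params_billions: float, bits_per_weight: float = 4.8) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes).

    bits_per_weight=4.8 is an assumed average for Q4_K_M quantization.
    """
    return params_billions * bits_per_weight / 8


def fits_in_vram(params_billions: float, vram_gb: float,
                 overhead_gb: float = 1.0, bits_per_weight: float = 4.8) -> bool:
    """True if weights plus an assumed fixed overhead fit in the given VRAM."""
    needed = estimate_weight_gb(params_billions, bits_per_weight) + overhead_gb
    return needed <= vram_gb


for size in (8, 30, 70):
    weights = estimate_weight_gb(size)
    print(f"{size}B model: about {weights:.1f} GB of weights, "
          f"fits in 24 GB: {fits_in_vram(size, 24)}")
```

The estimate is deliberately coarse. Longer context windows and different quantization levels shift the result, so treat a narrow fit as a prompt to measure rather than as a guarantee.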
When the model partially fits in VRAM, Ollama splits the model layers between GPU and CPU. The GPU-resident layers process quickly, but the CPU-resident layers create a bottleneck that reduces overall speed. A model that is 70% offloaded to GPU performs roughly 2 to 3 times faster than pure CPU, because the 30% of layers left on the CPU still run at CPU speed, and it remains significantly slower than full GPU offloading. Partial offloading is a useful middle ground when your VRAM is slightly too small for a model you want to run.
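The bottleneck effect of partial offloading can be sketched with an Amdahl-style model. It assumes layers run one after another, every layer costs the same time on a given device, and transfer overhead is ignored. The function name and the GPU-to-CPU speed ratios are illustrative assumptions, and measured results will vary.

```python
def partial_offload_speedup(gpu_fraction: float, gpu_vs_cpu: float) -> float:
    """Speedup over pure CPU when gpu_fraction of the layers run gpu_vs_cpu times faster."""
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    if gpu_vs_cpu <= 0:
        raise ValueError("gpu_vs_cpu must be positive")
    cpu_time = 1.0 - gpu_fraction          # share of per-token time still spent on CPU
    gpu_time = gpu_fraction / gpu_vs_cpu   # offloaded share, shrunk by the GPU's speed
    return 1.0 / (cpu_time + gpu_time)


# Assumed per-layer GPU advantage, spanning the 5x to 20x range quoted earlier.
for ratio in (5, 10, 20):
    speedup = partial_offload_speedup(0.7, ratio)
    print(f"70% offloaded, GPU {ratio}x faster per layer: {speedup:.1f}x overall")
```

In this model the 30% of layers left on the CPU caps the overall speedup at about 3.3x (1 / 0.3) however fast the GPU is, which is why fitting the whole model in VRAM matters so much.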
Context window memory also consumes VRAM. Each token in the conversation history uses a small amount of VRAM for the key-value cache. With default 8K context windows this is modest compared with the model weights, but extending context to 32K or 128K tokens can consume several additional gigabytes or more. If you plan to use long context windows, account for this when choosing your GPU.
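Key-value cache growth is linear in context length, which a short calculation makes concrete. The sketch below assumes a model shape typical of 8B models with grouped-query attention (32 layers, 8 key-value heads, head dimension 128) and a 16-bit cache. These defaults are assumptions for illustration; other architectures and quantized caches give different totals.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for context_tokens tokens.

    The default shape is an assumed 8B-class model, not a specific product.
    """
    # Factor of 2: one key vector and one value vector per head, per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token


GIB = 2 ** 30

for context in (8 * 1024, 32 * 1024, 128 * 1024):
    size_gib = kv_cache_bytes(context) / GIB
    print(f"{context} tokens: about {size_gib:.0f} GiB of KV cache")
```

Under these assumptions the cache costs 128 KiB per token, so a 32K context needs roughly 4 GiB on top of the weights.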
When CPU-Only Is Good Enough
CPU-only inference is adequate in several scenarios. If you are running small models (1B to 4B parameters), CPU performance is fast enough that most users do not notice a significant difference from GPU. Phi-4 Mini and Gemma 2B generate 15 to 50+ tokens per second on a modern CPU, which feels responsive for interactive use.
For batch processing and automation where response time is not critical (processing documents, generating data, running overnight tasks), CPU-only inference for 7B to 8B models works well. The model is slower per token but can run indefinitely without tying up a GPU.
CPU-only is also appropriate for evaluation and testing. If you want to try local AI before investing in a GPU, running on CPU lets you evaluate model quality, experiment with different models, and learn the tools without any hardware purchase. The output quality is the same whether you run on CPU or GPU; only the speed differs.
Finally, Apple Silicon Macs blur the line between CPU and GPU inference. The unified memory architecture means the GPU is always involved in inference on these machines, providing acceleration without requiring a discrete GPU card.
When You Need a Dedicated GPU
A dedicated GPU becomes important when you use AI frequently throughout the day and want responses to feel instant, when you run models at 13B+ parameters that are painfully slow on CPU, when you integrate AI into interactive workflows like code completion or real-time writing assistance, or when you run multiple simultaneous inference sessions.
The upgrade from CPU to GPU is the single most impactful performance improvement you can make for local AI. It transforms the experience from "functional but slow" to "fast and natural." If you use local AI daily, a GPU is a strong investment.
Recommended GPUs for Local AI
Budget pick: NVIDIA RTX 3060 12 GB ($250 to $300 used). The best value entry point with enough VRAM for 8B models and solid performance. This is the GPU most commonly recommended in the local AI community.
Mid-range: NVIDIA RTX 4060 Ti 16 GB ($400 to $500). More VRAM for larger models and better power efficiency. Handles 13B models comfortably.
High-end: NVIDIA RTX 4090 24 GB ($1,500 to $1,800). The fastest consumer GPU for AI inference, with enough VRAM for quantized 30B+ models. Overkill for 8B models but excellent for larger models.
Professional: NVIDIA A6000 (48 GB) or used A100 (40/80 GB). For running 70B+ models entirely in VRAM. Expensive, but apart from combining multiple consumer GPUs, this is the only practical route to full GPU acceleration of the largest models.
AMD GPUs (RX 7900 XTX with 24 GB, for example) work with Ollama through ROCm but have historically had less polished support than NVIDIA. If you already own an AMD GPU, it is worth trying, but for a new purchase specifically for AI, NVIDIA remains the safer choice.
Cost Analysis: GPU Investment vs Cloud Spending
The financial case for a GPU depends on how much you currently spend (or would spend) on cloud AI services and how frequently you use AI throughout the day.
A $20 per month ChatGPT Plus subscription costs $240 per year. A $250 used RTX 3060 that replaces that subscription therefore pays for itself in a little over 12 months ($250 / $20 per month = 12.5 months), and afterward inference costs only the electricity it uses. For users who subscribe to multiple services or whose API-based spending exceeds $40 to $50 per month, the payback period is even shorter.
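The payback arithmetic is easy to rerun with your own numbers. The sketch below is a minimal calculator; the function name is hypothetical and the optional electricity term is an assumption layered on top of the subscription comparison.

```python
def payback_months(hardware_cost: float, monthly_cloud_spend: float,
                   monthly_electricity: float = 0.0) -> float:
    """Months until avoided cloud spending covers the hardware cost."""
    monthly_savings = monthly_cloud_spend - monthly_electricity
    if monthly_savings <= 0:
        raise ValueError("cloud spend must exceed electricity cost for payback")
    return hardware_cost / monthly_savings


# $250 GPU against the monthly spending levels mentioned above.
for spend in (20, 40, 50):
    months = payback_months(250, spend)
    print(f"${spend}/month replaced: payback in {months:.1f} months")
```

Passing a nonzero `monthly_electricity` lengthens the payback slightly, which matters most at the $20 per month level where the margin is thinnest.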
API-based pricing makes the comparison more dramatic for heavy users. Cloud API costs typically range from $3 to $15 per million input tokens and $15 to $75 per million output tokens for frontier models. A developer making 100+ AI queries per day at these rates can easily spend $100 to $300 per month. A one-time $250 to $500 GPU investment removes these ongoing costs for the majority of queries.
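A monthly API bill depends heavily on how many tokens each query carries, which the paragraph above leaves open. The sketch below assumes 4,000 input tokens and 1,000 output tokens per query (illustrative values, not measurements) and applies the low and high ends of the quoted price ranges.

```python
def monthly_api_cost(queries_per_day: int, input_tokens: int, output_tokens: int,
                     input_price_per_million: float, output_price_per_million: float,
                     days: int = 30) -> float:
    """Estimated monthly spend in dollars for a steady daily query volume."""
    per_query = (input_tokens * input_price_per_million
                 + output_tokens * output_price_per_million) / 1_000_000
    return queries_per_day * days * per_query


# Assumptions: 100 queries/day, 4,000 input and 1,000 output tokens per query.
low = monthly_api_cost(100, 4000, 1000, 3.0, 15.0)
high = monthly_api_cost(100, 4000, 1000, 15.0, 75.0)
print(f"Estimated monthly API cost: ${low:.0f} to ${high:.0f}")
```

With these assumptions the estimate spans roughly $80 to $400 per month, so the result is sensitive to both the model tier and the prompt size you actually use.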
The counter-argument is that GPU hardware depreciates, consumes electricity, and requires a compatible desktop system. If you primarily use a laptop, adding a GPU is not practical (external GPU enclosures exist but add cost and complexity). In this case, Apple Silicon Macs offer a middle ground with their unified memory GPU acceleration, or you can run a separate desktop as a local AI server accessible from your laptop over the network.
Electricity costs for GPU inference are modest. An RTX 3060 draws up to about 170 watts at full load. At typical US electricity rates ($0.12 to $0.15 per kWh), running inference for 4 hours per day uses about 20 kWh per month and costs roughly $2.50 to $3, which is negligible compared to cloud subscription savings.
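The electricity estimate can be reproduced directly. The sketch below assumes the card draws its full 170 watts for every hour of use and a 30-day month, which makes it a worst-case figure; the function name is hypothetical.

```python
def monthly_electricity_cost(watts: float, hours_per_day: float,
                             price_per_kwh: float, days: int = 30) -> float:
    """Monthly cost in dollars of running a device at a constant power draw."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * price_per_kwh


# 170 W for 4 hours per day at the low and high electricity rates above.
low = monthly_electricity_cost(170, 4, 0.12)
high = monthly_electricity_cost(170, 4, 0.15)
print(f"Estimated electricity cost: ${low:.2f} to ${high:.2f} per month")
```

Real usage is usually cheaper than this, since the GPU only approaches full load while it is actively generating tokens.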
A GPU makes local AI 5 to 20 times faster. The RTX 3060 12 GB at $250 to $300 used is the best value upgrade. CPU-only works for small models and occasional use, but daily users benefit significantly from GPU acceleration. The investment pays for itself in about a year for most regular users.