What You Need to Run AI Locally
Minimum Hardware Requirements
The absolute minimum to run a useful local AI model is 8 GB of system RAM, a 64-bit processor from the last six years, and roughly 5 to 10 GB of free storage space. This configuration runs small models (1B to 3B parameters) at reasonable speeds and can handle basic question answering, simple text generation, and lightweight coding assistance. Most computers sold since 2020 meet these requirements.
At the minimum tier, you are running quantized small models like Phi-4 Mini, Gemma 2B, or Qwen 3 0.6B. These models are surprisingly capable for their size and handle everyday tasks competently. They generate responses at 10 to 50+ tokens per second on CPU alone, which is fast enough that most interactions feel responsive. The tradeoff is that they produce less nuanced output than larger models and struggle with complex reasoning or long-form content generation.
Recommended Hardware for Comfortable Use
For a comfortable local AI experience that handles the tasks most people actually need, you want 16 GB of RAM, a modern multi-core processor, and at least 50 GB of free SSD storage. This configuration runs 7B to 8B parameter models, which is the sweet spot for quality versus hardware cost. Models at this size, such as Qwen 3 8B, Llama 3.1 8B, and Mistral 7B, deliver genuinely useful results across coding, writing, analysis, and conversation.
With 16 GB of RAM and CPU-only inference, an 8B model generates 5 to 15 tokens per second. This is noticeably slower than cloud services but entirely usable for interactive work. Responses appear at roughly the speed you can read them. For most people, this is the entry point that makes local AI practical rather than experimental.
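To make those speeds concrete, here is a minimal sketch of the arithmetic. The 300-token response length is an assumption chosen for illustration (roughly a few paragraphs of text); the generation rates are the CPU-only figures quoted above.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a response of num_tokens at a steady generation rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Hypothetical response length, used only for illustration.
RESPONSE_TOKENS = 300

slow = seconds_to_generate(RESPONSE_TOKENS, 5)   # low end of the CPU-only range
fast = seconds_to_generate(RESPONSE_TOKENS, 15)  # high end of the CPU-only range
print(f"{RESPONSE_TOKENS} tokens: {fast:.0f} to {slow:.0f} seconds")
```

Because the text streams as it is generated, you start reading immediately rather than waiting for the full response.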
Adding a dedicated GPU transforms the experience. An NVIDIA RTX 3060 with 12 GB of VRAM costs around $250 to $300 on the used market and accelerates 8B model inference to 30 to 60 tokens per second. At that speed, responses feel instantaneous. If your computer has a compatible GPU, enabling GPU acceleration is the single most impactful upgrade you can make.
GPU: The Speed Multiplier
A dedicated GPU is not required but makes a massive difference in response speed. GPUs are designed for the parallel matrix operations that language model inference relies on, making them 5 to 20 times faster than CPUs for this workload.
The key metric for local AI is VRAM (Video RAM), not the GPU compute performance itself. The model needs to fit in VRAM for full GPU acceleration. If the model exceeds your VRAM capacity, Ollama automatically splits it between GPU and CPU memory, which still provides acceleration but not as much as full GPU offloading.
Here are the practical VRAM tiers for local AI in 2026:
4 GB: Handles 1B to 3B models comfortably.
8 GB: The entry point for 7B to 8B models with Q4 quantization.
12 GB: Comfortable headroom for 8B models, and fits some 13B models.
16 GB: Runs 13B to 14B models.
24 GB: Opens the door to quantized 30B+ models.
40 to 48 GB: Handles 70B class models, though this requires professional-grade hardware.
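The tiers above can be sketched as a simple lookup. The function name and the wording of the fallback message are illustrative; the thresholds are copied from the tiers in this section and are rules of thumb, not hard limits.

```python
# (minimum VRAM in GB, what that tier handles), largest tier first.
VRAM_TIERS = [
    (40, "70B-class models"),
    (24, "quantized 30B+ models"),
    (16, "13B to 14B models"),
    (12, "8B models with headroom, some 13B models"),
    (8, "7B to 8B models with Q4 quantization"),
    (4, "1B to 3B models"),
]


def tier_for_vram(vram_gb: float) -> str:
    """Return the largest tier whose minimum VRAM the card meets."""
    for minimum_gb, description in VRAM_TIERS:
        if vram_gb >= minimum_gb:
            return description
    return "below the 4 GB tier; expect CPU-only or partial GPU offload"


for card_gb in (4, 12, 24):
    print(f"{card_gb} GB VRAM: {tier_for_vram(card_gb)}")
```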
NVIDIA GPUs currently offer the best compatibility with local AI tools. The RTX 3060 (12 GB), RTX 3090 (24 GB), RTX 4060 Ti (16 GB), and RTX 4090 (24 GB) are the most commonly recommended cards. AMD GPUs work with Ollama but have historically had rougher compatibility. Intel Arc GPUs have emerging support but are not yet recommended for beginners.
Apple Silicon: A Special Case
Apple Silicon Macs (M1, M2, M3, M4 series) deserve dedicated attention because their unified memory architecture fundamentally changes the local AI equation. On a traditional PC, the GPU has its own dedicated VRAM separate from system RAM. On Apple Silicon, the CPU and GPU share a single pool of memory, which means a Mac with 32 GB of unified memory can give the GPU most of that 32 GB for model inference, minus what macOS and other running applications need.
This makes Apple Silicon Macs exceptionally cost-effective for running larger models. A Mac Mini M4 with 32 GB of unified memory (around $800) can run 30B parameter models that would require a $1,000+ GPU on a PC. The M4 Max and Ultra-class chips (such as the M3 Ultra), in configurations from 64 GB to 192 GB and beyond, can run models that exceed what any single consumer GPU can handle.
The tradeoff is raw speed. Apple Silicon generates tokens somewhat slower than equivalent NVIDIA GPUs for models that fit entirely in VRAM, typically 12 to 25 tokens per second versus 30 to 60 tokens per second. But for models that would not fit in a consumer GPU at all, Apple Silicon provides access that would otherwise require professional or multi-GPU setups.
Storage Considerations
Model files are large and benefit significantly from SSD storage. A quantized 7B model is approximately 4 to 5 GB. A 13B model is 7 to 9 GB. A 70B model is 35 to 40 GB. If you want to keep multiple models downloaded and ready to switch between, plan for at least 50 GB of dedicated storage, and more is better.
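These file sizes follow from a simple rule of thumb: parameter count times bits per weight, divided by 8 bits per byte. The sketch below assumes about 4.5 bits per weight as a typical effective figure for 4-bit quantization; that number is an assumption for illustration, and real files vary with the quantization format and the model's architecture.

```python
def estimated_file_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate quantized model size in GB.

    Billions of parameters x bits per weight / 8 bits per byte gives
    billions of bytes, i.e. decimal gigabytes.
    """
    return params_billions * bits_per_weight / 8


for size in (7, 13, 70):
    print(f"{size}B model: about {estimated_file_gb(size):.1f} GB")
```

The estimates land close to the sizes quoted above. Actual downloads tend to run slightly larger, and the model needs additional memory at run time for its context window.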
SSDs are strongly recommended over mechanical hard drives. Loading a model from SSD takes 5 to 15 seconds; loading from a hard drive can take 1 to 3 minutes. Since Ollama automatically unloads inactive models to free memory and reloads them when needed, faster storage directly improves your workflow.
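The difference in load time is mostly a matter of read throughput. As a rough sketch, the throughput figures below are assumptions for illustration (about 500 MB/s for a SATA SSD and about 50 MB/s for a hard drive under real-world conditions); your own drives may be faster or slower.

```python
def load_seconds(model_gb: float, read_mb_per_second: float) -> float:
    """Time to read a model file from disk at a sustained read rate."""
    return model_gb * 1000 / read_mb_per_second


MODEL_GB = 4.5             # a quantized 7B model, per the sizes above
SSD_MB_PER_SECOND = 500    # assumed throughput, illustrative only
HDD_MB_PER_SECOND = 50     # assumed throughput, illustrative only

print(f"SSD: {load_seconds(MODEL_GB, SSD_MB_PER_SECOND):.0f} seconds")
print(f"HDD: {load_seconds(MODEL_GB, HDD_MB_PER_SECOND):.0f} seconds")
```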
Software Requirements
The software requirements are simple. Ollama is the primary tool most people use to run local AI models. It is available for macOS, Linux, and Windows, and installs with a single command or installer download. Ollama handles model downloading, GPU detection, memory management, and API serving automatically.
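As an illustration of the API serving, here is a minimal sketch that sends a prompt to a local Ollama server using only the Python standard library. It assumes Ollama is running on its default address (localhost, port 11434) and that you have already downloaded the model you name; the helper function names are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(result: dict) -> float:
    """Generation speed from the timing fields in Ollama's response.

    eval_count is the number of generated tokens; eval_duration is the
    generation time in nanoseconds.
    """
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(model: str, prompt: str, timeout: float = 300.0) -> dict:
    """Send the prompt and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, something like result = generate("qwen3:8b", "Why is the sky blue?") returns a dictionary whose "response" field holds the generated text, and tokens_per_second(result) lets you compare your own hardware against the speeds quoted in this guide. The model tag here is only an example; use whichever name the ollama list command shows on your machine.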
Optionally, Open WebUI provides a browser-based chat interface that runs alongside Ollama. It requires Docker or Python 3.11. For the smoothest experience, Docker is the recommended installation method for Open WebUI since it handles all dependencies automatically.
No special drivers are needed beyond your operating system defaults, with one exception: NVIDIA GPU users should ensure they have recent NVIDIA drivers installed. On Windows, these come through Windows Update or the NVIDIA website. On Linux, the NVIDIA driver package from your distribution repository is sufficient.
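One way to confirm the NVIDIA driver is installed and see how much VRAM you have is to query nvidia-smi, the command-line tool that ships with the driver. The sketch below wraps that query in Python; the function names are illustrative, and the script simply reports no GPUs if the tool is not on your PATH.

```python
import shutil
import subprocess

# Ask nvidia-smi for each GPU's name and total memory (in MiB) as plain CSV.
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"]


def parse_gpus(output: str) -> list[tuple[str, float]]:
    """Turn lines like 'NVIDIA GeForce RTX 3060, 12288' into (name, GiB) pairs."""
    gpus = []
    for line in output.strip().splitlines():
        name, memory_mib = line.rsplit(",", 1)
        gpus.append((name.strip(), float(memory_mib) / 1024))
    return gpus


def detect_gpus() -> list[tuple[str, float]]:
    """Return detected NVIDIA GPUs, or an empty list if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools not installed or not on PATH
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    return parse_gpus(result.stdout) if result.returncode == 0 else []


for name, vram_gb in detect_gpus():
    print(f"{name}: {vram_gb:.0f} GB VRAM")
```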
Recommended Configurations by Budget
Free (use what you have): Any computer with 8+ GB of RAM and a modern processor. Install Ollama, run a small model like Phi-4 Mini or Qwen 3 0.6B, and see what local AI can do. This costs nothing and takes about ten minutes.
Budget ($0 to $300): Your existing computer plus a used NVIDIA RTX 3060 (12 GB VRAM, $250 to $300). This is the upgrade that makes the biggest difference. An 8B model on this GPU generates 30 to 60 tokens per second, transforming local AI from usable to excellent.
Mid-range ($800 to $1,200): A Mac Mini M4 with 32 GB unified memory ($800) or a PC with 32 GB RAM and an RTX 4060 Ti 16 GB ($1,000 to $1,200 total). Either configuration runs 13B to 30B models well, opening access to significantly more capable AI.
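High-end ($2,000+): A Mac Studio with 64+ GB unified memory or a PC with an RTX 4090 (24 GB VRAM) and 64 GB of RAM. This tier runs 70B-class models at usable speeds, although on the PC part of a model that size has to sit in system RAM because it exceeds 24 GB of VRAM. Output quality at this tier approaches cloud AI for many tasks.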
Total Cost Breakdown
The total cost of getting started with local AI ranges from zero (using hardware you already own) to a few hundred dollars if you add a GPU. A computer with 16 GB of RAM and no discrete GPU runs 8B models on CPU at usable speeds with no additional investment. Adding a used NVIDIA RTX 3060 for $250 to $300 transforms performance dramatically and is the single best upgrade for local AI.
All the software is free: Ollama is open source, Open WebUI is open source, and the models themselves are freely downloadable. There are no license fees, subscriptions, or per-token charges. The only ongoing cost is electricity, which amounts to a few dollars per month even with heavy daily use. Compared to cloud AI subscriptions at $20 to $200 per month, a modest hardware investment such as a used GPU pays for itself in a year or two against a $20 plan and within a few months against the higher-priced tiers.
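The payback period is easy to work out for your own situation. In this sketch, the $3 monthly electricity figure is an assumption standing in for "a few dollars"; the hardware and subscription prices are the ones quoted in this guide.

```python
def payback_months(hardware_cost: float,
                   subscription_per_month: float,
                   electricity_per_month: float) -> float:
    """Months until the hardware cost is offset by the subscription you no longer pay."""
    monthly_saving = subscription_per_month - electricity_per_month
    if monthly_saving <= 0:
        return float("inf")  # local running costs match or exceed the subscription
    return hardware_cost / monthly_saving


GPU_COST = 300      # used RTX 3060, upper end of the quoted price
ELECTRICITY = 3     # assumed monthly electricity cost, illustrative only

for plan in (20, 200):
    months = payback_months(GPU_COST, plan, ELECTRICITY)
    print(f"${plan}/month plan: pays back in about {months:.1f} months")
```

This comparison only holds if a local model actually replaces the subscription for your tasks, so it is worth trying the free tier first.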
You probably already own hardware that can run AI locally. Start with what you have, install Ollama, and try a small model. If you want better speed, a used GPU is the most cost-effective upgrade.