How to Run AI Locally on a Mac
Macs with Apple Silicon (M1, M2, M3, M4, and their Pro, Max, and Ultra variants) are well suited to local AI. In the unified memory architecture, the CPU and GPU share one pool of high-bandwidth memory, so the GPU can use most of the system RAM for model inference (macOS keeps a portion back for the system itself). On a PC, a dedicated GPU is limited to its own VRAM (typically 8 to 24 GB), while an Apple Silicon GPU can draw on a 16, 32, 64, or even 192 GB pool. This lets Macs run much larger models than similarly priced PCs with dedicated GPUs.
Check Your Mac Hardware
Click the Apple menu in the top-left corner and select "About This Mac." Check two things: your chip type and your memory amount.
Chip: You need an Apple Silicon chip (M1, M2, M3, M4, or any Pro/Max/Ultra variant). Intel Macs can run Ollama, but only on the CPU with no GPU acceleration, so local AI works there at significantly lower speed.
Memory: 8 GB is the minimum and suits small models (1B to 3B parameters). 16 GB handles 8B models comfortably and is the recommended starting point. 32 GB opens up models of around 30B parameters. 64 GB or more lets you run 70B parameter models. The memory shown here is your unified memory, which serves as both system RAM and GPU memory.
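The two checks above can also be scripted. The following is a minimal Python sketch, not part of any official tooling: the function names is_apple_silicon and memory_tier are hypothetical, and the thresholds simply restate the memory tiers listed above. It assumes Python's platform module reports "Darwin" and "arm64" on an Apple Silicon Mac, which may not hold if Python itself runs under Rosetta.

```python
import platform


def is_apple_silicon(system: str, machine: str) -> bool:
    """Return True when the OS is macOS and the CPU architecture is arm64."""
    return system == "Darwin" and machine == "arm64"


def memory_tier(total_gb: float) -> str:
    """Map unified memory (in GB) to the model sizes this guide suggests."""
    if total_gb >= 64:
        return "70B models"
    if total_gb >= 32:
        return "30B+ models"
    if total_gb >= 16:
        return "8B models"
    if total_gb >= 8:
        return "1B to 3B models"
    return "below the 8 GB minimum"


# Check the machine this script runs on, then look up an example tier.
print("Apple Silicon:", is_apple_silicon(platform.system(), platform.machine()))
print("16 GB ->", memory_tier(16))
```

Passing the system and machine strings in as arguments keeps the logic easy to check on any computer, not only on a Mac.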
Install Ollama for macOS
Go to ollama.com and click the download button for macOS. Open the downloaded .dmg file and drag the Ollama application to your Applications folder. Launch Ollama from Applications. On first launch, it installs a command-line tool and starts a background service. You will see a small llama icon in your menu bar indicating the service is running.
Verify the installation by opening Terminal (found in Applications, then Utilities, or search for it with Spotlight) and typing ollama --version. You should see a version number confirming the installation succeeded.
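The same check can be made from a script. The sketch below assumes the Ollama background service listens on http://localhost:11434 and answers GET /api/version with a JSON object containing a "version" field; both are assumptions about a default install rather than details stated above, and the helper names are hypothetical.

```python
import json
import urllib.request


def parse_version(body: bytes) -> str:
    """Extract the version string from a /api/version JSON response body."""
    return json.loads(body)["version"]


def fetch_version(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> str:
    """Ask a running Ollama service for its version (assumed default address)."""
    with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
        return parse_version(resp.read())


# Offline demonstration with a made-up response body.
print(parse_version(b'{"version": "0.0.0"}'))
```

Calling fetch_version() raises urllib.error.URLError when nothing is listening at that address, which is itself a sign that the service is not running.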
Ollama automatically detects Apple Silicon and enables Metal GPU acceleration. No additional driver installation or configuration is needed. This is one of the advantages of the Apple ecosystem for local AI: GPU acceleration works immediately without fiddling with CUDA drivers or ROCm.
Download a Model Sized for Your RAM
Choose a model that fits comfortably within your available memory. Leave at least 4 to 6 GB free for macOS and other applications. Here are recommendations by memory tier:
8 GB Mac: Run ollama run phi4-mini or ollama run qwen3:1.7b. These small models use 1.5 to 2.5 GB and leave plenty of room for macOS. Performance is snappy even on base M1 chips.
16 GB Mac: Run ollama run qwen3:8b. This is the sweet spot for most Mac users, providing strong general-purpose AI with comfortable memory headroom. The model uses about 5 to 6 GB, leaving 10 GB for everything else.
32 GB Mac: Run ollama run qwen3:32b or ollama run qwq:32b. The 32B tier is where local models start rivaling cloud services for many tasks. These models use 18 to 22 GB, which fits well with 32 GB total.
64+ GB Mac: Run ollama run llama3.3:70b or ollama run qwen2.5:72b. These are among the largest models that are practical to run locally, and they run well on high-memory Macs. Quality approaches cloud frontier models for many tasks.
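The tier list above amounts to a simple lookup. As an illustration only, the sketch below picks the first model named for each tier; the TIERS table and the recommend_model helper are hypothetical names, not part of Ollama.

```python
from typing import Optional

# (minimum unified memory in GB, first model suggested for that tier above)
TIERS = [
    (64, "llama3.3:70b"),
    (32, "qwen3:32b"),
    (16, "qwen3:8b"),
    (8, "phi4-mini"),
]


def recommend_model(total_gb: float) -> Optional[str]:
    """Return the model tag for the highest tier this much memory reaches."""
    for min_gb, tag in TIERS:
        if total_gb >= min_gb:
            return tag
    return None  # below the 8 GB minimum


for gb in (8, 16, 32, 64):
    print(f"{gb} GB -> ollama run {recommend_model(gb)}")
```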
Verify Metal GPU Acceleration
While a model is loaded, open a second Terminal window and run ollama ps. The output lists each loaded model with its name, memory usage, and processor. On Apple Silicon, the processor column should read "100% GPU," meaning the whole model is offloaded to the GPU. If it shows "CPU" instead, something is wrong with the Metal backend, which is rare but can happen after macOS updates. A split between CPU and GPU usually means the model is too large to fit entirely in GPU-accessible memory.
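The check above can also be done programmatically. The sketch below assumes that Ollama's local REST API exposes the same information at GET /api/ps, and that each entry in its "models" list carries "size" and "size_vram" byte counts; those field names reflect my understanding of the API and should be verified against your installed version. The gpu_percent helper is a hypothetical name.

```python
import json


def gpu_percent(entry: dict) -> float:
    """Share of a loaded model held in GPU memory, as a percentage."""
    size = entry.get("size", 0)
    if size <= 0:
        return 0.0
    return 100.0 * entry.get("size_vram", 0) / size


# Made-up response body in the assumed /api/ps shape.
sample = json.loads(
    '{"models": [{"name": "qwen3:8b", "size": 6000000000, "size_vram": 6000000000}]}'
)
for model in sample["models"]:
    print(model["name"], f"{gpu_percent(model):.0f}% GPU")
```

A value below 100 would correspond to a CPU/GPU split in the ollama ps output.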
Typical performance on Apple Silicon: an M1 with 16 GB generates 12 to 20 tokens per second with an 8B model. An M3 Pro or M4 with 32 GB generates 15 to 30 tokens per second. M2 Max, M3 Max, or M4 Max chips with higher memory bandwidth push 25 to 40 tokens per second. These speeds feel responsive and natural for interactive chat.
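To compare your own machine against these figures, you can compute tokens per second from the timing data Ollama returns. This sketch assumes the final JSON object from the /api/generate endpoint includes "eval_count" (tokens generated) and "eval_duration" (nanoseconds), as I understand the Ollama API to report; the helper name and the sample numbers are made up for illustration.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] * 1_000_000_000 / duration_ns


# Made-up numbers: 300 tokens generated in 15 seconds.
example = {"eval_count": 300, "eval_duration": 15_000_000_000}
print(f"{tokens_per_second(example):.1f} tokens/s")
```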
Add a Chat Interface
While the Terminal chat works fine, most Mac users prefer a visual interface. Two good options are available:
Open WebUI (recommended): If you have Docker Desktop installed, run the Open WebUI Docker container. It connects to your native Ollama instance automatically. Important: run Ollama natively (not in Docker) on Mac, because Docker on macOS does not have GPU passthrough to Metal. Only the web interface runs in Docker.
Open WebUI Desktop App: If you do not want to use Docker, download the Open WebUI desktop app, which runs natively and connects to your Ollama instance without any container overhead.
Either option gives you conversation history, model switching, file uploads, and a polished chat experience that feels comparable to cloud services like ChatGPT.
Mac-Specific Tips and Considerations
Memory pressure: macOS uses memory compression and swap to handle memory pressure, but relying on swap for AI inference dramatically slows performance. Open Activity Monitor (Applications, then Utilities) and check the Memory tab while running a model. If you see significant swap usage (shown in the "Swap Used" field), the model is too large for comfortable use on your machine. Consider a smaller model or closing other applications.
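The swap check can be scripted as well. The sketch below assumes that running sysctl vm.swapusage on macOS prints a line of the form "vm.swapusage: total = ...M  used = ...M  free = ...M"; that format is an assumption, and the swap_used_mb helper is a hypothetical name.

```python
import re


def swap_used_mb(sysctl_output: str) -> float:
    """Parse the 'used' figure (in MB) from `sysctl vm.swapusage` output."""
    match = re.search(r"used = ([\d.]+)M", sysctl_output)
    if match is None:
        raise ValueError("unrecognized vm.swapusage output")
    return float(match.group(1))


# Sample line in the assumed format. On a Mac, the real line could come from
# subprocess.run(["sysctl", "vm.swapusage"], capture_output=True, text=True).stdout
sample = "vm.swapusage: total = 2048.00M  used = 512.25M  free = 1535.75M  (encrypted)"
print(swap_used_mb(sample), "MB of swap in use")
```

Comparing the figure before and after loading a model shows whether that model is what pushes the machine into swap.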
Memory bandwidth matters: The Pro, Max, and Ultra variants of Apple Silicon chips have significantly higher memory bandwidth than the base chips. For AI inference, memory bandwidth directly affects token generation speed. An M4 Max generates tokens roughly twice as fast as an M4, even with the same model, because it can read the model weights from memory faster. If you are buying a Mac specifically for local AI, the higher-bandwidth chip variants provide a measurable performance improvement.
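The bandwidth argument can be put into numbers. If generating each token requires reading roughly all of the model weights from memory once, then bandwidth divided by model size gives a rough ceiling on tokens per second. The sketch below is a simplification under that assumption, and the bandwidth values are placeholders for illustration, not chip specifications.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Bandwidth-bound ceiling: every token reads the full weights once."""
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return bandwidth_gb_s / model_gb


# Placeholder bandwidths (not chip specifications) with a 5 GB model.
for bandwidth in (100.0, 200.0):
    ceiling = max_tokens_per_second(bandwidth, 5.0)
    print(f"{bandwidth:.0f} GB/s -> at most {ceiling:.0f} tokens/s")
```

Doubling the bandwidth doubles the ceiling. Measured speeds sit below it because compute and other overheads also take time.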
Ollama auto-start: Ollama launches automatically when you log in if it is in your Login Items. You can manage this in System Settings under General, then Login Items. Most users want Ollama running continuously so it is ready when needed.
Energy considerations: AI inference uses significant power on a laptop. When running on battery, token generation speed may decrease as macOS throttles performance to conserve energy. Plug in your MacBook when running large models for consistent performance.
Model compatibility: Every model available through Ollama works on Apple Silicon. There are no Mac-specific compatibility issues with model formats or architectures. The GGUF format used by Ollama runs identically on Apple Silicon, NVIDIA GPUs, and AMD GPUs. If a model works on one platform, it works on all of them with the same output quality.
Multiple models simultaneously: Macs with sufficient unified memory can keep multiple models loaded at the same time. A 32 GB Mac could keep an 8B general-purpose model and a small 3B coding model both in memory, ready to respond instantly without reloading. This is practical for workflows where you frequently switch between different model types.
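Whether a set of models can stay loaded together is a matter of simple addition. The sketch below sums approximate model sizes and adds the 4 to 6 GB this guide suggests leaving free for macOS; the fits_together helper is a hypothetical name, the sizes are the rough figures quoted in the memory tiers, and the calculation ignores the extra memory that long conversations consume.

```python
def fits_together(model_sizes_gb, total_gb: float, reserve_gb: float = 6.0) -> bool:
    """True if all models plus the macOS reserve fit in unified memory."""
    return sum(model_sizes_gb) + reserve_gb <= total_gb


# Rough sizes from the tiers in this guide: an 8B model (~6 GB)
# and a small model (~2.5 GB).
print(fits_together([6.0, 2.5], total_gb=32))
print(fits_together([6.0, 2.5], total_gb=8))
```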
Apple Silicon Macs are excellent for local AI because unified memory lets the GPU use most of the system RAM. Install Ollama, download a model sized for your memory, and you get GPU-accelerated local AI with zero driver configuration.