RAM Requirements for Running AI Models Locally

Updated May 2026

A quantized 7B parameter model needs approximately 5 to 6 GB of RAM, a 13B model needs 8 to 10 GB, a 32B model needs 18 to 22 GB, and a 70B model needs 35 to 45 GB. Your total system RAM should exceed these numbers by at least 4 to 6 GB to leave room for your operating system and other applications. This guide provides exact requirements for every common model size and quantization level.

How Model Size Translates to Memory Usage

Language model memory usage is determined by two factors: the number of parameters and the precision at which those parameters are stored. A model parameter is a single number (a weight), and at full 16-bit precision (FP16), each parameter requires 2 bytes of memory. So a 7 billion parameter model at full precision needs 14 GB of memory just for the weights, plus additional memory for the context window and processing overhead.

Quantization reduces memory requirements by storing parameters at lower precision. The most common quantization format for local AI is Q4_K_M (4-bit with mixed precision), which reduces memory per parameter from 2 bytes to approximately 0.55 to 0.65 bytes. This means a 7B model at Q4_K_M uses roughly 4 to 5 GB for the weights, down from 14 GB at full precision. The quality reduction from Q4 quantization is surprisingly small, typically less than 5% on benchmarks.

Beyond the model weights, inference requires additional memory for the KV cache (which stores the context of the conversation) and processing buffers. The KV cache size depends on the context length and model architecture. For a typical 8K context window, expect an additional 0.5 to 1.5 GB on top of the model weights. Longer context windows (32K, 128K) use proportionally more memory.

Memory Requirements by Model Size

These numbers reflect the total system memory used during inference at Q4_K_M quantization (the Ollama default), including model weights, KV cache for an 8K context window, and processing overhead.

1B to 3B models (Phi-4 Mini, Gemma 2B, Qwen 3 1.7B): 1.5 to 3 GB total. Runs comfortably on any machine with 8 GB of system RAM. These models leave plenty of memory for your operating system, browser, and other applications.

7B to 8B models (Qwen 3 8B, Llama 3.3 8B, Mistral Small 3): 5 to 6 GB total. Requires at least 8 GB of system RAM (tight, with limited room for other applications) or 16 GB for comfortable use. This is the most popular model tier for local AI.

13B to 14B models (Qwen 3 14B, various 13B fine-tunes): 8 to 10 GB total. Requires 16 GB of system RAM minimum, 32 GB recommended for comfortable multitasking alongside the model.

30B to 32B models (QwQ 32B, Qwen 3 32B): 18 to 22 GB total. Requires 32 GB of system RAM minimum. On GPU, needs at least 24 GB of VRAM for full GPU acceleration.

65B to 70B models (Llama 3.3 70B, Qwen 3 72B): 35 to 45 GB total. Requires 48 to 64 GB of system RAM. On GPU, needs 40+ GB of VRAM (professional-grade cards or multiple GPUs). Apple Silicon Macs with 64+ GB unified memory handle this tier well.

Quantization Levels Explained

Ollama supports multiple quantization levels, each trading quality for memory savings. Understanding these helps you choose the right balance for your hardware.

Q8 (8-bit): Nearly identical quality to full precision, approximately 1 byte per parameter. A 7B model uses about 7 to 8 GB. Use this if you have ample memory and want the best possible quality.

Q6_K (6-bit): Very minor quality reduction, approximately 0.8 bytes per parameter. A good middle ground between Q8 and Q4 for users with moderate memory.

Q4_K_M (4-bit mixed, default): The standard recommended quantization. Less than 5% quality reduction on most benchmarks, approximately 0.55 to 0.65 bytes per parameter. This is what Ollama uses by default and what most local AI users run.

Q3_K and Q2_K (3-bit, 2-bit): Noticeable quality degradation, especially Q2. Use only when you absolutely need to fit a larger model into limited memory and accept the quality tradeoff. A Q2-quantized 13B model uses less memory than a Q4 7B model but typically produces lower quality output.

VRAM vs System RAM for GPU Users

If you have a dedicated GPU, the model ideally fits entirely in VRAM for maximum speed. The VRAM requirements match the model memory requirements listed above: a Q4-quantized 8B model needs approximately 5 to 6 GB of VRAM. If your GPU has exactly that amount or slightly more, the model fits and you get full GPU acceleration.

When the model exceeds your VRAM, Ollama splits it between GPU and CPU automatically. The layers in GPU memory process at GPU speed, while the layers in system RAM process at CPU speed. This hybrid approach provides partial acceleration, with the speedup proportional to how much of the model fits in VRAM. Even having half the model on GPU provides a noticeable improvement over pure CPU.

Apple Silicon Macs handle this differently. Their unified memory architecture means the GPU can access all system memory, so a Mac with 32 GB of unified memory can load a 30 GB model entirely into GPU-accessible memory. This is why Apple Silicon is so popular for local AI despite generating tokens slower per unit of VRAM than dedicated NVIDIA cards.

Context Window Memory Overhead

Beyond the model weights themselves, the context window (conversation history) consumes additional RAM. Every token in the conversation context requires memory for the key-value cache, which grows linearly with context length and model size. For an 8B model with the default 8192 token context, this adds roughly 500 MB to 1 GB of memory usage. Extending the context to 32K or 128K tokens can add 2 to 8 GB, which is significant when you are already close to your memory limit.

Practical Memory Management Tips

Close memory-intensive applications before running large models. A web browser with many tabs can use 2 to 4 GB of RAM, which directly competes with model memory. If you run into issues loading a model, check your system memory usage first.

Ollama keeps loaded models in memory for 5 minutes by default after the last request, then unloads them. You can adjust this timeout or manually unload models with ollama stop modelname to reclaim memory immediately.

If a model barely fits in your memory, consider using a lower quantization level. Dropping from Q4 to Q3 reduces memory usage by roughly 20% with a modest quality decrease. This can be the difference between a model loading successfully or failing with out-of-memory errors.

Reducing context length also saves memory. If you do not need long conversation history, configuring a shorter context window (2048 or 4096 tokens instead of the default 8192) can free up 0.5 to 1 GB of memory, enough to make a tight fit work.

Swap and Virtual Memory: Why They Do Not Help

When a model exceeds available RAM, your operating system can use swap space (disk-based virtual memory) to prevent crashes. However, this does not provide a usable experience for AI inference. Disk-based swap is 100 to 1,000 times slower than RAM, meaning token generation slows to a crawl. A model that generates 10 tokens per second in RAM might produce one token every few seconds from swap, making interactive use impractical.

Some users wonder if fast NVMe SSDs make swap viable for AI. While NVMe drives are much faster than traditional hard drives, they are still orders of magnitude slower than RAM. NVMe read speeds top out around 7 GB/s for the fastest drives, compared to 50 to 100 GB/s for DDR5 system RAM. AI inference constantly reads through the entire model for every single token generated, so this bandwidth gap directly impacts performance.

macOS handles memory pressure more gracefully than most operating systems through its memory compression technology, which can squeeze more effective capacity from physical RAM. This helps when running models that are close to your RAM limit but does not fundamentally change the physics: if the model significantly exceeds your available memory, performance degrades regardless of the operating system.

The practical advice is simple: if a model does not fit comfortably in your available RAM with room left for the operating system and other applications, choose a smaller model or a lower quantization level rather than relying on swap.

Key Takeaway

A 7B to 8B model needs 5 to 6 GB and runs well on 16 GB of system RAM. A 13B model needs 16 GB minimum. For 30B+ models, you need 32+ GB. Quantization (Q4_K_M default) reduces memory by 75% compared to full precision with minimal quality loss.

How Model Size Translates to Memory Usage

Memory Requirements by Model Size

Quantization Levels Explained

VRAM vs System RAM for GPU Users

Context Window Memory Overhead

Practical Memory Management Tips

Swap and Virtual Memory: Why They Do Not Help

Related Articles

GPU vs CPU for Local AI: What You Need

What You Need to Run AI Locally

Best Local AI Models for Different Tasks

Can My Computer Run AI Models Locally