What GPU Do I Need for AI Agents

Updated May 2026

For running 7B parameter AI models at Q4 quantization, you need a GPU with at least 8 GB of VRAM, such as the NVIDIA GTX 1070 8 GB ($120 used) or RTX 3060 12 GB ($180 used). For 13B models, 12 to 16 GB is enough. For 30B models, you need 24 GB of VRAM from an RTX 3090 ($700 used) or RTX 4090 ($1,600 new). For 70B models, you need 40 GB or more via professional cards like the A100 or multi-GPU consumer setups.

The Answer Depends on Your Model Size

GPU selection for AI agents is driven almost entirely by one specification: VRAM (Video RAM) capacity. VRAM determines which models fit in GPU memory, and models that fit in VRAM run 5x to 20x faster than models that spill to system RAM. The most common mistake is buying a GPU with fast compute but insufficient VRAM, only to discover that the target model cannot load entirely on the card.

The VRAM rule of thumb is simple: at Q4 quantization (the most common format for local inference), a model needs approximately 0.5 GB per billion parameters for weights, plus 1 to 3 GB of overhead for KV-cache and runtime. A 7B model needs about 5 GB total. A 13B model needs about 8 GB. A 30B model needs about 18 GB. A 70B model needs about 38 GB.
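The rule of thumb above can be written as a small calculation. This is a minimal sketch, not a measurement tool: the function name estimate_vram_gb and the default 2 GB overhead (the midpoint of the 1 to 3 GB range) are assumptions for illustration. It also generalizes the 0.5 GB per billion parameters at Q4 to other bit widths by simple scaling, which ignores per-format metadata.

```python
def estimate_vram_gb(params_billion, bits_per_weight=4, overhead_gb=2.0):
    """Rough VRAM estimate in GB: quantized weights plus a flat overhead.

    bits_per_weight=4 reproduces the 0.5 GB per billion parameters
    quoted for Q4. overhead_gb is an assumed midpoint of the 1 to 3 GB
    range for KV-cache and runtime, not a measured value.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


if __name__ == "__main__":
    for size in (7, 13, 30, 70):
        print(f"{size}B at Q4: about {estimate_vram_gb(size):.1f} GB")
```

With the assumed 2 GB overhead, this gives 5.5, 8.5, 17.0, and 37.0 GB, each within about 1 GB of the figures quoted above. Long context windows push the overhead toward the top of the range, so treat the result as a floor rather than a guarantee.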

What GPU do I need for 7B models like Llama 3 8B or Mistral 7B?

Any GPU with 8 GB or more of VRAM handles 7B models at Q4 quantization. The most cost-effective options are the NVIDIA GTX 1070 8 GB ($100 to $150 used) for the absolute minimum, the RTX 3060 12 GB ($180 to $250 used) for the best balance of price and capability, and the RTX 4060 Ti 16 GB ($450 new) for the newest architecture with extra VRAM headroom. Even integrated GPUs on Apple Silicon Macs with 16 GB or more of unified memory handle 7B models, though at lower speed.

What GPU do I need for 13B models like Llama 2 13B?

13B models at Q4 need about 8 GB of VRAM, so a 12 GB card like the RTX 3060 handles them well. At Q8 quantization (higher quality), 13B models need about 14 to 16 GB, requiring a 16 GB card (RTX 4060 Ti 16 GB) or a 24 GB card (RTX 3090 or RTX 4090). For Q8, the RTX 3090 at $700 used is the recommended choice: a 16 GB card is at its limit, while 24 GB leaves headroom for KV-cache, and the 3090's fast memory bandwidth gives comfortable inference speed.

What GPU do I need for 30B to 34B models like Qwen 2.5 32B?

30B to 34B models at Q4 need about 17 to 20 GB of VRAM. A 24 GB GPU (RTX 3090 or RTX 4090) handles these with room for KV-cache. At Q8, these models need 32 to 36 GB, which sits at or just beyond the limit of the RTX 5090 (32 GB) and otherwise requires multi-GPU configurations. The RTX 4090 with 24 GB is the practical choice for Q4 inference on 30B models, delivering 10 to 18 tokens per second.

What GPU do I need for 70B models like Llama 3 70B?

70B models at Q4 need about 38 to 42 GB of VRAM. No single consumer GPU provides this. Options include dual RTX 3090s ($1,500 used, 48 GB combined), the NVIDIA A100 80 GB ($10,000 used), or an Apple Mac Studio with M2 Ultra and 192 GB unified memory ($4,000 to $7,000). For budget-conscious users, CPU offloading with a single RTX 4090 and 128 GB of system RAM can run 70B Q4 models at 4 to 8 tokens per second, which is usable for personal interaction but too slow for multi-user serving.
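To make the CPU offloading option concrete, the sketch below estimates how many transformer layers fit on the GPU when the rest must stay in system RAM. It rests on stated assumptions: the helper gpu_layer_split is hypothetical, every layer is treated as the same size, 3 GB of VRAM is held back for KV-cache and runtime, and the example uses 80 layers, which is typical of Llama-style 70B models but should be checked against the model you actually load.

```python
def gpu_layer_split(model_gb, n_layers, vram_gb, reserve_gb=3.0):
    """Return (layers_on_gpu, layers_on_cpu) for partial offloading.

    Assumes all layers are equally sized and that reserve_gb of VRAM
    is kept free for KV-cache and runtime. Both are simplifications.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    on_gpu = min(n_layers, int(usable_gb // per_layer_gb))
    return on_gpu, n_layers - on_gpu


if __name__ == "__main__":
    # 70B at Q4: about 35 GB of weights, assumed 80 layers
    print(gpu_layer_split(35, 80, vram_gb=24))  # single 24 GB card
    print(gpu_layer_split(35, 80, vram_gb=48))  # dual 24 GB cards
```

Under these assumptions, a single 24 GB card holds 48 of the 80 layers and leaves 32 on the CPU, while 48 GB holds all of them. Runtimes that support partial offloading generally let you set the GPU layer count directly, so a number like this is a starting point to tune from.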

NVIDIA vs AMD vs Apple Silicon

NVIDIA is the default recommendation for AI GPUs. The CUDA software ecosystem, Tensor Core acceleration, and near-universal framework compatibility make NVIDIA GPUs work out of the box with every major AI tool. If you want the simplest setup experience and broadest compatibility, choose NVIDIA.

AMD GPUs offer more VRAM per dollar in some cases (the RX 7900 XTX provides 24 GB for $600 to $700 versus $1,600 for the RTX 4090 with the same VRAM), but the ROCm software ecosystem requires more setup effort and has occasional compatibility gaps with specific frameworks and libraries. AMD is a good choice for experienced Linux users who value hardware cost savings.

Apple Silicon Macs use unified memory where system RAM serves as VRAM. An M4 Max MacBook Pro with 128 GB of unified memory can run 70B models that would require expensive discrete GPUs on PC. The trade-off is lower inference speed: Apple Silicon produces roughly one-third to one-fifth the tokens per second of an equivalent NVIDIA GPU due to lower memory bandwidth. Choose Apple Silicon when you need to run very large models on a single machine with minimal setup.
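The link between memory bandwidth and tokens per second can be sketched with a common back-of-the-envelope bound: generating one token requires reading roughly all of the model weights once, so single-stream decode speed cannot exceed bandwidth divided by model size. The function name and the bandwidth figures below are hypothetical placeholders, not specifications of any particular GPU or Mac.

```python
def decode_tokens_per_second_ceiling(bandwidth_gb_s, model_gb):
    """Upper bound on single-stream decode speed when memory-bound.

    Each generated token reads roughly all model weights once, so
    throughput cannot exceed bandwidth / model size. Real speeds are
    lower because of compute, KV-cache reads, and software overhead.
    """
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    model_gb = 20  # e.g. a 30B-class model at Q4, including overhead
    # Placeholder bandwidths chosen only to illustrate the ratio
    for label, bandwidth in (("device A", 1000), ("device B", 250)):
        ceiling = decode_tokens_per_second_ceiling(bandwidth, model_gb)
        print(f"{label}: at most {ceiling:.1f} tokens/s")
```

The bound scales linearly: a device with one quarter of the bandwidth has one quarter of the ceiling for the same model. To apply it, substitute the published memory bandwidth of the hardware you are comparing.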

Budget-Based Recommendations

Under $200: Used GTX 1070 8 GB ($120) or GTX 1080 Ti 11 GB ($180). Both handle 7B models at Q4. The 1080 Ti is worth the extra $60 because its additional 3 GB of VRAM also fits 7B models at Q8, which is too tight for an 8 GB card.

$200 to $500: RTX 3060 12 GB ($200 used) or RTX 4060 Ti 16 GB ($450 new). Both handle 7B to 13B models comfortably. The 3060 offers better VRAM value; the 4060 Ti offers newer Tensor Cores and lower power consumption.

$500 to $1,000: RTX 3090 24 GB ($700 to $800 used). This is the single best value GPU for AI. The 24 GB of VRAM handles 13B Q8, 30B Q4, and even 70B Q4 with partial offloading. Nothing else in this price range comes close.

$1,000 to $2,000: RTX 4090 24 GB ($1,600 new) or RTX 5090 32 GB ($2,000 new). The 4090 delivers 1.5x the inference speed of the 3090 with the same VRAM. The 5090 adds 8 GB of VRAM (32 GB total), which brings 30B Q8 models within reach of a single card at the low end of their 32 to 36 GB requirement.

Over $2,000: Dual RTX 3090s ($1,500 used, 48 GB), A100 80 GB ($10,000 used), or cloud instances for occasional high-end workloads. At this level, evaluate whether building or renting makes more financial sense for your usage patterns.
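The tiers above amount to a lookup: find the cheapest option whose VRAM covers the model. The sketch below encodes that lookup using the approximate prices quoted in this guide (the low end where a range is given). The names GPUS and cheapest_gpu are hypothetical, and the prices will drift, so treat the table as illustrative data rather than a price list.

```python
# (name, VRAM in GB, approximate price in USD) from the tiers above
GPUS = [
    ("GTX 1070 8 GB", 8, 120),
    ("GTX 1080 Ti 11 GB", 11, 180),
    ("RTX 3060 12 GB", 12, 200),
    ("RTX 4060 Ti 16 GB", 16, 450),
    ("RTX 3090 24 GB", 24, 700),
    ("Dual RTX 3090", 48, 1500),
    ("RTX 4090 24 GB", 24, 1600),
    ("RTX 5090 32 GB", 32, 2000),
    ("A100 80 GB", 80, 10000),
]


def cheapest_gpu(required_vram_gb, budget_usd=None):
    """Return the cheapest (name, vram, price) entry that fits, or None."""
    candidates = [
        gpu for gpu in GPUS
        if gpu[1] >= required_vram_gb
        and (budget_usd is None or gpu[2] <= budget_usd)
    ]
    return min(candidates, key=lambda gpu: gpu[2], default=None)


if __name__ == "__main__":
    for label, need in (("7B Q4", 5), ("30B Q4", 18), ("70B Q4", 38)):
        print(label, "->", cheapest_gpu(need))
```

Passing a budget narrows the search, for example cheapest_gpu(38, budget_usd=1000) returns None because no option in the table offers 38 GB for under $1,000.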

Do Not Overbuy

Another common mistake in GPU selection is buying more GPU than you need based on aspirational rather than actual workloads. If you currently use 7B models and might someday want to try 70B models, buy for your current workload and rent cloud instances for the occasional experiment. An RTX 3060 12 GB at $200 handles your daily 7B workload, and a cloud A100 instance at $1.50 per hour covers the rare 70B experiment for far less than the cost of hardware that sits underutilized.

Start with the GPU that matches your current needs, validate that local inference meets your requirements, and upgrade when your actual workload demands it. The used GPU market makes upgrading straightforward: sell your current card and put the proceeds toward the next tier.
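The buy-versus-rent trade-off in this section can be estimated with a simple break-even calculation. This is a rough sketch under stated assumptions: the function name break_even_hours is hypothetical, and the calculation ignores electricity, resale value, and the rest of the machine, all of which shift the result.

```python
def break_even_hours(hardware_cost_usd, cloud_rate_usd_per_hour):
    """Hours of cloud rental that cost the same as buying the hardware.

    Ignores electricity, resale value, and the rest of the system.
    """
    if cloud_rate_usd_per_hour <= 0:
        raise ValueError("cloud rate must be positive")
    return hardware_cost_usd / cloud_rate_usd_per_hour


if __name__ == "__main__":
    rate = 1.50  # cloud A100 price per hour quoted in this guide
    for label, cost in (("Dual RTX 3090", 1500), ("A100 80 GB", 10000)):
        hours = break_even_hours(cost, rate)
        print(f"{label}: break-even after about {hours:.0f} hours")
```

At the prices quoted in this guide, a $1,500 dual RTX 3090 setup matches 1,000 hours of cloud A100 time. If your 70B usage is a few hours a month, renting stays cheaper for years.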

Key Takeaway

Match your GPU VRAM to your target model size: 8 GB minimum for 7B models, 12 to 16 GB for 13B models, 24 GB for 30B models, and 40 GB or more for 70B models. The NVIDIA RTX 3090 24 GB at $700 used is the best overall value for most AI workloads.
