GPU Requirements for AI Workloads
Why GPUs Dominate AI Compute
Neural networks are fundamentally built on matrix multiplication, an operation that benefits enormously from parallel processing. A modern GPU contains thousands of CUDA cores (NVIDIA) or stream processors (AMD) that can perform these operations simultaneously. The NVIDIA RTX 4090 has 16,384 CUDA cores compared to the 24 cores in a high-end desktop CPU. This massive parallelism is why a GPU can be 10x to 100x faster than a CPU for inference on the same model.
Beyond raw core count, GPUs include specialized hardware for AI operations. NVIDIA Tensor Cores accelerate the mixed-precision matrix operations used heavily in transformer models. The RTX 40-series includes fourth-generation Tensor Cores that support FP8 operations, roughly doubling peak throughput compared to FP16. AMD's CDNA architecture includes Matrix Cores with similar capabilities. Because of these accelerators, GPU generation matters as much as raw specifications.
VRAM: The Defining Specification
VRAM (Video Random Access Memory) is the GPU equivalent of system RAM, and it is the single most important factor in determining which AI models you can run. The model weights must fit in VRAM (or be split across multiple GPUs) for inference to proceed at GPU speed. If weights spill to system RAM, the portions running on CPU are dramatically slower, reducing overall throughput by 5x to 20x.
The VRAM formula depends on the model precision format. At FP16 (16-bit), each parameter consumes 2 bytes. A 7B parameter model at FP16 needs 14 GB of VRAM just for the weights, plus 2 to 4 GB for the KV-cache and runtime overhead, totaling about 16 to 18 GB. At Q8 quantization (8-bit), the same model needs about 7 GB for weights plus overhead, fitting comfortably in a 12 GB card. At Q4 (4-bit), weights drop to roughly 3.5 GB, running on an 8 GB card with room to spare.
For larger models, the math scales linearly. A 70B model at FP16 needs 140 GB of VRAM for the weights alone, requiring at least two A100 80 GB cards. At Q4, it needs about 35 GB, which fits on a single A100 40 GB with little headroom for the KV-cache, or across two 24 GB consumer cards. A 13B model at Q8 needs about 13 GB of VRAM, fitting on a single RTX 4060 Ti 16 GB or RTX 3090 24 GB.
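The arithmetic above can be sketched in a few lines of Python. This is a rough estimator, not a measurement tool: the function names are illustrative, the bytes-per-parameter values ignore quantization metadata, and the 2 to 4 GB overhead range is the figure quoted above for a 7B model (longer contexts and larger models need more).

```python
# Approximate bytes per parameter for the precision formats discussed above.
# Real quantized files carry extra metadata, so these are lower bounds.
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}


def weight_gb(params_billions: float, precision: str) -> float:
    """Weight footprint in GB: billions of parameters times bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]


def total_vram_gb(params_billions: float, precision: str,
                  overhead_gb: tuple = (2.0, 4.0)) -> tuple:
    """Return a (low, high) VRAM estimate in GB, weights plus assumed overhead."""
    weights = weight_gb(params_billions, precision)
    low, high = overhead_gb
    return (weights + low, weights + high)


if __name__ == "__main__":
    for size in (7, 13, 70):
        for precision in ("fp16", "q8", "q4"):
            low, high = total_vram_gb(size, precision)
            print(f"{size}B @ {precision}: "
                  f"{weight_gb(size, precision):.1f} GB weights, "
                  f"{low:.1f} to {high:.1f} GB total")
```

Treat the output as a first filter when shortlisting cards, then confirm against the actual model file size and the context length you plan to use.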
Memory Bandwidth and Inference Speed
During inference, the GPU reads model weights from VRAM for each token generated. The speed of this read operation, measured as memory bandwidth, directly determines tokens-per-second performance for large models. The relationship is roughly: maximum tokens per second equals memory bandwidth divided by model size in memory.
The RTX 4090 provides 1,008 GB/s of memory bandwidth. For a 7B model at Q4 (about 4 GB in memory), theoretical maximum throughput is roughly 250 tokens per second. Real-world performance is lower due to overhead, but 80 to 120 tokens per second is achievable. The H100 with 3,350 GB/s bandwidth pushes this to 300+ tokens per second for the same model.
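For larger models, bandwidth matters even more. A 70B model at Q4 (about 35 GB) does not fit in a single RTX 4090's 24 GB, so it must be split across two cards or partially offloaded to system RAM. At RTX 4090-class bandwidth, the formula gives a ceiling of roughly 29 tokens per second (1,008 / 35), and real-world throughput lands below that, which is still usable for interactive work. On an H100, the ceiling for the same model rises to roughly 95 tokens per second (3,350 / 35), leaving far more headroom for concurrent requests.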
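As a minimal sketch of the bandwidth rule of thumb, the function below divides memory bandwidth by model size. The function name is illustrative, and the result is a theoretical ceiling only: it ignores compute limits, KV-cache reads, batching, and whether the model fits in VRAM at all.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on single-stream decode speed.

    Assumes every generated token requires reading all weights from VRAM once.
    """
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    # Bandwidth figures in GB/s and model sizes in GB, as quoted in the text.
    cases = [
        ("RTX 4090, 7B at Q4", 1008, 4),
        ("H100, 7B at Q4", 3350, 4),
        ("RTX 4090-class bandwidth, 70B at Q4", 1008, 35),
        ("H100, 70B at Q4", 3350, 35),
    ]
    for label, bandwidth, size in cases:
        ceiling = max_tokens_per_second(bandwidth, size)
        print(f"{label}: ceiling of about {ceiling:.0f} tokens/s")
```

Measured throughput typically lands well below the ceiling, as the 80 to 120 tokens per second figure for the RTX 4090 illustrates.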
Consumer GPU Options for AI
NVIDIA dominates the consumer AI GPU space. The RTX 5090 (32 GB GDDR7, 1,792 GB/s bandwidth) is the current flagship as of 2025-2026, offering the most VRAM in a consumer card. The RTX 4090 (24 GB GDDR6X, 1,008 GB/s) remains highly capable and increasingly available at reduced prices. The RTX 3090 (24 GB GDDR6X, 936 GB/s) is a popular used option at $600 to $800.
For budget builds, the RTX 3060 12 GB ($200 to $300 used) and RTX 4060 Ti 16 GB ($500 new) offer usable VRAM for 7B to 13B models at quantized precision. The RTX 3060 12 GB is particularly notable because 12 GB of VRAM is enough for most 7B models at Q8, making it the entry point for GPU-accelerated AI.
AMD consumer GPUs (RX 7900 XTX with 24 GB, RX 7900 XT with 20 GB) offer competitive VRAM amounts but trail NVIDIA in software ecosystem support. ROCm compatibility with AI frameworks has improved significantly but still lags behind CUDA in terms of plug-and-play reliability and community resources.
Professional and Data Center GPUs
For workloads that exceed consumer GPU capabilities, professional cards offer higher VRAM, faster memory (HBM vs GDDR), and features like NVLink for multi-GPU interconnect. The NVIDIA A100 is available in a 40 GB variant (HBM2, about 1.6 TB/s) and an 80 GB variant (HBM2e, about 2 TB/s). The H100 offers 80 GB of HBM3 at 3.35 TB/s. The newer H200 pushes to 141 GB of HBM3e at 4.8 TB/s.
AMD competes at the professional level with the Instinct MI300X, featuring 192 GB of HBM3 memory and 5.3 TB/s bandwidth. This is among the largest single-GPU memory capacities available, enough to run 70B parameter models at FP16 (about 140 GB of weights) without quantization on a single card. The MI300X has gained adoption among cloud providers and research institutions.
Cost is the major factor with professional GPUs. A new H100 costs $25,000 to $35,000. A used A100 80 GB runs $8,000 to $12,000. An MI300X is approximately $15,000 to $20,000. These prices make sense for continuous production workloads but are difficult to justify for personal or experimental use.
Multi-GPU Configurations
When a single GPU lacks sufficient VRAM, multi-GPU setups allow models to be split across cards. Two RTX 3090s provide 48 GB of combined VRAM for roughly $1,500 to $1,800 in used hardware, enough for 70B models at Q4 quantization (about 35 GB of weights). The framework (vLLM, llama.cpp, or similar) handles the model splitting, usually with little more than a setting for how many GPUs to use.
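A quick way to sanity-check a multi-GPU plan is to compare the weight footprint against the combined usable VRAM. The sketch below is a simplification: the function name and the 2 GB per-card reserve for KV-cache and runtime buffers are assumptions, and real frameworks may not split layers perfectly evenly.

```python
def fits_across_gpus(model_gb: float, vram_per_gpu_gb: float, n_gpus: int,
                     reserve_per_gpu_gb: float = 2.0) -> bool:
    """Return True if the weights fit in the combined usable VRAM.

    reserve_per_gpu_gb is an assumed allowance held back on each card for
    the KV-cache and runtime overhead; adjust it for your context length.
    """
    usable_gb = n_gpus * (vram_per_gpu_gb - reserve_per_gpu_gb)
    return model_gb <= usable_gb


if __name__ == "__main__":
    # 70B at Q4 (about 35 GB) and at FP16 (140 GB) on 24 GB cards.
    print("70B Q4 on one 24 GB card:", fits_across_gpus(35, 24, 1))
    print("70B Q4 on two 24 GB cards:", fits_across_gpus(35, 24, 2))
    print("70B FP16 on two 24 GB cards:", fits_across_gpus(140, 24, 2))
```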
Multi-GPU adds complexity in several areas. Power requirements roughly double: two RTX 4090s at their 450-watt rating can draw 900 watts for the GPUs alone, before the CPU and other components. The motherboard and CPU need enough PCIe lanes for full bandwidth to each card. Physical space and cooling become challenging, as two triple-slot GPUs generate significant heat. Software configuration requires consistent CUDA device ordering, and professional cards may need NVLink for optimal inter-GPU communication.
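For the device-ordering point, one common approach is to set the CUDA environment variables `CUDA_DEVICE_ORDER` and `CUDA_VISIBLE_DEVICES` before any CUDA-using library is imported. The helper below is a minimal sketch; the name `pin_gpus` and the choice of indices 0 and 1 are illustrative, and the right indices depend on your system.

```python
import os


def pin_gpus(indices) -> str:
    """Make GPU numbering follow PCI bus order and expose only the chosen cards.

    Call this before importing torch, vLLM, or any other library that
    initializes CUDA; changes made after initialization have no effect.
    """
    visible = ",".join(str(i) for i in indices)
    # Without this, CUDA orders devices fastest-first, which may not match
    # the numbering shown by nvidia-smi.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = visible
    return visible


if __name__ == "__main__":
    print("Visible GPUs:", pin_gpus([0, 1]))
```

Setting the same variables in the shell that launches the process works equally well and avoids import-order mistakes.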
For consumer multi-GPU setups, communication between cards goes over the PCIe bus, which is slower than NVLink but adequate for inference. The performance overhead of PCIe-based multi-GPU inference is typically 10 to 20 percent compared to a single GPU with equivalent total VRAM. For training workloads with frequent gradient synchronization, this overhead increases significantly, making NVLink-equipped professional cards worth the premium.
Choose your GPU based on VRAM first, bandwidth second, and compute capability third. For most AI agent workloads, a single RTX 4090 (24 GB) or RTX 5090 (32 GB) handles 7B to 30B models comfortably at Q4 quantization. For 70B models, plan for 40+ GB via professional cards or multi-GPU consumer setups.
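The "VRAM first, bandwidth second" rule can be sketched as a filter followed by a sort. The catalog below uses only the VRAM and bandwidth figures quoted in this article; the function name, the flat 4 GB overhead, and the omission of price and compute capability are simplifying assumptions.

```python
# name: (VRAM in GB, memory bandwidth in GB/s), as quoted in this article.
CARDS = {
    "RTX 3090": (24, 936),
    "RTX 4090": (24, 1008),
    "RTX 5090": (32, 1792),
    "A100 80 GB": (80, 2000),
    "H100": (80, 3350),
    "H200": (141, 4800),
    "MI300X": (192, 5300),
}

BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}


def candidates(params_billions: float, precision: str,
               overhead_gb: float = 4.0) -> list:
    """Single cards that can hold the model, fastest memory first."""
    required_gb = params_billions * BYTES_PER_PARAM[precision] + overhead_gb
    fitting = [name for name, (vram, _) in CARDS.items() if vram >= required_gb]
    # VRAM decides eligibility; bandwidth decides the ranking.
    return sorted(fitting, key=lambda name: CARDS[name][1], reverse=True)


if __name__ == "__main__":
    for size, precision in [(7, "q8"), (30, "q4"), (70, "q4"), (70, "fp16")]:
        print(f"{size}B @ {precision}:", candidates(size, precision))
```

An empty result means no single card in the catalog is large enough, which is the cue to consider a multi-GPU setup or a lower precision.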