How to Spec a Server for AI Agents
The process of selecting AI server components is sequential because each choice constrains the next. GPU selection depends on VRAM requirements. CPU selection depends on how many GPUs you use and whether you need CPU offloading. RAM depends on GPU VRAM and offloading needs. Storage and power supply depend on everything above. Following these steps in order prevents the common mistake of buying components that do not work well together.
Step 1: Define Your Model Requirements
Start by answering three questions. First, what is the largest model you need to run? If you plan to use 7B parameter models exclusively, your hardware needs are modest. If you need 70B models, you are in a different category entirely. Be specific about model families: Llama 3 70B, Qwen 2.5 72B, and Mixtral 8x7B have different memory footprints despite similar parameter counts.
Second, what precision format will you use? FP16 (full precision) requires 2 bytes per parameter. Q8 (8-bit quantization) requires roughly 1 byte per parameter. Q4 (4-bit quantization) requires approximately 0.5 bytes per parameter. The quality difference between FP16 and Q8 is negligible for most applications. Q4 shows slight quality degradation on reasoning tasks but remains excellent for conversational and coding use cases.
Third, how many concurrent users or agents will access the model simultaneously? A single user needs one inference stream. Multiple concurrent users require the model to handle batched requests, which increases KV-cache memory requirements proportionally. Each concurrent user adds roughly 0.5 to 2 GB of KV-cache depending on context length and model architecture.
Step 2: Calculate Your VRAM Requirement
Use this formula: VRAM needed = (parameters in billions x bytes per parameter at target precision) + (overhead for KV-cache and runtime). The overhead is typically 20 to 30 percent of the model weight size for a single user, increasing with concurrent users.
Examples: A 7B model at Q4 needs approximately 3.5 GB for weights + 1 GB overhead = 4.5 GB minimum VRAM. A 7B model at Q8 needs 7 GB + 1.5 GB = 8.5 GB. A 13B model at Q4 needs 6.5 GB + 1.5 GB = 8 GB. A 70B model at Q4 needs 35 GB + 7 GB = 42 GB. A 70B model at FP16 needs 140 GB + 20 GB = 160 GB.
For multi-user serving, add approximately 1 GB per concurrent user for 7B models and 2 GB per user for 70B models to account for KV-cache. A 7B Q4 model serving 4 users simultaneously needs roughly 4.5 + 4 = 8.5 GB of VRAM.
Write down your calculated VRAM number. This is the single most important specification that drives all subsequent decisions.
Step 3: Select Your GPU
Match your VRAM requirement to available GPUs. The options cluster into three tiers:
Consumer GPUs (8 to 32 GB VRAM): RTX 3060 12 GB ($180 used), RTX 3090 24 GB ($700 used), RTX 4090 24 GB ($1,600 new), RTX 5090 32 GB ($2,000 new). Choose the least expensive option that meets your VRAM requirement with at least 2 GB of headroom.
Multi-GPU consumer (24 to 64 GB combined): Two RTX 3090s provide 48 GB for approximately $1,500 used. This requires a motherboard with two x16 slots and a PSU rated at 1000W or higher. Multi-GPU adds complexity but is cost-effective for 40 to 48 GB VRAM needs.
Professional GPUs (40 to 192 GB VRAM): A100 40 GB ($5,000 used), A100 80 GB ($10,000 used), H100 80 GB ($25,000+ new), MI300X 192 GB ($15,000+ new). These make sense only when the VRAM requirement exceeds what consumer multi-GPU can provide, or when NVLink interconnect is needed for training workloads.
Step 4: Match Your CPU and RAM
For single-GPU inference serving, a modern 6 to 8 core processor is sufficient. The AMD Ryzen 5 5600 or Ryzen 7 5700X on AM4, or Ryzen 5 7600X or Ryzen 7 7700X on AM5, all work well. Choose AM5 if you want DDR5 memory bandwidth. Choose AM4 for lower total platform cost with DDR4.
For multi-GPU configurations, ensure the CPU and platform provide enough PCIe lanes. Two GPUs each need an x16 slot operating at least at x8 electrical bandwidth. Consumer AMD AM5 platforms support this with a Ryzen 7 or higher. For four or more GPUs, move to AMD Threadripper or EPYC platforms with 64 to 128 PCIe lanes.
System RAM should be at least twice your total GPU VRAM. For a 24 GB GPU, use 48 GB or preferably 64 GB of system RAM. For dual 24 GB GPUs (48 GB total), use 96 GB or 128 GB. If you plan to use CPU offloading for oversized models, add the expected offloaded model weight to this calculation. Running a 70B Q4 model (35 GB) with 24 GB of VRAM means offloading roughly 11 GB to system RAM, so plan for at least 64 GB.
Step 5: Choose Storage and Power Supply
Storage: Allocate at least 1 TB of NVMe SSD for the primary drive. This holds the OS, Docker images, inference frameworks, and your most-used models. If you work with multiple model families and quantization variants, add a 2 TB or larger secondary drive. NVMe is preferred for the primary drive, SATA SSD is acceptable for secondary storage.
Power supply: Add up the maximum power draw of all components. GPU TDP (350W for RTX 3090, 450W for RTX 4090) plus CPU TDP (65 to 105W for desktop processors) plus 50 to 100W for drives, fans, and motherboard. Multiply the total by 1.2 to get your minimum PSU wattage. For a single RTX 4090 system, this is roughly (450 + 105 + 75) x 1.2 = 756W, rounding up to an 850W unit. For dual RTX 3090s, it is (700 + 105 + 100) x 1.2 = 1,086W, requiring a 1200W PSU.
Choose 80+ Gold or better efficiency rating for any PSU above 750W. The efficiency difference between Bronze and Gold saves $5 to $15 per month in electricity at high sustained loads, which adds up over the life of the server.
Step 6: Validate the Configuration
Before purchasing, verify three things. First, physical compatibility: check that the GPU fits in the case (measure GPU length versus case clearance), the motherboard has the right socket and slots, and the PSU has the required PCIe power connectors (RTX 4090 models use a 16-pin 12VHPWR connector or require an adapter from dual 8-pin connectors).
Second, memory compatibility: verify that the motherboard supports the speed and capacity of your chosen RAM modules. DDR5-5600 may require a BIOS update on some AM5 boards. DDR4-3200 is universally supported on AM4 B550 and X570 boards.
Third, cooling sufficiency: a high-TDP GPU in a case with poor airflow will thermal throttle, reducing inference performance. At minimum, use a case with two intake fans and one exhaust fan. For RTX 3090 or 4090 builds, a case with three intake fans and two exhaust fans is recommended. CPU cooling should be a tower air cooler (such as the Thermalright Peerless Assassin at $35) or a 240mm AIO liquid cooler.
Start with your model requirements, calculate VRAM needs, and work outward from the GPU to CPU, RAM, storage, and power. Every component should be sized to support the GPU without creating bottlenecks. Validate physical fit, power budget, and cooling before purchasing.