Hardware Guide for Self-Hosted LLMs
The Memory Rule
The fundamental constraint for LLM inference is memory, not compute. The entire model (or at least the layers being processed) must fit in memory before a single token can be generated. The formula is straightforward: model parameters multiplied by bytes per parameter at your chosen precision, plus 10-20% overhead for the KV cache and runtime.
At full 16-bit precision, each parameter needs 2 bytes. At 8-bit, 1 byte. At 4-bit, 0.5 bytes. So a 7B parameter model needs roughly 14GB at FP16, 7GB at INT8, or 3.5GB at INT4. A 70B model needs 140GB, 70GB, or 35GB respectively. These numbers give you the minimum memory requirement; add 10-20% for practical operation.
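The arithmetic above can be sketched as a small helper. This is a rough estimator only: the function name and the 15% default overhead (a midpoint of the 10-20% range) are assumptions made for illustration, and real usage depends on context length and runtime.

```python
# Bytes per parameter at each precision, as given in the text.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_memory_gb(params_billions: float, precision: str, overhead: float = 0.15) -> float:
    """Estimate memory in GB: parameters x bytes per parameter, plus overhead.

    overhead is a fraction covering KV cache and runtime; 0.15 is an
    assumed midpoint of the 10-20% range, not a measured value.
    """
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb * (1.0 + overhead)


# Print weights-only and with-overhead estimates for the two sizes discussed.
for size in (7, 70):
    for precision in BYTES_PER_PARAM:
        weights = estimate_memory_gb(size, precision, overhead=0.0)
        total = estimate_memory_gb(size, precision)
        print(f"{size}B {precision}: {weights:.1f}GB weights, ~{total:.1f}GB with overhead")
```

Passing overhead=0.0 reproduces the weights-only figures quoted above (14GB, 7GB, 3.5GB for a 7B model).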
Tier 1: Existing Hardware (No Purchase Required)
Most developers already own hardware capable of running useful LLMs. Any modern laptop or desktop with 16GB of RAM can run 7-8B parameter models using CPU inference via Ollama. Performance is slower than GPU inference (typically 5-15 tokens per second), but adequate for personal use and development. Models like Llama 3.2 3B, Phi-3 Mini, and Gemma 2B run even on 8GB machines.
If your machine has an NVIDIA GPU (6-8GB of VRAM is common in gaming laptops and mid-range desktops), you can accelerate inference significantly. An RTX 3060 with 12GB of VRAM handles 7B Q4 models at 30-50 tokens per second, roughly 2-10x faster than the CPU figures above.
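To see where your own machine falls in these tokens-per-second ranges, you can time a generation against a local Ollama server. The sketch below assumes Ollama is running on its default local port (11434) and that the model you name has already been pulled; the helper names are hypothetical. It relies on the eval_count and eval_duration (nanoseconds) fields that Ollama returns from a non-streaming /api/generate call.

```python
import json
import urllib.request

# Assumed default endpoint of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Generation speed from eval_count (tokens) and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def measure(model: str, prompt: str) -> float:
    """Send one non-streaming generation request and return tokens per second."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))
```

For example, measure("llama3.2:3b", "Explain what a KV cache is.") returns the speed of a single run; the model tag here is an assumption, and averaging several runs gives a steadier number.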
Tier 2: Consumer GPU ($400-2,000)
NVIDIA RTX 4060 Ti 16GB (~$400-500): The entry point for serious local LLM work. 16GB of VRAM handles 7-13B models at various quantization levels. Good for personal coding assistants and local chatbots.
NVIDIA RTX 4090 24GB (~$1,600-2,000): The best consumer GPU for LLMs. 24GB of VRAM fits 13B models at 8-bit precision or 30B models at Q4 quantization (a 13B model at full 16-bit precision needs about 26GB for weights alone, which exceeds 24GB). Inference speed is excellent, with 70-100+ tokens per second for smaller models. This is the recommended GPU for developers and small teams who want a single-GPU solution.
AMD GPUs: AMD Radeon RX 7900 XTX (24GB) offers comparable VRAM at lower cost, but software support lags NVIDIA. ROCm support in llama.cpp and Ollama has improved substantially in 2026, making AMD a viable option, though NVIDIA remains the path of least resistance.
Tier 3: Apple Silicon ($1,500-8,000)
Apple Silicon Macs offer a unique advantage for LLM inference: unified memory. Because the CPU and GPU share the same memory pool, a Mac with 64GB of unified memory can load models that would require a 64GB GPU on other platforms. No consumer NVIDIA GPU offers that much VRAM.
MacBook Pro M3 Pro 36GB (~$2,500): Runs 30B Q4 models comfortably. Good for development with medium-sized models.
Mac Studio M2 Ultra 192GB (~$6,000-8,000): The largest M2-generation configuration. Can run 70B full-precision models (about 140GB of weights) or Llama 4 Scout (109B total parameters) at 8-bit quantization with room for context. Inference speed on Apple Silicon is lower than dedicated NVIDIA GPUs token-for-token, but the memory capacity enables model sizes that would otherwise require server hardware.
Mac Mini M4 Pro 48GB (~$1,800): An excellent value option. 48GB of unified memory handles 30B models at 8-bit precision or 70B at Q4 with little headroom (a 30B model at full 16-bit precision needs about 60GB, which exceeds 48GB). The compact form factor makes it practical as a dedicated inference server.
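The memory rule can also be run in reverse to ask how large a model a given amount of unified memory can hold. The sketch below is illustrative only: the function name, the 15% overhead default, and the usable_fraction parameter (a hypothetical knob for memory reserved by the operating system and other applications) are assumptions, not measured values.

```python
# Bytes per parameter at each precision, as given in the memory rule.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def max_params_billions(memory_gb: float, precision: str,
                        overhead: float = 0.15, usable_fraction: float = 1.0) -> float:
    """Largest parameter count (in billions) that fits in the given memory.

    usable_fraction is a hypothetical allowance for memory the OS and other
    applications keep for themselves; 1.0 means all memory is available.
    """
    budget_gb = memory_gb * usable_fraction
    return budget_gb / (BYTES_PER_PARAM[precision] * (1.0 + overhead))


# Unified memory sizes of the three Macs listed above.
for memory_gb in (36, 48, 192):
    sizes = {p: round(max_params_billions(memory_gb, p), 1) for p in BYTES_PER_PARAM}
    print(f"{memory_gb}GB unified memory, max billions of parameters: {sizes}")
```

Lowering usable_fraction (for example to 0.75) gives a more conservative estimate for a machine that is also running a desktop session.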
Tier 4: Server GPUs ($5,000-50,000+)
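NVIDIA A100 80GB: The previous generation workhorse for production LLM serving. A single A100 runs 70B models comfortably at 4-bit quantization (at 8-bit, about 70GB of weights leaves almost no room for the KV cache). Two A100s with NVLink (160GB total) hold a 70B model at full precision with modest headroom and handle most 100B+ parameter models when quantized. Available used at $5,000-8,000 per GPU, making it increasingly attractive on the secondary market.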
NVIDIA H100 80GB: The current standard for production deployments. Roughly 2-3x faster than A100 for LLM inference, with the same 80GB of VRAM. New pricing starts around $25,000-30,000 per GPU, with cloud rental available at $2-4 per GPU-hour.
NVIDIA H200 141GB: An H100 variant with 141GB of HBM3e memory. The additional memory is valuable for running larger models without quantization or for handling very long context windows. Pricing is higher than H100 but the extra memory can eliminate the need for multi-GPU configurations.
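The H100 purchase and rental prices quoted above imply a break-even point, which the following sketch computes. It is deliberately simplistic: it ignores power, cooling, hosting, idle time, and resale value, and the function name and the 730 hours-per-month constant are assumptions for illustration.

```python
HOURS_PER_MONTH = 730  # assumed average month length in hours


def break_even_hours(purchase_price: float, rental_per_hour: float) -> float:
    """GPU-hours of rental that cost as much as buying the GPU outright."""
    return purchase_price / rental_per_hour


# Best and worst cases for buying, using the price ranges quoted in the text.
for price, rate in ((25_000, 4.0), (30_000, 2.0)):
    hours = break_even_hours(price, rate)
    months = hours / HOURS_PER_MONTH
    print(f"${price:,} purchase vs ${rate:.2f}/hour: {hours:,.0f} hours "
          f"(~{months:.1f} months of continuous use)")
```

If the expected utilization is well below continuous use, the break-even point moves out proportionally, which tends to favor rental for intermittent workloads.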
Multi-GPU Configurations
For models that exceed a single GPU's memory capacity, tensor parallelism splits the model across multiple GPUs. vLLM and other production servers handle the sharding once you specify how many GPUs to use. Two H100 GPUs (160GB total) run Mistral Medium 3.5 (128B parameters) at 8-bit precision. Four H100s (320GB total) handle Llama 4 Maverick (400B total parameters) at 4-bit quantization with room for long context. At 16-bit precision these models need about 256GB and 800GB for weights alone.
The GPUs must be connected via a high-bandwidth interconnect (NVLink or NVSwitch) for efficient tensor parallelism. PCIe connections work but introduce significant inter-GPU communication overhead that reduces throughput.
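A rough way to size a multi-GPU setup is to apply the memory rule and divide by per-GPU memory. The sketch below does that and then rounds up to a power of two, on the assumption that tensor-parallel degrees are usually chosen that way so the degree divides the model's attention head count. The function name, the 15% overhead, and the 80GB default are illustrative assumptions.

```python
import math

# Bytes per parameter at each precision, as given in the memory rule.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def gpus_needed(params_billions: float, precision: str,
                gpu_memory_gb: float = 80.0, overhead: float = 0.15) -> int:
    """Estimate how many GPUs are needed to hold a model under tensor parallelism."""
    required_gb = params_billions * BYTES_PER_PARAM[precision] * (1.0 + overhead)
    minimum = math.ceil(required_gb / gpu_memory_gb)
    # Round up to the next power of two (assumption: typical tensor-parallel degrees).
    return 1 << (minimum - 1).bit_length()


# Model sizes mentioned in this guide, on 80GB GPUs.
for size, precision in ((70, "fp16"), (128, "int8"), (400, "int4")):
    print(f"{size}B at {precision}: {gpus_needed(size, precision)} x 80GB GPUs")
```

In vLLM, the resulting count is what you would pass as the tensor_parallel_size argument when loading the model.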
CPU-Only Inference
CPU inference is often overlooked but remains practical for certain use cases. Modern server CPUs with large amounts of DDR5 RAM (256-512GB) can run the largest models at modest speed. A server with 512GB of RAM can load and serve a full-precision 70B model without any GPU. Throughput is lower (5-20 tokens per second, with the upper end more realistic for smaller or quantized models, since CPU generation is limited by memory bandwidth), but for batch processing, offline analysis, or low-frequency applications, the economics can work since high-RAM servers are far cheaper than GPU servers.
Start with what you have. Any 16GB laptop runs useful small models. When you are ready to invest, the RTX 4090 (24GB) is the best consumer option, and Apple Silicon Macs offer unmatched memory capacity per dollar. For production workloads, H100 GPUs remain the standard.