Requirements for Self-Hosting AI Agents
GPU Requirements: The Critical Factor
GPU VRAM (video memory) is the single most important hardware specification for self-hosted AI. For full-speed GPU inference, a language model's weights must fit entirely in GPU memory, so the amount of VRAM you have directly determines which models you can run.
The general rule is 0.5 GB of VRAM per billion model parameters when using 4-bit quantization, which is the standard approach for self-hosted deployments. At full 16-bit precision, the requirement is four times higher, approximately 2 GB per billion parameters. Here is how common model sizes map to VRAM needs:
7B to 8B parameter models (Llama 3.1 8B, Qwen 2.5 7B, Mistral 7B): 4 to 5 GB VRAM at 4-bit quantization. These models fit comfortably on an 8 GB consumer GPU with room for the KV cache. They handle most general-purpose tasks well: document summarization, code generation, data extraction, and conversational AI.
13B to 14B parameter models (Qwen 2.5 14B, various fine-tunes): 7 to 8 GB VRAM at 4-bit. An 8 GB GPU is the bare minimum and leaves almost no room for the KV cache, so 12 GB or more is preferable for longer conversations. These models offer noticeably improved reasoning and instruction following over 7B models.
34B to 35B parameter models: 18 to 20 GB VRAM at 4-bit. These require a 24 GB GPU like the RTX 4090 or RTX A5000. They deliver substantial quality improvements, particularly in complex reasoning and nuanced writing tasks.
70B parameter models (Llama 3.3 70B, Qwen 2.5 72B): 35 to 40 GB VRAM at 4-bit. No single consumer GPU can run these. Options include dual consumer GPUs with model splitting, a single A100 80 GB, or an H100 80 GB. These models approach cloud API quality on most tasks.
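The rule of thumb above can be turned into a quick estimate. The following is a minimal sketch, not a sizing tool: the function names and the `overhead_gb` default (a flat 1 GB allowance for the KV cache and runtime buffers) are assumptions made for illustration, and real usage varies with context length and inference engine.

```python
def weights_gb(params_billion: float, bits: int = 4) -> float:
    """Approximate size of the model weights in GB.

    One billion parameters at 8 bits is roughly 1 GB, so the size
    scales as params * bits / 8.
    """
    return params_billion * bits / 8


def estimate_vram_gb(params_billion: float, bits: int = 4,
                     overhead_gb: float = 1.0) -> float:
    """Weights plus a flat allowance for KV cache and runtime buffers.

    overhead_gb is an illustrative assumption, not a measured value.
    """
    return weights_gb(params_billion, bits) + overhead_gb


if __name__ == "__main__":
    for size in (7, 14, 34, 70):
        print(f"{size}B: ~{estimate_vram_gb(size):.1f} GB at 4-bit, "
              f"~{weights_gb(size, 16):.0f} GB of weights at 16-bit")
```

With the assumed 1 GB allowance, the 4-bit estimates for 7B, 14B, 34B, and 70B models come to 4.5, 8, 18, and 36 GB, which fall inside the ranges listed above.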
For most organizations starting out, an NVIDIA RTX 4060 (8 GB, around $300) or RTX 4060 Ti 16 GB (around $450) provides a capable entry point. An RTX 4090 (24 GB, around $1,800) is the sweet spot for serious deployments, offering excellent performance across models up to 34B parameters.
CPU and System RAM
While the GPU handles model inference, the CPU manages everything else: agent orchestration logic, tool execution, API calls, database queries, file operations, and inter-process communication. The CPU needs to be fast enough that it does not become a bottleneck between GPU inference calls.
Minimum viable: A modern 4-core CPU (AMD Ryzen 5 or Intel i5 equivalent) with 16 GB RAM handles single-agent workloads with minimal tool usage. This configuration works for development and light production use.
Recommended: An 8-core CPU with 32 GB RAM supports multiple concurrent agent sessions, RAG pipelines with vector database queries, and moderate tool usage. This handles most production workloads for small to medium teams.
Production scale: A 16+ core CPU with 64 GB RAM serves heavy multi-agent workloads with extensive tool usage, large vector databases, and many concurrent users. At this tier, you may also want ECC (Error Correcting Code) memory for reliability in always-on deployments.
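The three tiers above amount to a simple threshold check. The sketch below encodes them as written; the function name `hardware_tier` and the "below minimum" label are made up for this example.

```python
import os


def hardware_tier(cores: int, ram_gb: float) -> str:
    """Map core count and RAM to the tiers described above.

    Both thresholds must be met: 16 cores with 32 GB RAM is still
    only the "recommended" tier.
    """
    if cores >= 16 and ram_gb >= 64:
        return "production scale"
    if cores >= 8 and ram_gb >= 32:
        return "recommended"
    if cores >= 4 and ram_gb >= 16:
        return "minimum viable"
    return "below minimum"


if __name__ == "__main__":
    # os.cpu_count() reports logical cores, which can be double the
    # physical core count on CPUs with SMT / Hyper-Threading.
    logical = os.cpu_count() or 0
    ram_gb = 32  # enter your installed RAM here; not auto-detected
    print(f"{logical} logical cores, {ram_gb} GB RAM:",
          hardware_tier(logical, ram_gb))
```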
CPU architecture matters less than raw core count and clock speed for AI agent workloads. Both AMD and Intel current-generation processors perform well. For CPU-only inference without a GPU (running smaller models via llama.cpp), AVX2 and AVX-512 instruction support improves performance significantly.
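On Linux, one way to see whether a CPU advertises these instruction sets is to read the `flags` line of `/proc/cpuinfo`. This is a rough sketch that only covers x86 Linux hosts; the helper name `simd_flags` is an assumption, and `avx512f` (the AVX-512 foundation subset) is used here as a stand-in for AVX-512 support in general.

```python
from pathlib import Path


def simd_flags(cpuinfo_text: str) -> dict[str, bool]:
    """Report AVX2 / AVX-512 support from /proc/cpuinfo-style text."""
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        # x86 Linux lists CPU features on lines like "flags : fpu sse ...".
        if line.startswith("flags") and ":" in line:
            flags = set(line.split(":", 1)[1].split())
            break
    return {
        "avx2": "avx2" in flags,
        # AVX-512 is a family of extensions; avx512f is its base subset.
        "avx512": "avx512f" in flags,
    }


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        print(simd_flags(path.read_text()))
    else:
        print("No /proc/cpuinfo here; this sketch only covers Linux.")
```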
Storage Requirements
Self-hosted AI systems need storage for model weights, vector database indices, conversation logs, and your document corpus for RAG pipelines.
Model weights are the largest storage consumers. A single 7B model at 4-bit quantization occupies approximately 4 GB. A 70B model takes about 35 to 40 GB. If you experiment with multiple models, which is common, allocate 100 to 200 GB for model storage alone.
Vector database indices grow with your document corpus. A typical RAG setup with a few thousand documents uses 1 to 10 GB. Large enterprise knowledge bases can reach 50 to 100 GB of index data.
Conversation logs and monitoring data accumulate over time. Agent traces, token usage logs, and performance metrics can consume 10 to 50 GB per month for active deployments.
Recommended storage: A 1 TB NVMe SSD provides comfortable headroom for most deployments. SSD speed matters because loading a model from disk into GPU memory benefits from fast sequential reads. If you plan to host many models or large document corpora, allocate 2 TB. For archived conversation logs, a secondary HDD or network storage works fine, since access speed is less critical.
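To see how these figures add up, here is a small back-of-the-envelope calculation. The deployment in the example is hypothetical, and the function name and parameters are chosen for illustration only.

```python
def storage_needed_gb(model_sizes_gb: list[float], index_gb: float,
                      logs_gb_per_month: float, months_retained: int) -> float:
    """Sum the main storage consumers for a deployment."""
    return sum(model_sizes_gb) + index_gb + logs_gb_per_month * months_retained


if __name__ == "__main__":
    # Hypothetical deployment: two 7B models and one 70B model at 4-bit,
    # a 10 GB vector index, and 30 GB of logs per month kept for a year.
    total = storage_needed_gb([4, 4, 40], index_gb=10,
                              logs_gb_per_month=30, months_retained=12)
    print(f"~{total:.0f} GB")
```

In this example the total is 418 GB, of which 360 GB is logs. Retained logs can outgrow model weights, which is one reason to move them to secondary storage.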
Network Requirements
If your self-hosted setup runs entirely on-premise, network requirements are minimal. The agent communicates with local services over localhost or LAN connections. A standard gigabit ethernet connection handles any internal workload.
If you host on a VPS or remote server, your internet connection needs to handle model downloads (which can be tens of gigabytes per model) and ongoing user traffic. A 100 Mbps connection is sufficient for most deployments. Latency matters more than bandwidth for interactive applications, so choose a hosting provider geographically close to your users.
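As a rough illustration of what a 100 Mbps link means for model downloads, the sketch below converts size and link speed into transfer time. The `efficiency` parameter is a hypothetical fudge factor for protocol overhead and shared links; it is not a measured value.

```python
def download_minutes(size_gb: float, link_mbps: float,
                     efficiency: float = 1.0) -> float:
    """Minutes to transfer size_gb (decimal gigabytes) over link_mbps.

    efficiency is an assumed factor between 0 and 1; 1.0 means the
    full line rate is available for the whole transfer.
    """
    megabits = size_gb * 8 * 1000  # 1 GB = 8,000 megabits
    return megabits / (link_mbps * efficiency) / 60


if __name__ == "__main__":
    for size in (4, 40):
        print(f"{size} GB at 100 Mbps: ~{download_minutes(size, 100):.0f} min")
```

At the full line rate, a 40 GB model takes a little under an hour at 100 Mbps, so bandwidth mainly affects setup time rather than day-to-day use.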
If your agents use tools that access external services (web browsing, API calls, email), those require outbound internet access. Consider running tools through a proxy or firewall to control which external services your agents can reach.
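A proxy or firewall enforces this at the network level. Inside the agent itself, a host allowlist check before each outbound request is a useful complement, though not a substitute. The sketch below shows one possible shape; the hosts in `ALLOWED_HOSTS` are placeholders and the function name is hypothetical.

```python
from urllib.parse import urlsplit

# Placeholder allowlist; replace with the services your agents need.
ALLOWED_HOSTS = frozenset({"api.example.com", "docs.example.com"})


def is_allowed(url: str, allowed_hosts: frozenset[str] = ALLOWED_HOSTS) -> bool:
    """Return True only for http(s) URLs whose host is on the allowlist."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    # hostname strips any port and userinfo and is already lower-cased.
    host = parts.hostname or ""
    return host in allowed_hosts


if __name__ == "__main__":
    for url in ("https://api.example.com/v1/search",
                "https://unknown.example.net/",
                "ftp://api.example.com/file"):
        print(url, "->", "allow" if is_allowed(url) else "block")
```

The match is exact, so subdomains must be listed explicitly. That is a deliberate choice in this sketch to keep the rule easy to audit.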
Software Prerequisites
The software stack for self-hosted AI agents has standardized around a few core components.
Operating system: Linux is the standard choice. Ubuntu 22.04 LTS and 24.04 LTS are the most widely supported releases. NVIDIA GPU drivers work best on Linux, and Docker support is most mature. macOS works for development on Apple Silicon (M-series chips), whose unified memory is shared between CPU and GPU, so there is no separate VRAM pool to manage. Windows works but is less commonly used for production deployments.
NVIDIA drivers and CUDA: Required for GPU inference. Install the latest NVIDIA driver for your GPU, then install the CUDA toolkit. Most inference engines handle CUDA integration automatically, but the driver must be installed at the OS level.
Docker and Docker Compose: The standard deployment tool for self-hosted AI. Most platforms (Dify, Flowise, n8n) provide Docker Compose configurations. The NVIDIA Container Toolkit enables GPU access from within Docker containers. Docker simplifies deployment, updates, and isolation of components.
Python 3.10+: Required if you use code-first frameworks like LangGraph, CrewAI, or custom agent scripts. Most orchestration frameworks are Python-based.
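Before installing a platform, it can help to confirm that the prerequisites above are actually present. The following is a minimal preflight sketch, assuming `docker` and `nvidia-smi` are on the PATH when installed; the function names and the report layout are made up for this example. It only reads information and changes nothing on the host.

```python
import shutil
import subprocess
import sys


def parse_gpu_csv(output: str) -> list[tuple[str, int]]:
    """Parse 'name, memory.total' CSV rows into (name, MiB) pairs."""
    gpus = []
    for line in output.strip().splitlines():
        name, _, mem = line.rpartition(",")
        try:
            gpus.append((name.strip(), int(mem.strip())))
        except ValueError:
            continue  # skip rows without a numeric memory value
    return gpus


def preflight() -> dict[str, object]:
    """Collect basic facts about the host."""
    report: dict[str, object] = {
        "python_ok": sys.version_info >= (3, 10),
        "docker": shutil.which("docker") is not None,
        "nvidia_smi": shutil.which("nvidia-smi") is not None,
        "gpus": [],
    }
    if report["nvidia_smi"]:
        # With nounits, memory.total is printed as a bare number in MiB.
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=False,
        )
        if result.returncode == 0:
            report["gpus"] = parse_gpu_csv(result.stdout)
    return report


if __name__ == "__main__":
    for key, value in preflight().items():
        print(f"{key}: {value}")
```

Finding `docker` on the PATH does not prove the daemon is running or that the NVIDIA Container Toolkit is configured, so treat a passing report as a starting point.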
Technical Skills Required
Self-hosting AI does not require machine learning expertise. The skills you need are closer to standard system administration and DevOps.
Essential skills: Comfort with the Linux command line, basic Docker operations (docker compose up, docker logs, docker exec), understanding of networking concepts (ports, firewall rules, reverse proxies), and the ability to read and follow technical documentation.
Helpful but not required: Python programming (for custom agent development), database administration (for vector databases and monitoring), and Kubernetes knowledge (only for large-scale deployments).
Not required: Machine learning theory, neural network architecture, model training, or GPU programming. The tools available in 2026 abstract all of this away. You interact with models through APIs and configuration files, not through code that touches model internals.
Budget Summary by Tier
Starter tier ($500 to $1,000): CPU-only setup or used GPU. Runs 7B models via llama.cpp. Suitable for experimentation and light personal use. Can use an existing computer if it has a compatible GPU.
Developer tier ($1,500 to $3,000): Dedicated workstation with RTX 4060 Ti 16 GB or RTX 4070. Runs 7B to 13B models comfortably. Suitable for development, testing, and light production workloads for a small team.
Professional tier ($3,500 to $5,000): Workstation with RTX 4090. Runs models up to 34B parameters. Suitable for production workloads serving a team of 10 to 50 users with multiple concurrent agent sessions.
Enterprise tier ($15,000+): Server-grade hardware with A100 or H100 GPUs. Runs 70B+ models with high concurrency. Suitable for organization-wide deployment, high-volume processing, and demanding quality requirements.
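The tiers above can be read as a lookup from target model size to budget. The sketch below is one way to express that; the cut-offs of 8 and 14 billion parameters are an assumption, rounded up from the list's 7B and 13B so that the 8B and 14B models named in the GPU section land in the same tiers.

```python
# (tier, largest model in billions of parameters, budget range)
TIERS = [
    ("Starter", 8, "$500 to $1,000"),
    ("Developer", 14, "$1,500 to $3,000"),
    ("Professional", 34, "$3,500 to $5,000"),
    ("Enterprise", float("inf"), "$15,000+"),
]


def cheapest_tier(params_billion: float) -> tuple[str, str]:
    """Return the first tier whose size limit covers the model."""
    for name, max_params, budget in TIERS:
        if params_billion <= max_params:
            return name, budget
    raise ValueError("unreachable: the last tier has no upper bound")


if __name__ == "__main__":
    for size in (7, 14, 34, 70):
        name, budget = cheapest_tier(size)
        print(f"{size}B -> {name} tier ({budget})")
```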
An RTX 4060 Ti 16 GB ($450) or RTX 4090 ($1,800) paired with 32 GB RAM, a 1 TB NVMe SSD, and basic Linux and Docker knowledge is enough to run production-quality self-hosted AI agents for most use cases.