What Is a Self-Hosted LLM
How Self-Hosted LLMs Differ from Cloud APIs
Cloud LLM services like the OpenAI API, Anthropic API, or Google Vertex AI operate on a simple model: you send a prompt over the internet, the provider runs inference on their GPU clusters, and you receive a response. The provider handles model hosting, scaling, updates, and infrastructure. You pay per token processed, typically ranging from $0.15 to $15 per million tokens depending on the model.
Self-hosted LLMs work differently. You download the model weights, which are the learned parameters that define the model behavior, and load them into an inference runtime on your own hardware. The model runs as a local process. When your application sends a prompt, it goes to a local endpoint (typically localhost) rather than across the internet. The model generates its response using your CPU, GPU, or both, and returns it directly. No data leaves your network, no per-token billing occurs, and no third-party rate limits apply.
The key distinction is ownership of the inference pipeline. With cloud APIs, you rent access to a model running on remote hardware. With self-hosting, you own the entire stack: the hardware, the runtime, the model weights, and the configuration. This ownership comes with both freedom and responsibility.
What You Actually Download
When you self-host a model, you download a file (or set of files) containing the model weights. These weights are numerical parameters, billions of them, that the model learned during training. The most common format for self-hosted models is GGUF (GPT-Generated Unified Format), which is optimized for inference on consumer hardware, though models are also available in safetensors format for GPU-optimized runtimes like vLLM.
Model sizes vary enormously. A small 3-billion parameter model in 4-bit quantization might be 2GB. A 70-billion parameter model at 4-bit quantization runs around 40GB. The full-precision version of the same 70B model would be roughly 140GB. The Llama 4 Scout model, with 109 billion total parameters, needs approximately 60GB in quantized form. You choose the model size based on your hardware capacity and quality requirements.
These models are typically published on Hugging Face, a repository for machine learning models and datasets. Meta, Mistral, Google, and other organizations release their open-weight models there. Tools like Ollama simplify this further by maintaining their own model library where you pull models by name, similar to how Docker pulls container images.
The Inference Runtime
Model weights alone do not generate text. You need an inference runtime, software that loads the weights into memory, accepts prompts, performs the mathematical operations (matrix multiplications, attention calculations, token sampling), and produces output tokens. The runtime is the engine that makes the model work.
The most common runtimes in 2026 are Ollama for development and prototyping, vLLM for production serving, and llama.cpp as the underlying engine that powers several higher-level tools. Each runtime handles the complex process of loading multi-gigabyte model files into GPU memory (or system RAM for CPU inference), managing the key-value cache that stores conversation context, and orchestrating the token-by-token generation process.
Most runtimes expose an HTTP API endpoint, typically compatible with the OpenAI API format. This means your existing application code that calls OpenAI can often point to your local model by changing a single URL, with no other code changes required.
Hardware Requirements at a Glance
The hardware you need depends entirely on which model you want to run. Smaller models (3-8B parameters) run comfortably on a modern laptop with 16GB of RAM, using CPU inference. Mid-range models (13-30B parameters) benefit significantly from a dedicated GPU with 16-24GB of VRAM, such as an NVIDIA RTX 4090. Large models (70B+ parameters) typically require server-grade GPUs like the NVIDIA A100 or H100 with 80GB of VRAM, or multiple consumer GPUs working together.
Apple Silicon Macs deserve special mention because their unified memory architecture, where CPU and GPU share the same RAM pool, makes them unusually capable for local inference. A Mac Studio with 64GB of unified memory can run 70B parameter models that would require a dedicated GPU on other platforms.
Quantization reduces hardware requirements dramatically. A 70B parameter model at full 16-bit precision needs roughly 140GB of memory. The same model quantized to 4-bit precision fits in approximately 40GB, with minimal quality loss on most tasks. This is what makes large model self-hosting practical on hardware that costs thousands rather than tens of thousands of dollars.
Who Self-Hosts and Why
Self-hosting appeals to several distinct groups. Individual developers and hobbyists run local models for experimentation, coding assistance, and learning. Startups use self-hosted models to avoid API costs during development and to maintain data privacy for their customers. Enterprises self-host for regulatory compliance, particularly in healthcare, finance, and government sectors where data cannot leave organizational boundaries. Research teams self-host to run experiments that would be prohibitively expensive at cloud API prices, often processing billions of tokens during evaluation runs.
The common thread is the need for one or more of: data privacy, cost control at scale, customization capability, or operational independence from third-party services. If none of these factors matter for your use case, cloud APIs remain the simpler choice.
A self-hosted LLM gives you full ownership of the AI inference pipeline, from model weights to hardware. The tradeoff is accepting operational responsibility in exchange for data privacy, cost control, and customization freedom.