How to Choose a Self-Hosted LLM

Updated May 2026

Choosing the right self-hosted LLM requires balancing four factors: your hardware capacity, your primary use case, your quality requirements, and the model ecosystem you want to work within. This guide walks through each decision point to help you select the best model for your situation.

Step 1: Assess Your Hardware

Your hardware determines the maximum model size you can run. Start by identifying your available memory:

8-16GB system RAM, no GPU: You are limited to models of about 8B parameters or fewer at Q4 quantization. Recommended models: Llama 3.2 3B, Phi-3 Mini 3.8B, Gemma 2B. These run via CPU inference in Ollama at roughly 5-15 tokens per second, depending on the CPU and model size.

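16-24GB GPU VRAM (for example RTX 4080 or RTX 4090): You can run 7-8B models at full precision (FP16), 13-14B models at Q8, or, with 24GB, roughly 30B models at Q4. A 13B model at FP16 needs about 26GB for weights alone, so it does not fit in this tier. Recommended models: Llama 3.1 8B (versatile), Codestral (coding), Ministral 14B (balanced). Inference speed is roughly 30-80 tokens per second, depending on model size and quantization.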

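32-64GB unified memory (Apple Silicon): You can run 30B models at Q8 or, with 48GB or more, 70B models at Q4. A 30B model at full precision needs about 60GB for weights alone, so it does not comfortably fit even at 64GB. Recommended models: Llama 3.3 70B Q4 (best general quality), Llama 4 Scout Q4 (best context window, but at about 109B total parameters it needs the very top of this range). Inference speed is roughly 15-30 tokens per second, depending on the chip and model.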

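80GB+ GPU VRAM (A100, H100): A single 80GB card runs 70B models at Q4 comfortably or at 8-bit with little room to spare. A 70B model at full precision (FP16) needs about 140GB for weights alone, so plan on two 80GB GPUs for that, or for 100B+ MoE models. Recommended models: Llama 3.3 70B (FP16 across two GPUs, or quantized on one), Llama 4 Scout, Mistral Small 4. Production-grade throughput via vLLM.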
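
The tiers above follow from simple arithmetic: weight memory is roughly the parameter count times the bits per weight, divided by eight. The sketch below is a rough estimator under stated assumptions, not a precise sizing tool. The function names and the 20% overhead allowance for KV cache and runtime buffers are illustrative choices, and real usage varies with context length and runtime.

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: float,
                       overhead: float = 0.2) -> float:
    """Rough memory needed to load and run a model, in GB (1 GB = 1e9 bytes).

    overhead is an assumed allowance for KV cache and runtime buffers.
    """
    # Billions of parameters times bytes per parameter gives GB of weights.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * (1 + overhead)


def fits(params_billion: float, bits_per_weight: float,
         available_gb: float, overhead: float = 0.2) -> bool:
    """True if the estimate is within the available memory."""
    return estimate_memory_gb(params_billion, bits_per_weight, overhead) <= available_gb


if __name__ == "__main__":
    examples = [("8B at FP16", 8, 16), ("8B at 4-bit", 8, 4),
                ("70B at 4-bit", 70, 4), ("70B at FP16", 70, 16)]
    for label, params, bits in examples:
        print(f"{label}: about {estimate_memory_gb(params, bits):.0f} GB")
```

Running the estimate before downloading a model is a quick way to rule out candidates that cannot fit, since multi-gigabyte downloads are slow to retry.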

Step 2: Define Your Use Case

General chat and Q&A: Any well-trained instruction model works. Llama 3.1 8B is the minimum viable option. Llama 3.3 70B provides excellent quality. Prioritize model size within your hardware budget.

Code generation and analysis: Use models trained or fine-tuned for coding. Codestral is purpose-built for this. Mistral Medium 3.5 scores highest on coding benchmarks among open models. Llama 3.1 with a coding-focused system prompt also works well.

Document analysis and RAG: Long context windows matter. Llama 4 Scout (10M tokens) or Llama 3.1 (128K tokens) are strong choices. Pair with an embedding model (nomic-embed-text) for retrieval.

AI agents and tool calling: Prioritize reliable tool calling and strong reasoning. Llama 4 Scout, Mistral Small 4, and Qwen 2.5 72B all handle multi-tool scenarios well.

Specialized domains (medical, legal, financial): Start with a strong base model (Llama 3.1 8B or 70B), then fine-tune on your domain data. A fine-tuned small model often outperforms a generic large model for domain-specific tasks.
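
For the document analysis and RAG case above, the retrieval step amounts to ranking stored text chunks by how similar their embeddings are to the query embedding. The following is a minimal sketch of that ranking using cosine similarity. The function names are hypothetical, and the three-dimensional vectors are toy values standing in for the output of a real embedding model such as nomic-embed-text.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=2):
    """Return the k chunk texts most similar to the query.

    chunks is a list of (text, embedding_vector) pairs.
    """
    ranked = sorted(chunks,
                    key=lambda chunk: cosine_similarity(query_vec, chunk[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy vectors, not real embeddings.
    chunks = [
        ("Refund policy: 30 days.", [0.9, 0.1, 0.0]),
        ("Shipping takes 3-5 days.", [0.1, 0.9, 0.0]),
        ("Office hours are 9 to 5.", [0.0, 0.1, 0.9]),
    ]
    print(top_k([0.8, 0.2, 0.0], chunks, k=1))
```

The retrieved chunks are then placed in the prompt, which is why the context window of the chat model still matters even when retrieval narrows the input.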

Step 3: Set Quality Requirements

Be honest about your quality needs. Many applications work perfectly well with a 7B model. If your use case is classification, entity extraction, summarization of structured data, or template-based generation, a smaller model saves hardware costs and provides faster responses without meaningful quality loss.

If your application involves complex reasoning, creative writing, nuanced instruction following, or customer-facing interactions where quality directly affects user experience, invest in the largest model your hardware supports. The quality difference between 7B and 70B is substantial for these tasks.

Step 4: Choose a Model Family

Llama is the default choice for most scenarios. It has the largest ecosystem, the most fine-tuned variants, and the best community support. Use Llama unless you have a specific reason to choose something else.

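Mistral is the best alternative. It is stronger on coding tasks, offers competitive performance, and releases many of its models under the permissive Apache 2.0 license. Check the license of the specific model, since not every Mistral release uses Apache 2.0. Choose Mistral for coding-heavy workloads or if licensing matters.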

Qwen excels at multilingual tasks (especially Chinese and East Asian languages) and has strong tool calling. Choose Qwen for multilingual applications.

DeepSeek focuses on reasoning and mathematical capability. Choose DeepSeek for math-heavy or analytical workloads.

Step 5: Select Quantization

Start with Q4_K_M. This gives the best balance of memory savings and quality. If you have extra memory headroom, try Q5_K_M for slightly better quality. Only use Q3 or Q2 if the model absolutely cannot fit at Q4.

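The general rule: within a fixed memory budget, it is better to run a larger model at a more aggressive quantization (fewer bits per weight) than a smaller model at a lighter one. A 30B Q4 model almost always outperforms a 13B Q8 model, even though both use roughly the same memory (about 14-18GB of weights). This holds down to about Q4; below that, quality degrades quickly.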
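
The rule above can be turned into a small search: list every size and quantization pair that fits a memory budget, then prefer the largest model. This is a sketch under assumptions. The bits-per-weight figures are approximate values for common GGUF quantization types, the function names are illustrative, and the estimate covers weights only, so leave headroom for the KV cache.

```python
# Approximate bits per weight; assumed values for illustration only.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "FP16": 16.0}


def weight_gb(params_billion, quant):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def candidates(budget_gb, sizes_billion, quants=("Q4_K_M", "Q5_K_M", "Q8_0")):
    """Return (size, quant, gb) options that fit the budget.

    Sorted with the largest model first, then the highest precision.
    """
    options = [
        (size, quant, round(weight_gb(size, quant), 1))
        for size in sizes_billion
        for quant in quants
        if weight_gb(size, quant) <= budget_gb
    ]
    return sorted(options,
                  key=lambda opt: (opt[0], BITS_PER_WEIGHT[opt[1]]),
                  reverse=True)


if __name__ == "__main__":
    for size, quant, gb in candidates(20, [8, 13, 30, 70]):
        print(f"{size}B {quant}: about {gb} GB")
```

The first entry in the result is the option the rule favors; the later entries are fallbacks if that model turns out too slow or unreliable in testing.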

Step 6: Test and Evaluate

Download and test 2-3 candidate models on representative tasks from your actual workload. Use Ollama for quick testing. Compare output quality, response speed, and reliability across your specific use cases. Benchmarks provide a starting point, but nothing replaces testing with your own data and prompts.
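
One way to make this comparison repeatable is a short script against Ollama's local REST API. The sketch below assumes Ollama is running on its default port (11434) and that the candidate models have already been pulled. It uses the `/api/generate` endpoint with streaming disabled and reads the `eval_count` and `eval_duration` fields (duration in nanoseconds) from the response. The helper names are illustrative.

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's reported token count and duration."""
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


def run_prompt(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one prompt to one model and collect the output and timing."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.loads(response.read())
    return {
        "model": model,
        "wall_seconds": time.perf_counter() - start,
        "tokens_per_second": tokens_per_second(
            body.get("eval_count", 0), body.get("eval_duration", 0)
        ),
        "response": body.get("response", ""),
    }
```

Calling `run_prompt` once per candidate model and prompt, for example with the tags `llama3.1:8b` and `qwen2.5:7b`, gives a side-by-side record of speed and output. Judging output quality still requires reading the responses against your own criteria.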

Key Takeaway

Start with your hardware constraints, match to your use case, then pick the largest model from the right family that fits. Llama is the safe default. Test with your actual workload before committing to a model for production.