Running Mistral Models Locally
The Mistral Model Lineup in 2026
Mistral has rapidly expanded from a single model to a full family spanning different sizes and specializations. The models relevant for self-hosting fall into three tiers.
Mistral Small 4
Released in March 2026, Mistral Small 4 is a MoE model with 119 billion total parameters that activates only 24 billion per inference pass. It combines four capabilities that previously required separate models: instruction following, deep reasoning, image understanding, and code generation. Despite its large total parameter count, the MoE architecture means inference costs roughly match a 24B dense model.
Small 4 runs on vLLM, llama.cpp, SGLang, Hugging Face Transformers, and NVIDIA NIM. Hardware requirements depend on quantization: at FP8 precision, it needs approximately 120GB of VRAM (two H100 GPUs). At Q4 quantization for llama.cpp or Ollama, it fits in approximately 65-70GB, making it runnable on high-memory Apple Silicon Macs or a single H100. Early benchmarks show 40% lower latency and 3x throughput compared to the previous Small 3 model.
Mistral Medium 3.5
Released in April 2026, Medium 3.5 is a 128 billion parameter dense model (not MoE). It is Mistral first flagship merged model, combining instruction following, reasoning, and coding into a single set of weights. It scores 77.6% on SWE Bench Verified, a leading coding benchmark, and features configurable reasoning effort per request, allowing the same model to handle quick chat replies and complex multi-step analysis.
Self-hosting Medium 3.5 requires significant hardware. At FP8 precision, the model needs approximately 128GB of VRAM for weights alone, plus additional memory for KV cache and context. The practical minimum is four GPUs with 80GB each (320GB total), such as four H100s or four A100 80GB GPUs. This is a datacenter model, not a desktop model.
Specialized Models
Codestral is Mistral dedicated coding model, optimized for code generation, completion, and analysis. It supports over 80 programming languages and integrates with popular IDEs. For teams that primarily need AI-assisted coding, Codestral offers better code quality per parameter than general-purpose models.
Ministral 3 (14B) is the small, efficient option in the Mistral lineup. At 14 billion parameters, it runs comfortably on a single consumer GPU (RTX 4090 or similar) and provides solid performance for simpler tasks like summarization, classification, and basic Q&A.
Mistral OCR processes documents, extracting text from images and PDFs with layout awareness. This is particularly useful for enterprises that need to digitize paper documents or process scanned forms.
Mistral vs Llama: How to Choose
Both model families are strong choices, and the best option depends on your specific requirements. Mistral models tend to excel at coding tasks and structured output generation. Llama models have a larger ecosystem of fine-tuned variants and community tooling. Mistral Small 4 multimodal capabilities (text and image) make it a strong choice for applications that need vision, while Llama 4 Scout offers the superior context window (10M tokens vs the typical 128K-256K for Mistral models).
At the smaller end, Llama 3.1 8B and Ministral 14B serve different niches: the 8B model runs on minimal hardware while the 14B model delivers noticeably better quality at modest additional cost. At the larger end, Llama 4 Scout MoE architecture gives it a hardware efficiency advantage over Mistral Medium 3.5 dense architecture for equivalent quality levels.
Setting Up Mistral Models
With Ollama, running Mistral models follows the same pattern as any other model: ollama pull mistral downloads the default Mistral model, while ollama pull mistral-small or ollama pull codestral pulls specific variants. The Ollama library maintains pre-configured versions with appropriate quantization for each model.
For vLLM, Mistral models are available on Hugging Face under the mistralai organization. Load them with the standard vLLM server command, specifying the model identifier. For MoE models like Small 4, vLLM handles the expert routing automatically.
Licensing
Most 2026 Mistral models ship under Apache 2.0 or equivalent permissive licenses, allowing unrestricted commercial use. This is a notable advantage over Llama models, which impose a 700M MAU threshold on commercial use. For startups or products that could eventually reach large scale, Mistral permissive licensing eliminates a potential future constraint.
Mistral models offer strong competition to Llama, with particular advantages in coding performance, permissive licensing, and the combination of multimodal and reasoning capabilities in a single model. Small 4 is the standout for self-hosting efficiency.