Running Llama Models Locally

Updated May 2026
Meta Llama is the most widely deployed open-weight model family for self-hosting. The lineup spans from efficient 8B parameter models that run on a laptop to the 400B parameter Maverick that competes with cloud-only offerings. Every Llama model is free to download, and most can be running locally within minutes using Ollama or similar tools.

The Llama Model Family

Meta has released multiple generations of Llama models, each with significant improvements. The currently relevant models for self-hosting fall into two categories: the dense Llama 3.x series and the MoE (Mixture of Experts) Llama 4 series.

Llama 3.1 remains the workhorse of local LLM deployments. Available in 8B, 70B, and 405B parameter sizes, the 8B version runs on virtually any modern computer while delivering surprisingly capable performance for its size. The 70B version is the sweet spot for organizations that need high quality but want to run on a single GPU. Both support a 128K token context window, sufficient for processing long documents or maintaining extended conversations. The 405B version exists primarily for research and large-scale deployments, requiring multiple high-end GPUs.

Llama 3.3 refined the 70B parameter model specifically, improving multilingual support and benchmark performance while maintaining the same hardware requirements as 3.1. If you are running a 70B Llama model, 3.3 is the version to use.

Llama 4 Scout introduced the MoE architecture to the Llama family in April 2025. It has 109 billion total parameters organized into 16 experts, but only activates 17 billion parameters per token. This means it delivers large-model quality at small-model inference costs. Its most remarkable feature is a 10 million token context window, the longest of any open model. Despite the large total parameter count, Scout runs on a single H100 GPU or on an Apple Silicon Mac with 32GB+ unified memory when quantized to 4-bit precision.

Llama 4 Maverick scales to 400 billion total parameters with 128 experts, still activating only 17B per token. It targets production deployments where maximum quality matters, achieving competitive benchmark scores against cloud-only models. Maverick requires multi-GPU setups (typically 2-4 H100 GPUs) and is impractical on consumer hardware.

Hardware Requirements by Model

The following table shows approximate memory requirements at different quantization levels:

Llama 3.1 8B: Full precision requires about 16GB. At Q4_K_M quantization, it needs roughly 5GB. This runs comfortably on any modern laptop with 16GB of RAM or an entry-level GPU with 8GB of VRAM.

Llama 3.3 70B: Full precision requires about 140GB. At Q4_K_M quantization, it needs roughly 40GB. A single A100 80GB GPU handles the full-precision version. For quantized inference, a Mac with 64GB unified memory, two RTX 4090s (48GB total VRAM), or a single H100 all work well.

Llama 4 Scout (109B total, 17B active): Despite the large total parameter count, the MoE architecture means memory requirements depend on the total parameter count, not active parameters. At Q4_K_M quantization, Scout needs approximately 60GB. A single H100 (80GB) handles it with room for context. Apple Silicon Macs with 64GB+ unified memory can also run it.

Llama 4 Maverick (400B total, 17B active): Requires approximately 200GB at Q4 quantization. Minimum practical setup is 4x H100 GPUs with tensor parallelism.

Running Llama with Ollama

The fastest path to running any Llama model locally is through Ollama. Install Ollama from ollama.com, then pull and run a model with two commands:

ollama pull llama3.2 downloads the default Llama 3.2 3B model. ollama pull llama3.1:70b downloads the 70B version. ollama pull llama3.1:8b-instruct-q5_K_M downloads a specific quantization. Once downloaded, ollama run llama3.1 starts an interactive chat session, and the API becomes available at http://localhost:11434 for programmatic access.

Ollama automatically selects the best execution strategy for your hardware: GPU acceleration if a compatible GPU is detected, CPU inference otherwise, and hybrid CPU/GPU execution when the model is too large for VRAM alone.

Running Llama with vLLM

For production workloads, vLLM provides higher throughput and better concurrency handling. Install vLLM via pip, then start the server pointing to a Hugging Face model identifier. vLLM downloads the model, loads it onto your GPU(s), and begins serving an OpenAI-compatible API. For Llama 4 models, vLLM supports tensor parallelism across multiple GPUs, allowing Maverick to run across a 4-GPU server.

Licensing Considerations

Llama models use the Llama Community License, which permits commercial use for organizations with fewer than 700 million monthly active users. Attribution ("Built with Llama") is required. The model weights are free to download from llama.com and Hugging Face. One notable restriction: the multimodal (vision) capabilities of Llama 4 are excluded for EU-domiciled licensees under the current license terms.

Key Takeaway

For most self-hosting scenarios, Llama 3.1 8B is the best starting point for experimentation, Llama 3.3 70B is the best quality-per-dollar choice for production, and Llama 4 Scout offers frontier-quality results with remarkable hardware efficiency thanks to MoE architecture.