Model Quantization: Smaller Models, Less RAM

Updated May 2026

Quantization reduces the precision of model weights from 16-bit or 32-bit floating-point numbers down to 8, 5, 4, or even 2 bits per weight. This shrinks memory requirements dramatically, often by 3-4x relative to FP16, while typically preserving 95-98% of model quality as measured on benchmarks. It is the single most important technique that makes large language models practical to run on consumer hardware.

What Quantization Does

A language model consists of billions of numerical parameters (weights) learned during training. At full 16-bit (FP16) precision, each weight occupies 2 bytes of memory. A 70 billion parameter model therefore needs approximately 140GB just for the weights, far more than any consumer GPU can hold.

Quantization reduces the number of bits used to represent each weight. A 16-bit floating-point number has 65,536 possible bit patterns, whereas a 4-bit quantized weight can take only 16 possible values. The model is slightly less precise, but the memory savings are enormous: that same 70B model drops from 140GB to roughly 35-40GB at 4-bit precision, fitting on a single 48GB GPU or a Mac with 64GB unified memory.
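The arithmetic above can be checked in a few lines of Python. This is a rough sketch that counts weight storage only (no KV cache, activations, or runtime overhead) and uses decimal gigabytes; the function names are illustrative and not taken from any library.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9


def distinct_values(bits: int) -> int:
    """Number of distinct bit patterns an n-bit code can hold."""
    return 2 ** bits


params_70b = 70e9
fp16_gb = weight_memory_gb(params_70b, 16)
int4_gb = weight_memory_gb(params_70b, 4)

print(f"FP16 weights:  {fp16_gb:.0f} GB")
print(f"4-bit weights: {int4_gb:.0f} GB")
print(f"Values per 4-bit weight: {distinct_values(4)}")
```

Real 4-bit files usually land somewhat above the bare 4-bit figure because they also store per-group scaling data and keep some tensors at higher precision, which is consistent with the 35-40GB range quoted above.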

The process works by analyzing groups of weights and finding the best way to map them to the available precision levels. More sophisticated quantization methods use different bit widths for different layers: higher precision for the layers that are most sensitive to quantization error, and lower precision for the layers that tolerate approximation well.
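The idea of quantizing weights in groups can be illustrated with a deliberately simple scheme: each group shares one scale derived from its largest absolute value, and every weight is rounded to the nearest of the available integer codes. This is only a sketch of the general principle, assuming symmetric round-to-nearest quantization; it is not the algorithm used by GGUF k-quants, GPTQ, or AWQ, and the function names and group size are made up for the example.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Map one group of floats to signed integer codes plus a shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit codes in the range -7..7
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest code and clamp as a safety net against float error.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from codes and the group scale."""
    return [c * scale for c in codes]


def quantize(weights: List[float], group_size: int = 4, bits: int = 4):
    """Quantize a flat weight list group by group."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


# The second group holds much smaller values, so it gets a much smaller scale.
weights = [0.70, -0.35, 0.10, 0.00, 0.020, -0.010, 0.005, 0.030]
groups = quantize(weights, group_size=4, bits=4)

for codes, scale in groups:
    print(codes, f"scale={scale:.5f}")
```

Because each group carries its own scale, small weights in one group are not swamped by a large weight elsewhere in the tensor. The reconstruction error for any weight is at most half of its group's scale.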

Quantization Formats

GGUF (GPT-Generated Unified Format) is the dominant format for quantized models used with llama.cpp, Ollama, and LM Studio. GGUF files contain both the quantized weights and the metadata needed to load them, in a single self-contained file. The format supports numerous quantization schemes identified by names like Q4_K_M, Q5_K_S, and Q8_0.
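Because a GGUF file is self-contained, a loader can learn basic facts about it from the first few bytes. The sketch below assumes the header layout described in the GGUF specification for version 2 and later (a 4-byte magic, a 32-bit version, then 64-bit tensor and metadata counts) and assumes little-endian byte order. The function name is hypothetical, and the sample bytes are synthetic rather than taken from a real model.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor_count, metadata_kv_count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size start of a GGUF file (little-endian assumed)."""
    if len(data) < HEADER_SIZE:
        raise ValueError(f"need at least {HEADER_SIZE} bytes")
    magic, version, tensor_count, kv_count = struct.unpack(
        HEADER_FORMAT, data[:HEADER_SIZE]
    )
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for demonstration; the counts are arbitrary.
sample = struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 291, 24)
header = read_gguf_header(sample)
print(header)
```

With a real file, the same function could be fed the result of reading the first 24 bytes in binary mode. The metadata key-value pairs and tensor descriptions follow the header and are not parsed here.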

GPTQ is a GPU-optimized quantization format that performs especially well with vLLM and other CUDA-based inference engines. GPTQ uses a calibration dataset during quantization to minimize quality loss, often outperforming simpler methods at the same bit width.

AWQ (Activation-aware Weight Quantization) takes a different approach: it uses activation statistics to identify the small fraction of weights that matter most to the model's output and protects them during quantization, typically by rescaling those weight channels rather than storing them at a separate higher precision. AWQ models often achieve better quality than equivalent GPTQ models at the same compression ratio.

NVFP4 is NVIDIA's proprietary 4-bit format, optimized for its Blackwell architecture GPUs. It achieves the highest inference throughput on compatible hardware but is limited to NVIDIA's latest generation.

Understanding GGUF Quantization Levels

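GGUF quantization names follow a pattern: Q[bits]_[method]_[size]. The number after Q is the nominal bit width, K marks the K-quant family, and a trailing S, M, or L selects the small, medium, or large variant. Not every name has all three parts: Q6_K has no size suffix, and Q8_0 uses a different method tag. The most common levels and their characteristics: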
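The naming pattern can be made concrete with a small parser. This is a sketch that handles only the names used in this article; other naming families exist and are rejected here. The `QuantName` type and `parse_quant_name` function are illustrative and not part of llama.cpp or any other tool.

```python
import re
from typing import NamedTuple, Optional


class QuantName(NamedTuple):
    bits: int
    method: str
    size: Optional[str]  # "S", "M", "L", or None when the name has no suffix


_QUANT_PATTERN = re.compile(r"Q(\d+)_([A-Z0-9]+)(?:_([SML]))?")


def parse_quant_name(name: str) -> QuantName:
    """Split a name like 'Q4_K_M' into bits, method tag, and size suffix."""
    match = _QUANT_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized quantization name: {name!r}")
    bits, method, size = match.groups()
    return QuantName(int(bits), method, size)


for name in ["Q2_K", "Q4_K_M", "Q5_K_S", "Q8_0"]:
    print(name, parse_quant_name(name))
```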

Q2_K: 2-bit quantization. Maximum compression but noticeable quality loss. Only recommended when memory is extremely constrained and quality is secondary.

Q3_K_M: 3-bit quantization with medium grouping. Significant memory savings with moderate quality impact. Useful for fitting very large models into limited memory.

Q4_K_M: 4-bit quantization with medium grouping. The most popular quantization level. Offers an excellent balance between memory savings (roughly 4x reduction from FP16) and quality retention (typically 95-97% of full-precision benchmarks). This is the default for most Ollama model downloads.

Q5_K_M: 5-bit quantization with medium grouping. The sweet spot for users who want the best quality while still getting substantial memory savings (roughly 3x reduction). Quality retention is typically 97-99% of full precision. Recommended when you have enough memory and want to maximize output quality.

Q6_K: 6-bit quantization. Very close to full-precision quality with roughly 2.5x memory savings. Diminishing returns compared to Q5_K_M for most use cases.

Q8_0: 8-bit quantization. Nearly indistinguishable from full precision in output quality with 2x memory savings. Primarily used when maximum quality is required but full FP16 does not fit in memory.
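To see what these reduction factors mean for a specific model, the approximate figures above can be applied to an FP16 baseline of 2 bytes per weight. The sketch below uses only the factors quoted in this section, so Q2_K and Q3_K_M are omitted because no factor is given for them. The results are estimates of weight storage, not measured file sizes, and the names are illustrative.

```python
# Approximate reduction versus FP16, as quoted in the descriptions above.
FP16_REDUCTION = {
    "Q4_K_M": 4.0,
    "Q5_K_M": 3.0,
    "Q6_K": 2.5,
    "Q8_0": 2.0,
}


def estimated_size_gb(params_billion: float, level: str) -> float:
    """Estimate weight storage in GB from the FP16 size and a reduction factor."""
    fp16_gb = params_billion * 2  # 2 bytes per weight at FP16
    return fp16_gb / FP16_REDUCTION[level]


sizes_70b = {
    level: round(estimated_size_gb(70, level), 1) for level in FP16_REDUCTION
}

for level, size_gb in sizes_70b.items():
    print(f"{level:7s} ~{size_gb:5.1f} GB")
```

For a 70B model this gives roughly 35 GB at Q4_K_M, 47 GB at Q5_K_M, 56 GB at Q6_K, and 70 GB at Q8_0, which shows how quickly each step up in precision consumes memory.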

Quality Impact in Practice

The quality impact of quantization has improved dramatically. In 2024, 4-bit quantization caused noticeable degradation in complex reasoning tasks. By 2026, improved quantization algorithms (particularly the K-quant methods used in GGUF) have largely solved this problem. On standard benchmarks like MMLU, HumanEval, and GSM8K, Q4_K_M models typically score within 2-3% of their full-precision equivalents.

The areas where quantization impact is most noticeable are creative writing (slightly less varied vocabulary choices), mathematical reasoning (occasional errors on multi-step calculations that full precision handles correctly), and very long context usage (cumulative precision errors can affect coherence at extreme context lengths). For most practical applications, including coding assistance, summarization, Q&A, classification, and general chat, 4-bit quantization is effectively invisible to the end user.

Practical Recommendations

For development and experimentation, start with Q4_K_M. It gives you the most model quality per gigabyte of RAM and works well for virtually all tasks. If you notice quality issues on specific tasks, try Q5_K_M before switching to a larger model. Often, moving up one quantization level solves the problem more efficiently than doubling the model size.

For production deployments where quality is paramount, use Q5_K_M or Q6_K. The additional memory cost is modest (roughly 25% more than Q4) and the quality improvement, while small on benchmarks, can matter for customer-facing applications.

For memory-constrained environments (laptops with 8-16GB RAM), Q4_K_M or even Q3_K_M lets you run models that would otherwise not fit at all. Here quantization is the difference between having local AI and not having it.
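One way to turn these recommendations into a rule is to pick the highest-precision level whose file fits in the available memory. The sketch below is a simplification under several assumptions: the preference order simply follows the quality ordering described in this article, the 2 GB headroom for context and runtime overhead is an arbitrary placeholder, and the file sizes in the example are made-up numbers rather than measurements of any real model.

```python
from typing import Dict, Optional

# Highest precision first, following the quality ordering in this article.
PREFERENCE = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"]


def pick_level(
    file_sizes_gb: Dict[str, float],
    memory_gb: float,
    headroom_gb: float = 2.0,  # assumed reserve for context and runtime overhead
) -> Optional[str]:
    """Return the highest-precision level that fits, or None if nothing fits."""
    budget = memory_gb - headroom_gb
    for level in PREFERENCE:
        size = file_sizes_gb.get(level)
        if size is not None and size <= budget:
            return level
    return None


# Hypothetical file sizes for a single small model, for illustration only.
example_sizes = {
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.9,
    "Q3_K_M": 4.0,
    "Q2_K": 3.2,
}

for memory in (4, 8, 16):
    print(f"{memory} GB available -> {pick_level(example_sizes, memory)}")
```

A rule like this covers the memory-constrained case. For development work, the advice above still applies: start at Q4_K_M and move up only if a specific task shows quality problems.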

Key Takeaway

Q4_K_M is the default choice for most self-hosted LLM deployments. It reduces memory by roughly 4x relative to FP16 while keeping 95-97% of full-precision quality. Use Q5_K_M when quality matters more than memory savings, and Q3_K_M when fitting the model into memory is the priority.