Quantization

An optimization technique that shrinks a model by lowering the precision of its parameters, for example from 16-bit to 4-bit, so that inference can run on limited computational resources.

What is Quantization?

Quantization is an optimization technique that reduces the numerical precision of a model's weight parameters (e.g., 32-bit floating point → 4-bit integer) to compress model size and memory usage.
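
To make the idea concrete, here is a minimal sketch of one simple scheme, symmetric "absmax" rounding to 4-bit integers. The function names and the sample weights are illustrative assumptions, and real quantizers typically use per-group scales and more elaborate rounding.

```python
def quantize_int4(weights):
    """Symmetric absmax quantization of a list of floats to 4-bit integer codes."""
    qmax = 7  # signed 4-bit integers span [-8, 7]; this sketch uses the symmetric part [-7, 7]
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one float scale shared by all weights
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.53, 0.98, -1.40, 0.07, 0.66]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Only the integer codes and the single scale need to be stored, which is where the size reduction comes from.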

Intuitive Understanding

It is similar to how lowering a photo's image quality reduces its file size. Each parameter carries less information, yet the model's overall performance holds up surprisingly well. Applying 4-bit quantization to a 70B-parameter model shrinks the VRAM needed for its weights from approximately 140GB (at 16-bit) to around 35GB, making inference possible without expensive GPU clusters.
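
The 140GB and 35GB figures follow from simple arithmetic on the weights alone. The sketch below assumes 1GB = 10^9 bytes and ignores activations, the KV cache, and the small overhead of quantization scales; `weight_memory_gb` is a helper name made up for this example.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_70b = 70e9
fp16_gb = weight_memory_gb(params_70b, 16)  # 70e9 params * 2 bytes   = 140 GB
int4_gb = weight_memory_gb(params_70b, 4)   # 70e9 params * 0.5 bytes = 35 GB
```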

Types of Quantization

Method | Characteristics
Post-Training Quantization (PTQ) | Quantizes an already-trained model as-is. Straightforward, but accuracy can degrade noticeably, especially at low bit-widths.
Quantization-Aware Training (QAT) | Simulates quantization during training so the model adapts to it. More accurate than PTQ, but incurs training cost.
GPTQ / AWQ / GGUF | Quantization methods (GPTQ, AWQ) and a file format (GGUF) geared toward LLMs. Widely used to distribute models for local LLMs.
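
As a rough illustration of how QAT differs from PTQ, the sketch below shows "fake quantization", a common way to simulate rounding in the forward pass during training, together with a straight-through gradient rule. It is a simplified scalar version under assumed names (`fake_quantize`, `ste_grad`), not the procedure of any particular framework.

```python
def fake_quantize(w, scale, qmax=7):
    """Quantize then immediately dequantize, so the forward pass sees the rounding error."""
    code = max(-qmax, min(qmax, round(w / scale)))
    return code * scale


def ste_grad(w, scale, upstream_grad, qmax=7):
    """Straight-through estimator: treat rounding as the identity function,
    and pass no gradient where the weight was clipped to the representable range."""
    inside_range = -qmax * scale <= w <= qmax * scale
    return upstream_grad if inside_range else 0.0


# Forward pass uses the rounded value; backward pass updates the full-precision weight.
w_forward = fake_quantize(0.33, scale=0.1)
w_grad = ste_grad(0.33, scale=0.1, upstream_grad=2.0)
```

PTQ, by contrast, applies the rounding once after training, with no chance for the weights to adapt.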

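QLoRA is a technique that combines quantization with LoRA: the base model's weights are kept frozen in 4-bit form while small LoRA adapter matrices are trained in higher precision, which makes fine-tuning feasible with far less memory.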
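
The structure of such a layer can be sketched in a few lines. This is a toy forward pass under simplifying assumptions: plain integer codes with a single scale stand in for the 4-bit data type used in practice, and all names and values are hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


# Frozen base weight W (2x3), stored as 4-bit integer codes plus one float scale.
w_codes = [[3, -7, 1], [0, 5, -2]]
w_scale = 0.1

# Trainable LoRA adapters kept in floating point, rank 1: B (2x1) times A (1x3).
lora_a = [[0.5, 0.25, 0.0]]
lora_b = [[0.0], [0.0]]  # B starts at zero, so the adapter is initially a no-op
lora_scaling = 2.0       # hypothetical value of alpha / rank


def forward(x):
    """y = dequant(W) x + scaling * B (A x); only A and B would receive gradients."""
    w = [[c * w_scale for c in row] for row in w_codes]  # dequantize on the fly
    base = matvec(w, x)
    low_rank = matvec(lora_b, matvec(lora_a, x))
    return [b + lora_scaling * l for b, l in zip(base, low_rank)]


y = forward([1.0, 2.0, 3.0])
```

During fine-tuning only `lora_a` and `lora_b` change, so optimizer state is needed for the small adapter matrices rather than for the full weight matrix.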

Practical Decision Criteria

Several studies have reported that quantizing a larger model often yields higher performance than running a smaller model at full precision. When selecting a model for edge AI environments, the practical approach is therefore to search over combinations of model size and quantization bit-width rather than fixing either one in advance.
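
A first pass at that search can be as simple as listing which combinations fit the available memory. The sketch below is illustrative only: the model sizes, bit-widths, and 24GB budget are assumed values, it counts weight memory alone, and it says nothing about accuracy, which still has to be measured on the target task.

```python
def weight_memory_gb(num_params_billion, bits):
    """Billions of parameters times bytes per parameter gives GB (10^9 bytes)."""
    return num_params_billion * bits / 8


def candidates_within_budget(model_sizes_b, bit_widths, budget_gb):
    """Return (size_in_billions, bits, weight_gb) for every combination that fits."""
    fits = []
    for size in model_sizes_b:
        for bits in bit_widths:
            gb = weight_memory_gb(size, bits)
            if gb <= budget_gb:
                fits.append((size, bits, gb))
    # Largest model first; among equal sizes, higher precision first.
    return sorted(fits, key=lambda c: (-c[0], -c[1]))


# Hypothetical search space for a device with 24GB of memory.
options = candidates_within_budget([7, 13, 70], [16, 8, 4], budget_gb=24)
for size, bits, gb in options:
    print(f"{size}B at {bits}-bit: {gb:.1f} GB")
```

The resulting shortlist is where evaluation starts: each surviving combination would then be benchmarked for quality and latency on the device.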