An optimization technique that compresses model size by reducing parameter precision, for example from 16-bit floats to 4-bit integers, enabling inference on hardware with limited computational resources.
## What is Quantization?

Quantization is an optimization technique that reduces the numerical precision of a model's weight parameters (e.g., 32-bit floating point → 4-bit integer) to compress model size and memory usage.

### Intuitive Understanding

It is similar to how reducing a photo's image quality decreases its file size. While the amount of information per parameter decreases, the model's overall performance is maintained to a surprisingly high degree. Applying 4-bit quantization to a 70B-parameter model shrinks VRAM consumption from approximately 140GB to around 35GB, making inference possible without expensive GPU clusters.

### Types of Quantization

| Method | Characteristics |
|------|------|
| Post-Training Quantization (PTQ) | Quantizes an already-trained model as-is. Straightforward, but may cause significant accuracy degradation. |
| Quantization-Aware Training (QAT) | Trains with quantization in mind. More accurate than PTQ, but incurs training costs. |
| GPTQ / AWQ / GGUF | Quantization formats optimized for LLMs. Widely adopted as distribution formats for local LLMs. |

QLoRA is a technique that combines quantization with LoRA, enabling fine-tuning in a 4-bit quantized state.

### Practical Decision Criteria

Multiple research findings report that quantizing a larger model yields higher performance than using a smaller model at full precision. When selecting a model for an edge AI environment, finding the optimal configuration means exploring combinations of model size and quantization bit-width.
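The core mechanism can be shown in a few lines. Below is a minimal sketch of symmetric 4-bit post-training quantization of a single weight vector, in plain Python; real LLM quantizers such as GPTQ and AWQ work per-group and calibrate on data, so the function names and the single global scale here are illustrative simplifications.

```python
def quantize_4bit(weights):
    """Map floats to signed 4-bit integers in [-8, 7] using one scale."""
    scale = max(abs(w) for w in weights) / 7  # 7 = largest positive int4
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [qi * scale for qi in q]

weights = [0.42, -1.30, 0.07, 0.95, -0.58]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)

# Each code fits in 4 bits, so storage drops 4x versus 16-bit floats,
# at the cost of a rounding error of at most scale / 2 per weight.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q)          # small integers in [-8, 7]
print(max_error)  # bounded by scale / 2
```

The 4x size reduction from 16-bit to 4-bit is exactly the 140GB → 35GB ratio mentioned above; the quality question is whether the per-weight rounding error stays small enough not to hurt the model's outputs.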


QLoRA (Quantized LoRA) is a method that combines LoRA with 4-bit quantization, enabling fine-tuning of large language models even on consumer-grade GPUs.

Inference-time scaling is a technique that dynamically increases or decreases the amount of computation used during a model's inference phase, allocating more "thinking steps" to difficult problems while providing immediate answers to simpler ones.
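One concrete instance of this idea is self-consistency with an adaptive sample budget: keep drawing answers from the model until one answer holds a clear majority, so noisy (hard) questions automatically consume more compute than easy ones. The sketch below uses a hypothetical `sample_answer` stub in place of a real model call; the thresholds and question labels are illustrative assumptions.

```python
import random
from collections import Counter

def sample_answer(question, rng):
    """Stub model: 'hard' questions answer correctly less reliably."""
    p_correct = 0.9 if question == "easy" else 0.55
    return "42" if rng.random() < p_correct else rng.choice(["41", "43"])

def answer_with_scaling(question, min_votes=3, max_samples=25, seed=0):
    """Sample until some answer leads by min_votes, or the budget runs out."""
    rng = random.Random(seed)
    counts = Counter()
    for n in range(1, max_samples + 1):
        counts[sample_answer(question, rng)] += 1
        ranked = counts.most_common(2)
        lead = ranked[0][1] - (ranked[1][1] if len(ranked) > 1 else 0)
        if lead >= min_votes:
            return ranked[0][0], n   # answer, compute spent
    return counts.most_common(1)[0][0], max_samples

print(answer_with_scaling("easy"))  # converges after few samples
print(answer_with_scaling("hard"))  # typically needs more samples
```

The same pattern generalizes to longer chains of thought or deeper search: the model spends a variable number of "thinking steps" per query instead of a fixed forward pass.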

A Sparse Model is a general term for neural network architectures that activate only a subset of the model's parameters during inference, rather than all of them. A representative example is MoE (Mixture of Experts), which adopts a scaling strategy distinct from that of Dense Models — increasing the total parameter count while keeping inference costs low.
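The routing mechanism that makes MoE sparse can be sketched directly: a gate scores every expert, but only the top-k experts actually execute, so per-token compute stays roughly constant while total parameters grow with the number of experts. The toy scalar "experts" and gate scores below are illustrative stand-ins for real feed-forward sublayers and a learned router.

```python
import math

EXPERTS = [
    lambda x: 2 * x,        # expert 0
    lambda x: x + 1,        # expert 1
    lambda x: x * x,        # expert 2
    lambda x: -x,           # expert 3
]

def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_scores, k=2):
    """Run only the top-k experts and mix their outputs by gate weight."""
    topk = sorted(range(len(gate_scores)),
                  key=lambda i: gate_scores[i], reverse=True)[:k]
    weights = softmax([gate_scores[i] for i in topk])  # renormalize over top-k
    # Only k of the experts execute; the rest of the parameters stay idle.
    return sum(w * EXPERTS[i](x) for w, i in zip(weights, topk)), topk

y, used = moe_forward(3.0, gate_scores=[0.1, 2.0, 1.5, -1.0], k=2)
print(used)  # indices of the two experts that actually ran
```

With four experts and k = 2, the model carries all four experts' parameters but pays the compute cost of only two per input, which is the sparse scaling trade-off the entry describes.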


Local LLM / SLM Deployment Comparison — AI Utilization Without Cloud API Dependency