A technique that transfers knowledge from a large teacher model to a small student model, creating a lightweight yet high-accuracy model.
## What is Knowledge Distillation?

Knowledge Distillation is a technique in which a smaller "student model" is trained using the output distribution of a large "teacher model" as training data. By mimicking the inference patterns of the teacher model, the student model can maintain high accuracy while significantly reducing the number of parameters.

### Why is Distillation Necessary?

Deploying an LLM with tens of billions of parameters directly in a production environment makes GPU costs and latency a business constraint. On the other hand, training a small model from scratch makes it difficult to achieve the same level of accuracy as a large model. Distillation is a practical approach that resolves this contradiction. For example, Microsoft's Phi series distills small models using synthetic data generated by large models, achieving performance that rivals large models despite being an SLM (Small Language Model).

### Differences from Fine-Tuning

Fine-tuning is a technique that adjusts the weights of an existing model to specialize it for a specific task, without changing the model size. Distillation differs in that it reduces the model size itself. In practice, a pipeline in which the model is first made smaller through distillation and then adapted to a business domain using LoRA or similar methods is becoming increasingly common.

### Limitations of Distillation

Tasks that the teacher model struggles with will also be difficult for the student model. Additionally, since a large volume of outputs must be generated from the teacher model, the computational cost of the distillation process itself cannot be overlooked.
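The core of training on the teacher's output distribution is a soft-target loss: both models' logits are softened with a temperature before comparing them with KL divergence. A minimal plain-Python sketch (the temperature value and function names here are illustrative, not from any specific library):

```python
import math

def softmax(logits, temperature=1.0):
    # Dividing logits by the temperature before normalizing flattens the
    # distribution; higher T exposes more of the teacher's "dark knowledge"
    # about which wrong classes are almost right.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so its gradient magnitude stays comparable to a
    # hard-label cross-entropy term it is usually mixed with.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return (temperature ** 2) * kl
```

When the student's logits match the teacher's, the loss is zero; any mismatch in the softened distributions produces a positive penalty that the student is trained to minimize.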


Fine-tuning refers to the process of providing additional training data to a pre-trained machine learning model in order to adapt it to a specific task or domain.

A Dense Model is a neural network architecture in which all of the model's parameters are used for computation during inference. In contrast to MoE (Mixture of Experts), which activates only a subset of experts, a Dense Model always involves all weights in computation regardless of the input.
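The dense-versus-MoE contrast can be made concrete with a toy layer where each "expert" is just a scalar function and a call counter records how much work each path does. This is a deliberately simplified sketch, not any real framework's API:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

calls = [0, 0, 0, 0]  # how many times each toy "expert" actually ran

def make_expert(i, w):
    def expert(x):
        calls[i] += 1
        return w * x
    return expert

experts = [make_expert(i, w) for i, w in enumerate([1.0, 2.0, 3.0, 4.0])]

def dense_layer(x):
    # Dense: every weight block participates for every input.
    return sum(e(x) for e in experts)

def moe_layer(x, gate_logits, k=2):
    # MoE: a gating network picks the top-k experts; the rest do no work.
    gates = softmax(gate_logits)
    top = sorted(range(len(experts)), key=gates.__getitem__, reverse=True)[:k]
    return sum(gates[i] * experts[i](x) for i in top)

print(dense_layer(1.0))                       # all 4 experts run
print(moe_layer(1.0, [0.1, 2.0, 0.2, 1.5]))   # only the top 2 experts run
```

After one call to each layer, the counters show the difference: the dense path invoked all four experts, while the MoE path touched only the two the gate selected.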

LoRA (Low-Rank Adaptation) is a technique that inserts low-rank delta matrices into the weight matrices of large language models and trains only those deltas, enabling fine-tuning by adding approximately 0.1–1% of the total model parameters.
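The parameter-count claim is easy to verify with a sketch of the low-rank delta itself: for a frozen d x d weight, LoRA trains only an r x d matrix A and a d x r matrix B, and the update is B @ A. A minimal illustration (the sizes d=1024, r=4 are hypothetical, chosen to land in the stated 0.1–1% range):

```python
import random

d, r = 1024, 4  # hidden size d and rank r, with r << d

# A gets a small random init; B starts at zero, so B @ A is the zero matrix
# and the adapted weight equals the frozen weight before any training.
random.seed(0)
A = [[random.gauss(0.0, 0.01) for _ in range(d)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]

# The low-rank delta is d x d, but only the 2*d*r entries of A and B
# are trainable; the original weight W stays frozen.
delta = [[sum(B[i][k] * A[k][j] for k in range(r)) for j in range(d)]
         for i in range(d)]

trainable = 2 * d * r  # parameters in A and B
total = d * d          # parameters in the frozen weight
print(f"trainable fraction: {trainable / total:.3%}")  # 0.781%
```

Because the adapted weight is W + B @ A, the delta can be merged into W after training, so LoRA adds no inference-time latency.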
