A technique that transfers knowledge from a large teacher model to a small student model, creating a lightweight yet high-accuracy model.
Knowledge Distillation is a technique in which a smaller "student model" is trained to reproduce the output distribution of a large "teacher model." By mimicking the teacher's predictions (soft labels) rather than learning only from hard ground-truth labels, the student model can retain much of the teacher's accuracy while significantly reducing the number of parameters.
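In the classic formulation (Hinton et al., 2015), the student minimizes a weighted mix of a soft loss against the teacher's temperature-scaled distribution and a hard loss against the ground-truth labels. The PyTorch sketch below is a minimal illustration; the function name and the values of the temperature T and mixing weight alpha are assumptions chosen for the example:

```python
# Minimal sketch of a distillation loss (illustrative hyperparameters).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients, following Hinton et al.
    # Hard targets: ordinary cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```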
Deploying an LLM with tens of billions of parameters directly in a production environment turns GPU costs and latency into a business constraint. On the other hand, a small model trained from scratch rarely reaches the accuracy of a large one. Distillation is a practical way to resolve this dilemma.
For example, Microsoft's Phi series distills small models using synthetic data generated by large models, achieving performance that rivals large models despite being an SLM (Small Language Model).
Fine-tuning is a technique that adjusts the weights of an existing model to specialize it for a specific task, without changing the model size. Distillation differs in that it reduces the model size itself. In practice, a pipeline in which the model is first made smaller through distillation and then adapted to a business domain using LoRA or similar methods is becoming increasingly common.
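As a rough sketch of the second stage of such a pipeline, the example below attaches LoRA adapters to an already-distilled student using the Hugging Face PEFT library. The model path, target modules, and hyperparameters are illustrative placeholders, not recommendations from the source:

```python
# Hypothetical sketch: domain-adapting a distilled student model with LoRA.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

student = AutoModelForCausalLM.from_pretrained("path/to/distilled-student")  # placeholder

lora_config = LoraConfig(
    r=8,                                  # low-rank dimension of the adapters
    lora_alpha=16,                        # scaling factor for adapter updates
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(student, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trained
```

Because only the adapter weights train, the distilled base model stays frozen and the domain adaptation remains cheap to run and easy to swap out.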
Tasks that the teacher model struggles with will also be difficult for the student model. Additionally, since a large volume of outputs must be generated from the teacher model, the computational cost of the distillation process itself cannot be overlooked.


Fine-tuning refers to the process of providing additional training data to a pre-trained machine learning model in order to adapt it to a specific task or domain.

Speculative decoding is an inference acceleration technique in which a small draft model speculatively proposes multiple tokens in advance, and a large model verifies them in parallel.
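Hugging Face transformers exposes this pattern as "assisted generation": a small draft model is passed via the assistant_model argument of generate(). The sketch below is illustrative; both model names are placeholders, and assisted generation assumes the two models share a tokenizer:

```python
# Illustrative sketch of speculative (assisted) decoding with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("large-target-model")        # placeholder
target = AutoModelForCausalLM.from_pretrained("large-target-model")    # placeholder
draft = AutoModelForCausalLM.from_pretrained("small-draft-model")      # placeholder

inputs = tokenizer("Knowledge distillation is", return_tensors="pt")
outputs = target.generate(
    **inputs,
    assistant_model=draft,  # draft model speculates; target model verifies
    max_new_tokens=50,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```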

Quantization is an optimization technique that compresses model size by reducing parameter precision, for example from 16-bit floating point to 4-bit integers, enabling inference with limited computational resources.
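The core idea can be shown with a toy symmetric quantizer: map each weight onto a small integer grid with a shared scale, then multiply back at inference time. This NumPy sketch is purely illustrative; production schemes such as GPTQ or AWQ use per-group scales and more careful rounding:

```python
# Toy sketch of symmetric 4-bit quantization of a weight tensor.
import numpy as np

def quantize_4bit(weights: np.ndarray):
    scale = np.abs(weights).max() / 7           # signed 4-bit range: -8..7
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_4bit(w)
print("max reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```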


A Dense Model is a neural network architecture in which all of the model's parameters are used for computation during inference. In contrast to MoE (Mixture of Experts), which activates only a subset of experts, a Dense Model always involves all weights in computation regardless of the input.
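To make the contrast concrete, here is a toy PyTorch sketch (entirely illustrative, with made-up dimensions): the dense block applies all of its weights to every input, while the MoE block routes each input to a single expert, so only a fraction of its parameters is active per token.

```python
# Hypothetical contrast sketch: dense FFN vs. top-1-routed MoE FFN.
import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    def __init__(self, d_model=64, d_hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_model))

    def forward(self, x):
        return self.net(x)  # every weight participates for every input

class MoEFFN(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [DenseFFN(d_model, d_hidden) for _ in range(n_experts)]
        )

    def forward(self, x):  # x: (batch, d_model)
        expert_idx = self.router(x).argmax(dim=-1)  # top-1 routing per input
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])  # only the chosen experts run
        return out

x = torch.randn(8, 64)
print(DenseFFN()(x).shape, MoEFFN()(x).shape)
```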