Knowledge Distillation

A technique that transfers knowledge from a large teacher model to a small student model, creating a lightweight yet high-accuracy model.

What is Knowledge Distillation?

Knowledge distillation is a technique in which a smaller "student model" is trained on the output distribution of a large "teacher model", using the teacher's predicted probabilities as training targets alongside, or in place of, ground-truth labels. By mimicking the teacher's outputs, the student model can retain much of the teacher's accuracy while significantly reducing the number of parameters.
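As a minimal sketch of what "training on the teacher's output distribution" can look like, the example below computes a soft-target loss for a single example in plain Python. It assumes the commonly used formulation with a softmax temperature and a weighted hard-label term; the names `distillation_loss`, `temperature`, and `alpha`, as well as the example logits, are illustrative assumptions and do not come from the text above.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term (assumed formulation)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # The T^2 factor keeps the soft term's gradient scale comparable across temperatures
    soft_term = (temperature ** 2) * kl
    # Ordinary cross-entropy against the ground-truth label (temperature 1)
    hard_term = -math.log(softmax(student_logits)[true_label])
    return alpha * soft_term + (1 - alpha) * hard_term

# Hypothetical logits over three classes
teacher = [4.0, 1.5, 0.2]
good_student = [3.8, 1.4, 0.3]   # close to the teacher
poor_student = [0.2, 1.5, 4.0]   # far from the teacher

loss_good = distillation_loss(good_student, teacher, true_label=0)
loss_poor = distillation_loss(poor_student, teacher, true_label=0)
print(loss_good < loss_poor)
```

A student whose logits track the teacher's receives the lower loss. In a real training loop the same quantity would be computed over batches with an autodiff framework and minimized with respect to the student's weights only.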

Why is Distillation Necessary?

Deploying an LLM with tens of billions of parameters directly in a production environment makes GPU costs and latency a business constraint. On the other hand, a small model trained from scratch rarely reaches the same level of accuracy as a large model. Distillation is a practical approach to this trade-off.

For example, Microsoft's Phi series trains small models using synthetic data generated by large models, and is reported to reach performance competitive with much larger models on some benchmarks despite being an SLM (Small Language Model).

Differences from Fine-Tuning

Fine-tuning is a technique that adjusts the weights of an existing model to specialize it for a specific task, without changing the model size. Distillation differs in that it trains a separate, smaller model, so the model size itself is reduced. In practice, a pipeline in which the model is first made smaller through distillation and then adapted to a business domain using LoRA or similar methods is becoming increasingly common.

Limitations of Distillation

Because the student learns from the teacher's outputs, tasks that the teacher model struggles with will generally also be difficult for the student model. Additionally, since a large volume of outputs must be generated from the teacher model, the computational cost of the distillation process itself cannot be overlooked.
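To give a rough sense of scale for the teacher-side cost, here is a back-of-the-envelope sketch. It assumes the common rule of thumb that a forward pass of a dense transformer costs about 2 FLOPs per parameter per token. The teacher size, token count, and sustained throughput are hypothetical figures chosen only for illustration, not measurements.

```python
def teacher_generation_flops(teacher_params, tokens_generated):
    """Approximate forward-pass cost: about 2 FLOPs per parameter per token (rule of thumb)."""
    return 2 * teacher_params * tokens_generated

# Hypothetical figures, not taken from any real project
TEACHER_PARAMS = 70e9          # assumed 70B-parameter teacher
TOKENS_GENERATED = 1e9         # assumed 1B tokens of teacher output
SUSTAINED_FLOPS_PER_SEC = 1e14 # assumed 100 TFLOP/s sustained per GPU

flops = teacher_generation_flops(TEACHER_PARAMS, TOKENS_GENERATED)
gpu_hours = flops / SUSTAINED_FLOPS_PER_SEC / 3600

print(f"{flops:.2e} FLOPs, roughly {gpu_hours:.0f} GPU-hours")
```

Autoregressive generation is often limited by memory bandwidth rather than raw compute, so actual GPU time may be considerably higher than this estimate suggests.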