Training data generated by AI. It supplements scarce real data and allows models to be trained and evaluated while protecting privacy.
Synthetic data refers to datasets artificially generated by AI or rule-based algorithms, rather than using real data directly. It is widely used for model training, evaluation, and distillation.
Real data faces three fundamental barriers: insufficient volume, inherent bias, and the inclusion of personally identifiable information. In the medical field, for example, image data for rare diseases is extremely scarce, and in finance, fraudulent transaction data often accounts for less than 0.1% of the total. Synthetic data is a practical means of bridging these gaps.
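For the class-imbalance problem above (e.g. fraud at under 0.1% of transactions), one common family of techniques is to synthesize extra minority-class examples by interpolating between real ones, in the spirit of SMOTE. The sketch below is a minimal, hypothetical illustration of that idea, not a production oversampler; the function name and toy feature vectors are invented for the example.

```python
import random

def interpolate_minority(samples, n_new, seed=0):
    """Create synthetic minority-class samples by interpolating
    between randomly chosen pairs of real samples (SMOTE-style)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        a, b = rng.choice(samples), rng.choice(samples)
        t = rng.random()  # interpolation weight in [0, 1]
        out.append([x + t * (y - x) for x, y in zip(a, b)])
    return out

# Example: 5 real fraud feature vectors expanded to 100 synthetic ones.
real_fraud = [[0.1, 2.0], [0.3, 1.5], [0.2, 1.8], [0.4, 2.2], [0.15, 1.9]]
synthetic = interpolate_minority(real_fraud, n_new=100)
```

Because each synthetic point lies on a segment between two real points, every feature stays within the range observed in the real data, which keeps the generated samples plausible but also means this approach cannot invent genuinely novel fraud patterns.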
Its combination with knowledge distillation is rapidly gaining traction. The pipeline involves feeding diverse prompts to a large teacher model to generate responses, then using that output as training data for a student model — a workflow validated by the success of the Microsoft Phi series.
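The teacher-to-student pipeline described above can be sketched as follows. This is a simplified illustration: `teacher_generate` stands in for a call to whatever large-model API is actually used, and the lambda below is a toy stand-in so the example is self-contained.

```python
def build_distillation_set(prompts, teacher_generate):
    """Collect (prompt, response) pairs from a teacher model.
    The resulting dataset is what the student model is fine-tuned on."""
    dataset = []
    for prompt in prompts:
        response = teacher_generate(prompt)
        dataset.append({"prompt": prompt, "response": response})
    return dataset

# Toy stand-in teacher; in practice this would call a large LLM.
toy_teacher = lambda p: f"Answer to: {p}"
pairs = build_distillation_set(
    ["What is distillation?", "Define synthetic data."], toy_teacher
)
```

In a real pipeline the prompt set is deliberately diversified (topics, difficulty, formats), since the breadth of the prompts largely determines what capabilities the student can inherit.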
Synthetic data is also used to build fine-tuning datasets. Having an LLM automatically generate Q&A pairs from internal documents, then using those pairs to improve the response quality of a RAG system, has proven effective in the author's own projects as well.
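A minimal sketch of that document-to-Q&A workflow is shown below. The `qa_generator` callable wraps the LLM prompt that produces question/answer pairs; the stub used here is purely illustrative, and the chat-message JSONL layout is one common fine-tuning format, not a required one.

```python
import json

def docs_to_qa_pairs(docs, qa_generator, pairs_per_doc=3):
    """Turn internal documents into fine-tuning examples by asking an
    LLM (wrapped by `qa_generator`) to write Q&A pairs for each document."""
    examples = []
    for doc in docs:
        for qa in qa_generator(doc, n=pairs_per_doc):
            examples.append({"messages": [
                {"role": "user", "content": qa["question"]},
                {"role": "assistant", "content": qa["answer"]},
            ]})
    return examples

def to_jsonl(examples):
    """Serialize examples one-per-line, as most fine-tuning APIs expect."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)

# Stub generator for illustration; a real one would prompt an LLM.
stub = lambda doc, n: [
    {"question": f"Q{i} about the document", "answer": doc} for i in range(n)
]
examples = docs_to_qa_pairs(["Refund policy: 30 days."], stub, pairs_per_doc=2)
```

Generated pairs should still be spot-checked by a human before training, since an LLM can produce questions the source document never answers.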
Training exclusively on synthetic data can lead to "model collapse," in which a model progressively reinforces its own output patterns. An operational design is essential: manage the mixing ratio with real data and build in regular human quality checks.
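Managing the mixing ratio can be as simple as capping the synthetic share of each training set. The helper below is one possible sketch of such a guard; the name, the 25% default used in the example, and the string-based toy data are all assumptions for illustration.

```python
import random

def mix_training_data(real, synthetic, synthetic_fraction=0.3, seed=0):
    """Build a training set in which synthetic examples make up at most
    `synthetic_fraction` of the total. Keeping real data in the majority
    is one safeguard against model collapse."""
    assert 0.0 <= synthetic_fraction < 1.0
    rng = random.Random(seed)
    # Number of synthetic examples needed to reach the target fraction.
    n_syn = min(len(synthetic),
                int(len(real) * synthetic_fraction / (1 - synthetic_fraction)))
    mixed = list(real) + rng.sample(list(synthetic), n_syn)
    rng.shuffle(mixed)
    return mixed

# Toy example: 60 real examples mixed with a 25% share of synthetic ones.
real = [f"real_{i}" for i in range(60)]
syn_pool = [f"syn_{i}" for i in range(100)]
mixed = mix_training_data(real, syn_pool, synthetic_fraction=0.25)
```

The appropriate fraction is task-dependent and best treated as a tunable hyperparameter, re-validated against held-out real data after each training run.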


An AI-powered digital twin is a system that integrates AI into digital replicas of physical assets or processes to perform real-time analysis, prediction, and optimization.

Shadow AI is the collective term for AI tools and services that employees use in their work without the approval of the company's IT department or management. It carries risks of information leakage and compliance violations.

Edge AI (on-device AI) is an architecture that runs AI inference on the device rather than in the cloud. It enables low latency, privacy protection, and offline operation.



A design approach that structurally minimizes the risk of personal data leakage by physically and logically isolating AI systems and their data-processing infrastructure. Typical examples include tenant separation and on-premises operation.