
Synthetic testing is a method for automatically evaluating AI systems using test cases generated from synthetic data. It "synthesizes" test inputs—such as edge cases, adversarial inputs, and regulatory requirements that cannot be covered by production logs alone—and uses them as the starting point for regression detection and quality assurance.
This article is intended for QA engineers, PMs, and developers at companies adopting AI. It covers the definition of synthetic testing, how it differs from LLM-as-a-Judge, the quality dimensions it can evaluate, a four-step implementation process, and common pitfalls. By the end, readers will be equipped to design an initial implementation of synthetic testing for their own AI projects.
Synthetic testing is a method that uses synthetic data to create test inputs, enabling continuous measurement of AI quality by covering scenarios that cannot be collected from real data. Related terms include "synthetic data" and "LLM-as-a-Judge," but each plays a distinct role. This section first clarifies the scope of synthetic testing and establishes how it should be used in conjunction with evaluators (judges).
Synthetic testing is a testing method that feeds "synthetic data" into the input side of an AI system and automatically determines whether the expected output is produced. In conventional software testing (such as unit tests and e2e tests), developers manually prepare input values and expected outputs. Synthetic testing, by contrast, is characterized by generating and transforming large volumes of test inputs to improve coverage.
Synthetic data refers to data that is artificially created using rules, statistical models, generative AI, or similar methods, rather than being directly copied from real data. The primary roles of synthetic data in synthetic testing can be organized into three categories:
In short, synthetic testing is a combination of "synthetic data × automated evaluation," with synthetic data serving as its fuel.
A term often confused with synthetic testing is LLM-as-a-Judge. The two have clearly distinct roles in AI evaluation.
| Aspect | Synthetic Testing | LLM-as-a-Judge |
|---|---|---|
| Role | Synthesizes test inputs to ensure coverage | Scores and judges test outputs |
| Primarily handles | Prompts, scenarios, adversarial inputs | Scores, pass/fail verdicts, feedback comments |
| Self-contained? | Requires a separate mechanism for scoring | Requires input data to score against |
In practice, the two are used together. A common workflow is to use synthetic testing to "generate 100 test cases" and then use LLM-as-a-Judge to "score the 100 responses." The design of the judge itself and how to craft judge prompts are covered in detail in a separate article: "What is LLM-as-a-Judge? A Method for Evaluating AI Output with AI and Implementing Hallucination Detection."
Generative AI responses can vary even given identical inputs. It has become necessary to proactively catch quality degradation that cannot be detected from real data alone by using synthetic data. In addition, with regulatory developments such as the EU AI Act—which has already come into full effect—mandating the submission of evaluation records across an expanding range of domains, the coverage provided by synthetic testing is becoming a prerequisite for audit compliance among AI developers.
In traditional software, collecting production logs was sufficient to cover most representative input patterns. For generative AI systems, that approach falls short for three reasons.
First, the conditions under which hallucinations occur are highly individual and rarely surface in logs. Since the same question does not necessarily produce the same hallucination each time, "collecting" real-world data does not enable reproducible regression tests.
Second, with adversarial inputs such as prompt injection, damage has already occurred by the time a user's attempt appears in the logs. Waiting for real data is too late; attack scenarios must be tested in advance using synthetic data — effectively extending the concept of penetration testing to AI systems.
Third, AI agents engage in extended interactions that include calls to external APIs and tools. The combinatorial space of edge cases grows exponentially, meaning production logs alone will always provide insufficient coverage.
Synthetic data makes it possible to freely generate these inputs — ones that either never appear in real data or cannot be waited upon — as synthetic tests.
The EU AI Act, now fully in force, requires documentation of testing and evaluation records for high-risk AI systems. In Japan, the Cabinet Office's AI Business Operator Guidelines and guidance from the Ministry of Economy, Trade and Industry (METI) and the Ministry of Internal Affairs and Communications (MIC) are also moving toward recommending risk-based evaluation.
What these frameworks have in common is the requirement to "anticipate risk scenarios in advance and verify through documented procedures that the system behaves as expected." Relying solely on real data risks a finding of "not verified" if an anticipated scenario never appears in production logs. Synthetic testing is well-suited to regulatory compliance because it allows risk scenarios to be deliberately constructed as synthetic data and evaluation results to be preserved in a reproducible form.
Synthetic testing is not a measure of a single quality metric — it can address functional quality, safety, and robustness in an integrated manner. Which dimension to prioritize depends on the use case of the AI system, but failing to design test cases across at least these three axes tends to produce AI that appears to work yet breaks down in production.
The most fundamental dimension is functional quality — the axis that measures whether the task given to the AI is being accomplished correctly. For RAG, this means "can the system cite the correct sources in response to a question?"; for summarization, "does it capture all key facts without omission?"; and for agents, "does it invoke tools in accordance with the user's intent?"
When measuring functional quality through synthetic testing, synthetic data is used to create both correct-answer patterns and typical error patterns for a given task, and the Judge is verified to be able to distinguish between the two. In practice, it is more stable to first fix the Judge's scoring criteria (accuracy, faithfulness, completeness, etc.) and then design test cases that are capable of scoring points along those criteria.
Safety is the dimension that measures whether the AI avoids producing responses it should not produce. Representative attack scenarios include prompt injection, jailbreaking, and leakage of confidential information. In synthetic testing, these attack strings are templatized and generated at scale to verify that AI guardrails function correctly.
A practical approach to synthetic attack scenarios is to start from publicly available red-teaming dictionaries and extend them for your own organization's context. For example, patterns such as "cause the system to output an entire operations manual" or "cause the system to disclose its internal system prompt" can be paraphrased and expanded into hundreds or thousands of variants. This enables coverage of paraphrase-based attacks that cannot be blocked by surface-level keyword filters.
Robustness is a perspective that measures whether an AI system degrades severely in response to unexpected inputs. While boundary value analysis is a fundamental technique in classical software testing, AI systems face an enormous variety of natural language-specific variations — such as Japanese mixed with English, colloquial Thai expressions, differences in numerical units, and extremely long inputs.
In synthetic testing, a single base input is prepared and systematically transformed (language switching, typo insertion, verbosity expansion, condensation, etc.) to generate dozens to hundreds of variants. If responses vary significantly for inputs with the same intent, the AI system has low robustness. In our company's operations, which provide multilingual services, we frequently synthesize inputs with the same intent across four languages — Japanese, English, Thai, and Lao — and visualize the quality differences across languages.
The practical approach to synthetic testing is to launch it incrementally, following the flow of "test case generation → scoring → operational integration." Rather than aiming for 100% coverage all at once, start running the evaluation loop from high-risk areas and work toward integrating it into CI/CD and regression detection.
Step 1 is defining the evaluation targets and quality criteria. Summarize on a single page "which AI features, from which perspectives (functional, safety, robustness), and what score constitutes a passing threshold." If this step is left ambiguous before moving on to generation, it becomes impossible to later judge whether test cases are good or bad.
Step 2 is synthetic test case generation. Generation methods can be broadly divided into three categories:
In practice, the hybrid approach is the most manageable. Always attach expected outputs (or pass/fail criteria) to generated cases so that the downstream Judge can score them.
Step 3 is automated scoring using LLM-as-a-Judge. Feed the hundreds to thousands of inputs generated through synthetic testing into the AI, and score the responses using a Judge prompt. The scoring dimensions should align with the quality criteria established in Step 1. For details on the accuracy of the Judge itself, please refer to the separate article.
Step 4 is integration into ongoing operations. Synthetic testing is not a one-time run; it should be embedded as a regression test to be executed whenever prompts are changed, models are updated, or the knowledge base is updated. Once the practice of "blocking production deployment if below the passing threshold" is established in CI, it becomes possible to move away from relying on individual judgment for quality decisions. Test cases themselves should also be added regularly, and any issues discovered in production must be incorporated into the synthetic tests (incident-driven coverage improvement).
Synthetic testing yields no results simply by being introduced — teams frequently stumble over biases in synthetic data and the operational design of evaluation metrics. We organize the typical failure patterns observed through our company's AI implementation support work, along with strategies to avoid them.
The most common failure is "judging quality using synthetic data alone." Test cases generated by an LLM tend to be biased toward expressions that the LLM handles well. This creates a gap where a system "passes synthetic tests but fails on the natural Japanese used by real production users."
The countermeasure is to always evaluate using both synthetic and real data, and to monitor the difference in scores. If the gap exceeds a certain threshold, treat it as a sign that the synthetic data distribution has drifted from production, and regenerate accordingly. In practice, a workable approach is to sample 50–100 cases from production logs each quarter and score them in parallel with the synthetic tests.
Another typical failure is the pattern where impressive numbers are produced during the PoC phase, but no one runs the tests once the system moves into production. There are two causes: first, the Judge cost is too high to run daily; second, the passing threshold is set so strictly that tests constantly fail, causing teams to become desensitized.
The countermeasure is to split the evaluation set into a "full" version and a "smoke" version. The smoke version contains only 30–50 representative cases and runs automatically on every deployment. The full version is used for release decisions on a weekly or monthly basis. Additionally, set the passing threshold loosely at first with the explicit plan to raise it gradually — tightening it only after the process is established — to prevent teams from becoming numb to constant failures.
Q1: How should synthetic tests be used alongside traditional unit tests and e2e tests?
Unit tests and e2e tests are methods for verifying the correctness of deterministic software, where inputs and expected outputs are fixed. Synthetic tests evaluate the quality "distribution" of AI systems, which are inherently non-deterministic, with passing defined as "achieving the target score rate." Using both in the same project is standard practice — a clean division where synthetic tests cover the AI components and traditional tests cover the surrounding APIs tends to be the most manageable approach.
Q2: Can the LLM used to generate synthetic data be the same model as the AI being evaluated?
This is not recommended. If the same model is used for both generation and response, the model's idiosyncratic tendencies will appear in both the test inputs and the outputs, making it impossible to detect the weaknesses that should actually be caught. It is preferable to use a three-layer separation: a different lineage (different vendor or different generation) for test case generation versus the production model, and yet another separate model for the Judge.
Q3: How many test cases should be prepared for synthetic testing?
It depends on the breadth of the use case, but a good starting point is approximately 300 cases: 100 for functional quality, 100 for safety, and 100 for robustness. A smoke version of 30–50 cases for automated runs on every CI cycle, and a full evaluation targeting 1,000 cases to be reached within six months to a year, aligns well with real-world operational experience.
Q4: For multilingual services, should synthetic tests be created separately for each language?
Yes, they should. Even with the same intended input, a model's response quality and safety behavior can vary significantly across languages. We recommend running synthetic tests in parallel across all languages supported by the service — such as Japanese, English, Thai, and Lao — and visualizing score differences between languages.
A synthetic test is a mechanism that uses LLM-as-a-Judge to automatically score test cases built from synthetic data, enabling continuous evaluation of an AI system's functional quality, safety, and robustness. It covers the non-determinism and edge cases that real data alone cannot address, and serves as a foundation supporting regulatory compliance, regression detection, and operational quality simultaneously.
The basic implementation follows four steps: "define what to evaluate → generate synthetic test cases → score with a Judge → integrate into CI/CD." With careful attention to bias in synthetic data and to sustaining operations over time, it is possible to move beyond a PoC and apply this approach to production quality assurance. For details on Judge design and how to structure Judge prompts, refer to "What is LLM-as-a-Judge? A Guide to Evaluating AI Output with AI and Implementing Hallucination Detection."
We support continuous quality assurance after AI adoption through evaluation loops that combine synthetic testing with LLM-as-a-Judge in this way. If you want to move your AI system beyond a one-off PoC and into sustained production operation, we recommend consulting with us from the evaluation design stage.

Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).