AI Agent Evaluation Design Guide — Tool Calling, Execution Traces, and Regression Detection

Lead

AI agent evaluation is a multi-layered verification process that examines not only the quality of final answers, but also the accuracy of tool calls and the validity of execution trajectories.

In standard LLM evaluation, it is sufficient to score a one-to-one correspondence between input and output. However, because agents achieve their goals through multi-step execution while calling external tools, there are cases where the final output appears correct even when intermediate decisions were wrong. Overlooking these "apparent correct answers" carries the risk of serious regressions in production environments.

This article provides a systematic explanation of every step in agent evaluation—from golden dataset design to tool call verification, trajectory scoring, regression detection under non-determinism, and CI integration. It is intended to give engineers and MLOps practitioners who are considering building an evaluation infrastructure the concrete steps they need to get started tomorrow.

Conclusion: AI agent evaluation must verify not only the accuracy of the final output, but also the appropriateness of tool calls and the validity of execution trajectories.

A fundamentally different approach from standard LLM evaluation is required. The following sections explain why, along with the specific evaluation perspectives involved.

The Need to Examine "Execution Trajectories," Not Just Final Outputs

It is tempting to assume that "if the final answer is correct, the evaluation is sufficient." In practice, however, there are cases where an answer appears correct but was reached through a flawed process—and this can lead to unexpected failures or erroneous operations in production environments.

AI agents achieve their goals by calling multiple tools. As a result, conventional LLM evaluation that measures only the quality of the final output will miss judgment errors that occur along the way.

For example, an agent that returns the correct answer after redundantly calling a search tool five times and an agent that returns the correct answer with a single appropriate call will receive the same final output evaluation score. Yet the former carries risks of cost overruns, increased latency, and unintended side effects.

There are three main reasons to evaluate execution trajectories:

  • Detection of tool misuse: Even when the correct answer is reached, unnecessary tool calls or incorrect argument passing may be lurking beneath the surface.
  • Understanding side-effect risks: In operations involving writes to external APIs or databases, errors in intermediate steps can cause irreversible damage.
  • Ensuring efficiency: The number of steps, token consumption, and number of tool calls are directly tied to operational costs.

As explained in detail in "What Is Harness Engineering? A Design Methodology for Structurally Preventing AI Agent Mistakes," visibility at the trajectory level is a prerequisite for structurally controlling agent behavior.

Why Non-Determinism and Multi-Step Processes Make Evaluation Hard

Even when the same prompt is executed ten times, an AI agent may generate a different tool call order or intermediate outputs each time. This is the problem of "non-determinism." In standard LLM evaluation, it is enough to compare whether the output matches the expected output. With agents, however, multiple valid solution paths exist, making simple string matching ineffective.

When multi-step reasoning is added, the complexity increases further.

  • Error propagation: If the wrong tool is called in step 1, the input to step 2 and beyond becomes contaminated. The final answer may then be wrong, or may appear correct only by coincidence.
  • Evaluation explosion: For a 5-step agent where each step has 3 valid options, there are already 3^5 = 243 possible trajectories to consider, and the count grows exponentially with the number of steps.
  • Invisibility of intermediate states: Side effects on external APIs or databases cannot be detected simply by examining the final output.

The perspective of conditional branching is also important. For agents with few steps and limited paths, comparison against a deterministic expected trajectory is effective. For agents with many steps and diverse paths, LLM-based trajectory evaluation or statistically designed pass/fail thresholds become the more practical options.

If evaluation is designed around "final answer accuracy" alone without understanding this structural difficulty, the risk of releasing to production with undetected tool misuse or wasteful calls increases significantly. As a starting point for evaluation design, it is important to explicitly map out both axes—non-determinism and number of steps.

How to Design an Overall Agent Evaluation Framework

Conclusion: In agent evaluation, it is essential to first clarify "what to measure, at what granularity, and when."

By dividing evaluation targets into offline and online categories and designing each layer—final answer, tool calls, trajectory, and cost—a thorough verification framework with no gaps can be established.

Roles of Offline and Online Evaluation

Offline evaluation is equivalent to "pre-shipment quality inspection," while online evaluation corresponds to "post-launch defect monitoring." Combining both prevents gaps in evaluation coverage.

Offline evaluation is conducted using a fixed golden dataset without relying on production traffic. Because it can be integrated into a CI pipeline and run automatically with every code or prompt change, it is well-suited for early detection of regressions. The primary targets are as follows:

  • Tool selection accuracy (whether the correct tool is being called)
  • Argument validity (types, value ranges, required fields)
  • Execution trajectory order (whether steps are arranged as expected)
  • Final answer quality (LLM scoring or rule-based matching)

Online evaluation, on the other hand, targets actual user requests and continuously analyzes logs and traces from the production environment. Its primary role is to capture edge cases that are difficult to reproduce offline, as well as changes in the input data distribution (data drift).

The division of roles between the two can be summarized as follows:

Perspective | Offline Evaluation | Online Evaluation
Timing | Pre-deployment | Post-deployment / Continuous
Data | Golden set | Production logs
Primary detection targets | Regressions / Spec deviations | Drift / Unknown patterns
Cost | Low–Medium | Medium–High

An important point is that even agents that pass offline evaluation may exhibit unexpected tool-chaining behavior in production.

Evaluation Layers (Final Answer, Tool Calls, Trajectory, Cost)

It is tempting to think that "checking the accuracy of the final answer is sufficient," but in practice, multi-layered evaluation that includes tool call validity and execution trajectories enables earlier detection of problems.

It is effective to design AI agent evaluation across the following four layers:

  • Final Answer Layer: Verifies that the output to the user is accurate and free of hallucinations. The same techniques used in conventional LLM evaluation (exact match, LLM-as-Judge, etc.) can be applied, but this alone is insufficient.
  • Tool Call Layer: Verifies which tools were called, with what arguments, and how many times. For example, an inefficient pattern such as "calling the search tool 5 times with the same query" is problematic even if the final answer is correct.
  • Trajectory Layer: Scores whether the order of steps, branching, and interruption decisions match expectations. Execution logs are compared against the golden set and scored on a per-step basis.
  • Cost / Resource Layer: Measures token consumption, number of API calls, and latency. Agents whose costs exceed budget cannot be deployed to production, so it is important to monitor this in conjunction with the consumption management practices introduced in "What Is the Token Trap? Practical Consumption Management to Prevent Hidden Cost Explosions in AI Agents."

The reason all four layers are evaluated simultaneously is that each layer can degrade independently.
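To make the independence of the layers concrete, here is a minimal Python sketch of a per-run result record with one field per layer. The field names and the limits (`min_trajectory`, `max_tokens`) are illustrative assumptions, not values taken from this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayeredResult:
    """One evaluation run, scored separately on each of the four layers."""
    final_answer_ok: bool     # Final Answer Layer
    tool_calls_ok: bool       # Tool Call Layer
    trajectory_score: float   # Trajectory Layer, 0.0 to 1.0
    total_tokens: int         # Cost / Resource Layer


def failing_layers(result: LayeredResult,
                   min_trajectory: float = 0.8,
                   max_tokens: int = 4000) -> list[str]:
    """Return the layers that miss their limits (limits are hypothetical)."""
    failures = []
    if not result.final_answer_ok:
        failures.append("final_answer")
    if not result.tool_calls_ok:
        failures.append("tool_calls")
    if result.trajectory_score < min_trajectory:
        failures.append("trajectory")
    if result.total_tokens > max_tokens:
        failures.append("cost")
    return failures


# A run whose answer is correct but whose cost layer has degraded.
run = LayeredResult(final_answer_ok=True, tool_calls_ok=True,
                    trajectory_score=0.9, total_tokens=9500)
print(failing_layers(run))  # ['cost']
```

Reporting the failing layers by name, rather than a single pass/fail flag, keeps a cost regression visible even when the answer itself is still correct.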

Step 1: How to Prepare a Golden Dataset

Conclusion: The quality of evaluation is determined by the design of the golden dataset. Preparing a three-part set consisting of inputs, expected tool sequences, and expected outputs is the starting point.

Unlike standard QA, agent evaluation requires cases that define not only the expected answer but also "which tools to call and in what order." The following sections explain the design methodology and continuous updates from operational logs in detail.

Agent-Specific Test Case Design (Inputs, Expected Tool Sequences, Expected Results)

Teams in the PoC stage often hit the wall of "wanting to build a golden dataset but not knowing what to define or how far to go."

Unlike standard LLM tests, agent evaluation cases require three elements to be defined as a set:

  • Input: The user's utterance, context, and a list of available tools
  • Expected Tool Sequence: The names and order of tools to be called, along with the argument schema for each step
  • Expected Output: The requirements for the final answer (often defined as "information that should be included" rather than an exact match)

The granularity of the expected tool sequence is key. When defining "search → aggregation → answer generation" as the correct sequence, it is necessary to decide in advance whether cases where an unnecessary API call is inserted in the middle, or cases where aggregation is skipped, should be treated as "partial credit."

As a concrete example, an inventory inquiry agent could be defined as follows:
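One possible way to write such a case down is sketched here in Python. The tool names (`search_inventory`, `aggregate_stock`, `send_email`), the SKU, and the field names are all hypothetical and chosen only for illustration; the structure simply mirrors the three elements above (input, expected tool sequence, expected output).

```python
from dataclasses import dataclass, field


@dataclass
class ExpectedToolCall:
    tool: str                    # tool name expected at this step
    arg_schema: dict[str, type]  # argument name -> expected type


@dataclass
class GoldenCase:
    case_id: str
    user_input: str
    available_tools: list[str]
    expected_tools: list[ExpectedToolCall]
    # "Information that should be included" rather than an exact answer.
    must_include: list[str] = field(default_factory=list)


inventory_case = GoldenCase(
    case_id="inventory-001",
    user_input="How many units of SKU A-100 are in the Tokyo warehouse?",
    available_tools=["search_inventory", "aggregate_stock", "send_email"],
    expected_tools=[
        ExpectedToolCall("search_inventory", {"sku": str, "warehouse": str}),
        ExpectedToolCall("aggregate_stock", {"sku": str}),
    ],
    must_include=["A-100", "Tokyo"],
)


def answer_meets_requirements(case: GoldenCase, answer: str) -> bool:
    """Check that every required fragment appears in the final answer."""
    lowered = answer.lower()
    return all(fragment.lower() in lowered for fragment in case.must_include)


print(answer_meets_requirements(inventory_case,
                                "SKU A-100: 42 units in Tokyo."))  # True
```

Whether an extra call inserted between the two expected steps earns partial credit is a policy decision that this structure leaves open on purpose.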

Generating and Updating Evaluation Cases from Production Logs

A golden set designed by hand is, in a sense, a "snapshot." Because the execution trajectories that agents actually follow in production change day by day, evaluation cases will quickly become outdated unless operational logs are continuously incorporated.

The following workflow is effective for converting operational logs into evaluation cases:

  • Log collection: For each request, record the input, tool call sequence, arguments, and final output as a set.
  • Filtering: Prioritize extracting cases where users explicitly provided negative feedback, cases where tool call errors occurred, and cases with an abnormally high number of steps.
  • Labeling: Annotate the extracted logs with the expected tool sequence and expected output—either manually or with LLM assistance—and add them to the golden set.
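The filtering step of this workflow can be sketched roughly as follows. The `LoggedRun` record and the `max_steps` cutoff of 8 are assumptions made for the example; a real log schema will differ.

```python
from dataclasses import dataclass


@dataclass
class LoggedRun:
    request_id: str
    tool_calls: list[str]
    had_tool_error: bool = False
    negative_feedback: bool = False


def select_candidates(runs: list[LoggedRun],
                      max_steps: int = 8) -> dict[str, list[str]]:
    """Map request IDs to the reasons they are worth labeling."""
    candidates = {}
    for run in runs:
        reasons = []
        if run.negative_feedback:
            reasons.append("negative_feedback")
        if run.had_tool_error:
            reasons.append("tool_error")
        if len(run.tool_calls) > max_steps:
            reasons.append("too_many_steps")
        if reasons:
            candidates[run.request_id] = reasons
    return candidates


logs = [
    LoggedRun("r1", ["search_db", "answer"]),
    LoggedRun("r2", ["search_db"] * 12),
    LoggedRun("r3", ["search_db", "answer"], negative_feedback=True),
    LoggedRun("r4", ["search_db"], had_tool_error=True),
]
print(select_candidates(logs))
# {'r2': ['too_many_steps'], 'r3': ['negative_feedback'], 'r4': ['tool_error']}
```

Keeping the selection reason next to each candidate gives the annotator context during the labeling step.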

As a guideline for update frequency, it is practical to use two triggers: when a model or prompt is changed, and when a certain volume of production traffic has accumulated. Since re-annotating all cases from scratch every time a change is made is labor-intensive, an "incremental update" approach—adding only the differential cases—is recommended.

The following points should be noted when utilizing logs:

Step 2: How to Evaluate the Correctness of Tool Calls

Conclusion: Evaluating tool calls along three axes — selection, arguments, and order — is the foundation of quality assurance.

Once a golden dataset is in place, the next step is to scrutinize individual tool calls. Even if the final answer is correct, latent risks remain if incorrect tools or arguments were used internally.

Validating Tool Selection, Arguments, and Call Order

Validating tool calls requires independently evaluating three axes: "what was called," "with what arguments," and "in what order." Even if the final answer is correct, errors along any of these axes will degrade system reliability.

Tool selection validation compares the expected tool name against the actually invoked tool name using exact match or normalized match.

  • Expected: search_db → Actual: search_db
  • Expected: search_db → Actual: search_web (an alternative tool was selected) ✗

When multiple tools exist, whether selecting a semantically similar alternative tool counts as "correct" must be defined in advance through an evaluation policy.
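A normalized match with an explicit alternatives policy might look like the sketch below. The normalization rules and the `ACCEPTED_ALTERNATIVES` table (including the tool name `query_db`) are hypothetical; the point is only that the policy is written down as data rather than decided case by case.

```python
def normalize(name: str) -> str:
    """Normalize a tool name so cosmetic differences do not cause failures."""
    return name.strip().lower().replace("-", "_")


# Hypothetical evaluation policy: tools accepted in place of the expected one.
ACCEPTED_ALTERNATIVES = {
    "search_db": {"search_db", "query_db"},
}


def tool_selection_ok(expected: str, actual: str,
                      alternatives: dict[str, set[str]] = ACCEPTED_ALTERNATIVES) -> bool:
    expected_n, actual_n = normalize(expected), normalize(actual)
    # Without a policy entry, only the expected tool itself is accepted.
    allowed = alternatives.get(expected_n, {expected_n})
    return actual_n in allowed


print(tool_selection_ok("search_db", "Search-DB"))   # True
print(tool_selection_ok("search_db", "search_web"))  # False
print(tool_selection_ok("search_db", "query_db"))    # True
```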

Argument validation requires more fine-grained evaluation than tool selection. When arguments are structured data (JSON), it is effective to compare key names and value types/ranges rather than requiring an exact match. For free-form arguments (such as search queries), a semantic similarity score with a defined threshold should be used to determine pass/fail rather than requiring an exact match.
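For structured arguments, a key, type, and range check could be sketched as follows. `ArgRule` and the example rules are assumptions for illustration. Semantic similarity for free-form arguments is deliberately left out, since it depends on the embedding or scoring model you choose.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArgRule:
    type_: type
    required: bool = True
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def validate_args(args: dict, rules: dict[str, ArgRule]) -> list[str]:
    """Return a list of problems; an empty list means the arguments conform."""
    errors = []
    for key, rule in rules.items():
        if key not in args:
            if rule.required:
                errors.append(f"missing required argument: {key}")
            continue
        value = args[key]
        if not isinstance(value, rule.type_):
            errors.append(f"{key}: expected {rule.type_.__name__}, "
                          f"got {type(value).__name__}")
            continue
        if rule.min_value is not None and value < rule.min_value:
            errors.append(f"{key}: {value} is below the minimum {rule.min_value}")
        if rule.max_value is not None and value > rule.max_value:
            errors.append(f"{key}: {value} is above the maximum {rule.max_value}")
    for key in args:
        if key not in rules:
            errors.append(f"unexpected argument: {key}")
    return errors


rules = {
    "sku": ArgRule(str),
    "limit": ArgRule(int, required=False, min_value=1, max_value=100),
}
print(validate_args({"sku": "A-100", "limit": 10}, rules))  # []
print(validate_args({"limit": 500}, rules))
# ['missing required argument: sku', 'limit: 500 is above the maximum 100']
```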

Call order validation confirms that steps with dependencies are sequenced correctly. For example, if the order "retrieve user information, then search order history" is reversed, there is a risk that subsequent steps will reference a user ID that does not yet exist. The correctness of the order can be quantified using edit distance (Levenshtein Distance) or subsequence matching.
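As a rough illustration of the edit-distance approach, the following sketch computes the Levenshtein distance over sequences of tool names and turns it into a 0 to 1 score. Dividing by the longer sequence length is one possible normalization, assumed here for simplicity; the tool names are hypothetical.

```python
def edit_distance(expected: list[str], actual: list[str]) -> int:
    """Levenshtein distance between two tool-name sequences."""
    previous = list(range(len(actual) + 1))
    for i, exp_tool in enumerate(expected, start=1):
        current = [i]
        for j, act_tool in enumerate(actual, start=1):
            substitution = previous[j - 1] + (exp_tool != act_tool)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def order_score(expected: list[str], actual: list[str]) -> float:
    """1.0 for an identical sequence, lower as more edits are needed."""
    longest = max(len(expected), len(actual))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(expected, actual) / longest


expected = ["get_user", "search_orders"]
print(order_score(expected, ["get_user", "search_orders"]))   # 1.0
print(order_score(expected, ["search_orders", "get_user"]))   # 0.0
print(order_score(["a", "b", "c"], ["a", "x", "b", "c"]))     # 0.75
```

Edit distance penalizes a reversed pair heavily, which suits steps with hard dependencies; subsequence matching is more lenient toward inserted extra steps.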

The strictness of evaluation should be adjusted depending on the case.

Detecting Unnecessary Calls and Loops

After verifying tool selection and argument correctness, what is often overlooked in practice is the problem of "redundant calls" and "infinite loops." Have your tests confirmed that the agent is not repeatedly calling the same tool or executing unnecessary searches multiple times?

Redundant calls refer to cases where tool executions unnecessary for task completion are mixed in. For example, if a simple calculation task calls an external API 7 times when only 3 calls are needed, costs and latency balloon unnecessarily.

The following metrics are used for detection:

  • Call count ceiling check: Issue a warning when the actual number of calls exceeds a certain ratio relative to the expected call count in the golden set
  • Duplicate call detection: Detect consecutive calls to the same tool with the same arguments from logs
  • Invalid transition detection: Identify cyclic patterns such as Tool A → Tool B → Tool A using a trajectory graph
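The three checks above can be sketched in a few lines of Python. The call-log shape (`{"tool": ..., "args": ...}`) and the ratio of 1.5 are assumptions for the example. Note also that an A → B → A pattern is not always wrong, so the last check is better treated as a flag for review than as an automatic failure.

```python
import json


def call_key(call: dict) -> str:
    """Stable key for a call: same tool and same arguments give the same key."""
    return json.dumps({"tool": call["tool"], "args": call["args"]}, sort_keys=True)


def exceeds_call_ceiling(actual_count: int, expected_count: int,
                         max_ratio: float = 1.5) -> bool:
    return actual_count > expected_count * max_ratio


def consecutive_duplicates(calls: list[dict]) -> list[int]:
    """Indices of calls that repeat the immediately preceding call."""
    keys = [call_key(c) for c in calls]
    return [i for i in range(1, len(keys)) if keys[i] == keys[i - 1]]


def has_ping_pong(tools: list[str]) -> bool:
    """True if a Tool A -> Tool B -> Tool A pattern (A != B) appears."""
    return any(tools[i] == tools[i + 2] and tools[i] != tools[i + 1]
               for i in range(len(tools) - 2))


calls = [
    {"tool": "search", "args": {"q": "stock"}},
    {"tool": "search", "args": {"q": "stock"}},
    {"tool": "fetch", "args": {"id": 1}},
    {"tool": "search", "args": {"q": "stock"}},
]
print(exceeds_call_ceiling(7, 3))                   # True
print(consecutive_duplicates(calls))                # [1]
print(has_ping_pong([c["tool"] for c in calls]))    # True
```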

Loop detection is especially important. A "stuck loop," in which an agent continues to retry the same call after receiving an error response, is a classic failure pattern where the context window is consumed without ever reaching a final answer.

There are two key implementation points:

  • Set a maximum step count (e.g., 20 steps) in the evaluation harness, and automatically treat any case that exceeds it as a failure
  • Explicitly incorporate loop detection logic as assertions in the golden set
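A maximum step count can be enforced in the harness with a small wrapper like the one below. It assumes the agent exposes its steps as a Python iterator, which is a simplification; `stuck_agent` is a stand-in that imitates a stuck loop.

```python
from typing import Iterator


def run_with_step_limit(agent_steps: Iterator[str], max_steps: int = 20) -> dict:
    """Consume agent steps and fail the case once max_steps is exceeded."""
    trajectory = []
    for step in agent_steps:
        if len(trajectory) >= max_steps:
            # The agent tried to take step max_steps + 1: stop and fail.
            return {"status": "failed", "reason": "max_steps_exceeded",
                    "steps": len(trajectory)}
        trajectory.append(step)
    return {"status": "completed", "reason": None, "steps": len(trajectory)}


def stuck_agent() -> Iterator[str]:
    """Stand-in for an agent that retries the same call forever."""
    while True:
        yield "search_db"


print(run_with_step_limit(stuck_agent(), max_steps=20))
# {'status': 'failed', 'reason': 'max_steps_exceeded', 'steps': 20}
print(run_with_step_limit(iter(["search_db", "answer"])))
# {'status': 'completed', 'reason': None, 'steps': 2}
```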

Step 3: How to Score Execution Trajectories

Conclusion: Scoring execution trajectories is fundamentally a combination of a step-level approach that measures the degree of match against a correct trajectory, and flexible evaluation using an LLM as a judge.

The role of trajectory scoring is to evaluate the "connections between steps" that cannot be captured by individual tool call validation alone. The following sections explain in detail how to use trajectory matching and LLM-as-judge appropriately.

Trajectory Matching and Step-Level Scoring

It is tempting to think that simply evaluating whether a trajectory exactly matches the golden trajectory is sufficient. In practice, however, scoring which steps are correct and which have broken down at a step-by-step level enables a faster improvement cycle.

Trajectory matching is a technique that compares the sequence of steps an agent actually took against the expected sequence of steps in the golden dataset. The granularity of comparison falls into two broad levels.

Exact Match

  • Determines in binary whether tool names, arguments, and call order all match
  • Low implementation cost; well-suited for regression detection equivalent to unit testing
  • Prone to spurious failures (false regression alarms) caused by subtle variations in how arguments are expressed (e.g., string normalization differences)

Step-Level Scoring

  • Scores each step independently and aggregates the scores across the entire trajectory
  • Example scoring axes: correctness of tool selection (0/1), semantic match of arguments (0–1), appropriateness of call timing (0/1)
  • The final score is typically designed as "number of correct steps ÷ number of expected steps" as a baseline, with a penalty added for unnecessary extra steps

Introducing step-level scoring makes it possible to visualize partial correctness, such as "the overall run failed, but the first 3 steps were correct." This narrows the focus of debugging and clarifies exactly where prompts or tool definitions need to be revised.
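A minimal sketch of this baseline formula follows. It compares tool names position by position and subtracts a fixed penalty per extra step. The positional comparison, the penalty of 0.1, and the tool names are assumptions for illustration, and argument similarity is omitted to keep the example short.

```python
def per_step_report(expected: list[str], actual: list[str]) -> list[bool]:
    """For each expected step, whether the actual run matched it in place."""
    return [i < len(actual) and actual[i] == exp
            for i, exp in enumerate(expected)]


def step_level_score(expected: list[str], actual: list[str],
                     extra_step_penalty: float = 0.1) -> float:
    """correct steps / expected steps, minus a penalty per extra step."""
    if not expected:
        return 1.0
    correct = sum(per_step_report(expected, actual))
    extra_steps = max(0, len(actual) - len(expected))
    return max(0.0, correct / len(expected) - extra_step_penalty * extra_steps)


expected = ["search", "aggregate", "answer", "notify"]
actual = ["search", "aggregate", "answer", "send_email", "notify"]

print(per_step_report(expected, actual))  # [True, True, True, False]
print(round(step_level_score(expected, actual), 2))  # 0.65
```

The per-step report shows that the first three steps were correct even though the run as a whole did not match, which is the partial-correctness view described above.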

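As an implementation note, comparing arguments by exact string match alone can make evaluation overly strict. Normalizing values, or using a similarity score with a defined threshold, avoids failing arguments that are semantically equivalent.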

When to Use LLM-Based Trajectory Evaluation

There are cases that rule-based trajectory matching cannot handle. For example, when multiple patterns of tool call sequences exist to achieve the same goal, all patterns must be treated as "correct answers." LLM-based trajectory evaluation proves its worth in situations that require this kind of flexible scoring.

The cases where LLM evaluation should and should not be used are relatively clear-cut. If you want to judge the validity of a trajectory at the "semantic level," LLM evaluation is effective; if you only need to verify exact matches of tool names and arguments, rule-based approaches are faster and more stable.

There are three main use cases. The first is tolerating alternative paths—determining whether a different sequence of steps from the expected trajectory still leads to the correct final result. The second is detecting redundant steps—evaluating in context whether loops or unnecessary retries are present. The third is qualitative scoring of partial failures—assigning graduated scores to cases where an error occurred mid-step but was recovered from.

As an implementation note, it is important to explicitly provide scoring criteria to the evaluation LLM via a system prompt. Passing specific evaluation axes such as "was each step necessary to achieve the goal?" and "were there any unnecessary tool calls?" makes it easier to reduce variance in scoring. In addition, since LLM evaluation itself is non-deterministic, it is recommended to score the same trajectory multiple times and record the distribution of scores.
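The repeated-scoring idea can be sketched as below. The `judge` callable is a placeholder for whatever LLM client you use (no specific API is assumed), and `fake_judge` is a deterministic stand-in so the example runs without a model. The rubric text reuses the two evaluation axes mentioned above.

```python
import statistics
from itertools import cycle
from typing import Callable

RUBRIC = (
    "Score the trajectory from 0.0 to 1.0. Consider: "
    "Was each step necessary to achieve the goal? "
    "Were there any unnecessary tool calls?"
)


def score_distribution(judge: Callable[[str, str], float],
                       trajectory: str, repeats: int = 5) -> dict:
    """Score the same trajectory several times and summarize the spread."""
    scores = [judge(RUBRIC, trajectory) for _ in range(repeats)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "spread": max(scores) - min(scores),
    }


# Stand-in for a real LLM call: alternates between two scores.
_fake_scores = cycle([0.8, 1.0])


def fake_judge(system_prompt: str, trajectory: str) -> float:
    return next(_fake_scores)


summary = score_distribution(fake_judge, "search_db -> aggregate -> answer",
                             repeats=4)
print(summary["scores"])  # [0.8, 1.0, 0.8, 1.0]
```

A large spread for the same trajectory suggests the rubric is too vague, which is worth fixing before trusting the mean.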

Step 4: How to Detect Regressions Under Non-Determinism

Even if you run the same prompt 10 times, you may get a different answer each time. This is the fundamental difficulty of AI agent evaluation.

In traditional software testing, "same input → same output" is a given, but that assumption breaks down with LLM-based agents. Even if one run passes, the next run might take a different path and fail. If you're monitoring for regressions based solely on pass/fail results from a single execution, you'll miss this kind of unstable degradation entirely.

So how do you detect it? The basic approach is a combination of "multiple trials + statistical thresholds." Run the same case multiple times, and only flag a "regression" when the success rate falls below a certain threshold. For example, you might set a criterion like "must succeed at least 8 out of 10 times; otherwise the case fails." The threshold level should vary based on the importance of the task. For critical tool calls, 90% or higher may be required, while 70% might be acceptable for auxiliary summarization processes.
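The "multiple trials + threshold" rule reduces to very little code. In this sketch `run_case` is a hypothetical callable that executes one trial and returns whether it passed; a fixed list of outcomes stands in for a real agent.

```python
from typing import Callable


def pass_rate(run_case: Callable[[], bool], trials: int = 10) -> float:
    """Run the same case several times and return the fraction that passed."""
    passes = sum(1 for _ in range(trials) if run_case())
    return passes / trials


def is_regression(rate: float, threshold: float) -> bool:
    """Flag a regression only when the pass rate falls below the threshold."""
    return rate < threshold


# Stand-in for a non-deterministic agent: 7 passes and 3 failures.
outcomes = iter([True] * 7 + [False] * 3)
rate = pass_rate(lambda: next(outcomes), trials=10)

print(rate)                       # 0.7
print(is_regression(rate, 0.8))   # True  (stricter, "8 out of 10" style case)
print(is_regression(rate, 0.7))   # False (more lenient auxiliary case)
```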

When integrating into CI, running the full test suite every time is costly and time-consuming, so it's practical to define a separate subset specifically for regression detection. Prioritize cases that have failed at least once in the past, edge cases prone to behavioral changes, and flows with significant business impact. Running this subset as a lightweight pass and switching to the full suite only when suspicious results appear creates a two-stage approach that balances speed and coverage.

Multiple Trials and Flakiness Mitigation (Designing Pass/Fail Thresholds)

Deciding pass or fail for a highly non-deterministic agent based on a single execution is like rolling a die once to judge quality. Even with the same prompt, each run can produce different tool call sequences and intermediate outputs, making it impossible to distinguish from a single trial whether something is "truly broken or just an outlier."

Ensuring Stability Through Multiple Trials

An effective approach is to run the same test case multiple times and calculate a score based on the pass rate.

  • Guideline for number of trials: Run lightweight unit-equivalent tests a smaller number of times; secure more runs for important scenarios
  • Designing pass rate thresholds: Vary the threshold based on case importance, such as "PASS if a high proportion of runs succeed out of a set number"
  • Flaky determination: Flag cases with a borderline pass rate as "flaky" rather than treating them as immediate failures—route them to an investigation queue instead
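A three-way classification along these lines might look like the following sketch. The default values for `pass_threshold` and `flaky_floor` are placeholders, not recommendations; the guide leaves the concrete numbers to your own validation.

```python
def classify(passes: int, trials: int,
             pass_threshold: float = 0.8,
             flaky_floor: float = 0.5) -> str:
    """Return PASS, FLAKY (send to the investigation queue), or FAIL."""
    rate = passes / trials
    if rate >= pass_threshold:
        return "PASS"
    if rate >= flaky_floor:
        return "FLAKY"
    return "FAIL"


print(classify(9, 10))  # PASS
print(classify(6, 10))  # FLAKY
print(classify(2, 10))  # FAIL

# A stricter case: with both values at 1.0, a single failed run is a FAIL.
print(classify(9, 10, pass_threshold=1.0, flaky_floor=1.0))  # FAIL
```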

Guidelines for Designing Pass/Fail Thresholds

Thresholds should be differentiated by case type.

Case Type | Recommended Threshold
Safety & compliance | 100% (FAIL on even a single failure)
Core business flows | High pass rate required
Exploratory & creative tasks | Some variance tolerated

An asymmetric design—requiring 100% for safety-related cases without exception while tolerating variance for creative tasks—is the practical approach. It is recommended that specific thresholds be determined through your own validation, based on your organization's risk tolerance and the importance of each case.

CI Integration and Regression Budgets

It's tempting to think that "manually running evaluation scripts locally is sufficient," but in practice, integrating them into a CI (Continuous Integration) pipeline and running them automatically per pull request is far more effective for early regression detection.

Key points to keep in mind when integrating into CI are as follows:

  • Separating fast lanes and slow lanes: Running all cases every time inflates cost and time. A practical approach is to run a "smoke test" targeting only core scenarios on pull requests, and execute the full suite in a nightly build
  • Embedding flaky determination thresholds into CI configuration: Explicitly define the "pass M out of N runs" threshold designed in the previous section as the CI pass/fail condition. By failing a build only when the pass count falls below that threshold, you can reduce false positives caused by noise
  • Setting a regression budget: Define an upper limit on token consumption, API call count, and execution time per CI run as a "regression budget." When the number of cases exceeding the budget increases, treat it as a signal to revisit the golden set

The reason for establishing a regression budget is to prevent the cost of evaluation itself from growing without bound. Tool call evaluation and trajectory scoring may use an LLM as the evaluator, so the evaluation phase itself carries the risk of cost explosion, as described in "What Is the Token Trap? Practical Consumption Management to Prevent Hidden Cost Explosions in AI Agents."
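A CI gate combining the "pass M out of N" condition with budget reporting could be sketched as follows. The limit values and usage figures are invented for the example. Budget overruns are returned as warnings rather than failing the build, which is one possible reading of the practice described above.

```python
def budget_violations(usage: dict[str, float],
                      limits: dict[str, float]) -> list[str]:
    """List every resource whose usage exceeds its limit."""
    return [f"{name}: {usage[name]} > {limit}"
            for name, limit in limits.items()
            if usage.get(name, 0) > limit]


def ci_gate(passes: int, min_passes: int,
            usage: dict[str, float],
            limits: dict[str, float]) -> tuple[int, list[str]]:
    """Return (exit code, budget warnings); exit code 1 fails the build."""
    exit_code = 0 if passes >= min_passes else 1
    return exit_code, budget_violations(usage, limits)


# Hypothetical per-run regression budget.
LIMITS = {"tokens": 200_000, "api_calls": 500, "seconds": 600}
usage = {"tokens": 250_000, "api_calls": 120, "seconds": 300}

print(ci_gate(passes=8, min_passes=8, usage=usage, limits=LIMITS))
# (0, ['tokens: 250000 > 200000'])
print(ci_gate(passes=6, min_passes=8, usage=usage, limits=LIMITS)[0])  # 1
```

In a CI script the exit code would typically be passed to `sys.exit`, with the warnings written to the job log.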

Common Pitfalls and How to Avoid Them

Conclusion: Pitfalls in evaluation design often stem from a mismatch in "what is being measured." It is important to understand typical failure patterns and take preventive measures in advance.

Even with an evaluation framework in place, oversights can occur due to data staleness or scope mismatches. Each section below gives a concrete explanation of representative failure examples and how to address them.

Continuing to Use Evaluation Data That Has Drifted from Production Logs

"Evaluation scores are high, yet unexpected tool calls occur frequently in production"——does this situation sound familiar?

The root cause of this gap is often that evaluation data does not reflect actual user behavior in production. If you continue using a golden set manually created at the PoC stage, the divergence from new input patterns accumulating in production logs will keep widening.

There are three main triggers for degradation. The first is input distribution shift: early on, the expected question formats dominate, but as operation progresses, abbreviations, colloquialisms, and compound tasks increase. The second is tool specification changes: external API response formats change, yet the expected values in evaluation cases are left unupdated. The third is the failure to reflect new features—even after adding new tools to an agent, regression tests continue to run against only the old golden set.

To sum up the countermeasure to these problems in a single phrase: "treat evaluation data as a living organism." The specific operational approach is as follows.

Focusing Only on Final Answers and Missing Tool Misuse

An answer that appears correct may have been generated through a dangerous process. This is like a physician's report showing the right conclusion while having skipped all examinations. When only the final output is used as an evaluation metric, you will keep missing these "correct answer, wrong path" cases.

As a typical failure pattern, consider first the case of "reaching the correct answer by calling unnecessary tools multiple times." The final answer passes, yet latency and cost exceed acceptable limits in the production environment. Next, in the case of "accidentally returning the correct result despite passing wrong arguments," the same code breaks the moment an external API specification changes. Furthermore, when "routing through a tool that should not be used," calls that cross security boundaries go unlogged and become a problem during audits. Because all of these accumulate while final answer scores remain high, they cannot be detected by conventional accuracy evaluation.

So how do you detect them? An effective approach is to add a tool call log verification step to the evaluation harness. Mechanically checking these three points alone will prevent most oversights: the match rate between the expected tool sequence and the actual call sequence, argument schema conformance for each call, and whether the call count exceeds the upper limit.

As explained in "What Is Harness Engineering? A Design Methodology for Structurally Preventing AI Agent Mistakes," structuring the evaluation harness itself enables automated detection of tool misuse. An evaluation design that tracks only the correctness of the final answer makes surface-level quality appear high while quietly accumulating operational risk.

Frequently Asked Questions (FAQ)

Q1. What size should I start with for a golden dataset?

Starting with around 20–50 cases is practical. First, prioritize collecting "failure-prone patterns" from production logs and PoC results, and build out the evaluation pipeline with a small set. Afterward, an approach of gradually expanding it by continuously incorporating operational logs tends to make it easier to maintain a balance between quality and effort.


Q2. How do I determine pass/fail for a highly non-deterministic agent?

A common method is to run the same case multiple times and judge whether the success rate meets a threshold (e.g., 4 out of 5 runs). Set the threshold according to business risk. In CI, a graduated policy such as "warn if the threshold is missed once, fail if missed twice in a row" can suppress flaky false positives.
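The "warn once, fail twice in a row" rule can be expressed as a small function over the history of CI runs. The representation of the history as a list of booleans (oldest first) is an assumption for this sketch.

```python
def ci_status(threshold_met_history: list[bool]) -> str:
    """Decide the status of the latest CI run from the per-run history.

    Each entry says whether that run met its pass-rate threshold.
    """
    if not threshold_met_history or threshold_met_history[-1]:
        return "pass"
    missed_previous_run = (len(threshold_met_history) >= 2
                           and not threshold_met_history[-2])
    return "fail" if missed_previous_run else "warn"


print(ci_status([True, True]))    # pass
print(ci_status([True, False]))   # warn
print(ci_status([False, False]))  # fail
```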


Q3. Should I use an LLM to evaluate tool calls? How does it differ from a rule-based approach?

Rule-based approaches are faster and more stable for argument type checking and call order verification, making them easier to integrate into CI. On the other hand, "semantic validity of arguments" and "appropriateness of call decisions based on context" are difficult to cover with rules, so LLM-based scoring serves as a useful complement. A practical approach is to combine both, using rule-based checks as a primary filter and LLM scoring as secondary verification.


Q4. Are there standards or regulations related to evaluation design?

NIST AI RMF 1.

Conclusion

Conclusion: Evaluating AI agents requires a design that verifies not only the quality of the final answer, but also the correctness of tool calls and execution trajectories across multiple layers.

This article organized an overview of AI agent evaluation step by step. Here is a recap of the key points.

  • Golden dataset: Design with a three-part set of input, expected tool sequence, and expected result, and continuously update from operational logs
  • Tool call evaluation: Verify the accuracy of selection, arguments, and order, and include unnecessary calls and loops as detection targets
  • Trajectory scoring: Combine step-level scoring with semantic evaluation by an LLM
  • Regression detection: Control the impact of non-determinism through multi-trial flakiness countermeasures and pass/fail threshold design
  • CI integration: Automate release decisions by establishing a regression budget, and ensure quality with a shift-left mindset

Two commonly overlooked failure patterns were highlighted: continuing to use evaluation data that has diverged from production logs, and missing tool misuse by looking only at the final answer. Both can be avoided by being mindful of them from the early stages of evaluation design.

From the perspective of risk management required by the NIST AI RMF and the EU AI Act, mechanisms for continuously verifying agent behavior will become increasingly important going forward. We recommend starting with a small-scale golden set and CI pipeline, and progressively improving evaluation accuracy as operational data accumulates.

Author & Supervisor

Yusuke Ishihara

Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).