

Even though AI agents have become capable of autonomously handling tasks, there are limits to entrusting everything to a single agent.
Consider the instruction "create a competitive analysis report." Web search, data collection, analysis, writing, chart generation, review, revision: when a single agent handles all of these end-to-end, context window consumption soars, mistakes made along the way propagate to the final output, and it becomes difficult to pinpoint where things went wrong.
Dividing responsibilities works better, just as human teams naturally do: a research role, an analysis role, a writing role, a review role. Each role focuses on its own specialty and passes its deliverables down the line. Multi-agent architecture applies this concept to AI systems.
Role design in multi-agent systems varies widely depending on the use case, but there are four fundamental roles common to many implementations. Let's start with Planner and Executor.
The Planner is the command center that breaks a user's goal down into subtasks and determines their execution order. The Executor actually carries out those subtasks: it handles interactions with external tools such as web searches, API calls, and code execution, and multiple specialized Executors can be prepared for different types of tasks. These two roles are clearly defined and rarely cause confusion during implementation.
The challenge lies in designing the Critic. The Critic evaluates the Executor's output, determines whether quality meets the standard, and returns feedback. It plays a role equivalent to a human code reviewer, but if the prompt does not explicitly state "what constitutes a pass," the system can fall into a situation where the Executor and Critic loop through revisions indefinitely. If the Critic is introduced with vague evaluation criteria, the Executor keeps asking "How about this?" while the Critic keeps responding "Still not good enough," consuming nothing but tokens. It is an iron rule to write the Critic's evaluation criteria into the prompt as concrete pass conditions (e.g., "the code passes the tests," "includes three or more supporting reasons").
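One way to make "what constitutes a pass" explicit is to bake the pass conditions into the Critic's prompt and require a structured verdict back. The sketch below is illustrative: the condition strings, prompt wording, and `Verdict` type are assumptions, not a fixed API.

```python
# Minimal sketch: a Critic prompt with explicit pass conditions and a
# structured verdict, so the Executor<->Critic loop has a clear exit.
import json
from dataclasses import dataclass

# Illustrative pass conditions; adapt them to your task.
PASS_CONDITIONS = [
    "the code passes the tests",
    "includes three or more supporting reasons",
]

CRITIC_PROMPT = (
    "Evaluate the Executor's output. Set passed=true ONLY if ALL of "
    "these conditions hold:\n"
    + "\n".join(f"- {c}" for c in PASS_CONDITIONS)
    + '\nReply as JSON: {"passed": bool, "failed_conditions": [...], "feedback": str}'
)

@dataclass
class Verdict:
    passed: bool
    failed_conditions: list
    feedback: str

def parse_verdict(raw: str) -> Verdict:
    """Parse the Critic's JSON reply into a structured verdict."""
    data = json.loads(raw)
    return Verdict(
        passed=bool(data["passed"]),
        failed_conditions=list(data.get("failed_conditions", [])),
        feedback=str(data.get("feedback", "")),
    )
```

Because the verdict names the failing conditions, the Executor receives actionable feedback rather than an open-ended "still not good enough."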
The Verifier is easily confused with the Critic, but the perspective differs. While the Critic examines the quality of individual subtasks, the Verifier confirms whether the final deliverable meets the original goal. For an analytical report, it plays the role of verifying overall consistency—such as "Does it answer the client's question?" and "Are the conclusions and supporting evidence coherent?"

Depending on how roles are combined, several established patterns exist. If you're unsure which to choose, starting with the pipeline pattern is recommended.
The pipeline pattern is the simplest configuration: data flows unidirectionally through Planner → Executor → Critic → Verifier. It is well suited to tasks with clearly ordered steps, such as report generation or document creation, and is easy to implement and straightforward to debug. However, if a major change of direction becomes necessary midway, there is a risk of having to start over from the beginning.
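The unidirectional flow can be sketched in a few lines. The stage functions here are toy stand-ins for LLM-backed agents, just to show how data moves one way through the four roles.

```python
# Minimal sketch of the pipeline pattern: each stage is a plain function
# (standing in for an LLM-backed agent) and data flows in one direction.
def run_pipeline(goal, planner, executor, critic, verifier):
    plan = planner(goal)                      # goal -> ordered subtasks
    results = [executor(task) for task in plan]
    reviewed = [critic(r) for r in results]   # per-subtask quality check
    return verifier(goal, reviewed)           # final check against the goal

# Toy stand-ins to show the data flow:
report = run_pipeline(
    "competitive analysis",
    planner=lambda g: [f"research {g}", f"write up {g}"],
    executor=lambda t: f"done: {t}",
    critic=lambda r: r,
    verifier=lambda g, rs: " | ".join(rs),
)
```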
The Critic-in-the-loop pattern is a configuration in which the Executor and Critic iterate until a quality standard is met. It is commonly used in code generation: the Executor writes code, the Critic reviews it, and if issues are found, the Executor makes corrections. Convergence typically occurs within 3–5 iterations, but setting a maximum iteration limit is essential to prevent infinite loops.
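The iteration cap can be enforced outside the agents themselves, so no prompt wording can circumvent it. A minimal sketch, with the cap value and escalation behavior as assumptions:

```python
# Minimal sketch of the Critic-in-the-loop pattern with a hard iteration
# cap to prevent infinite Executor<->Critic cycles.
MAX_ITERATIONS = 5  # convergence typically happens within 3-5 rounds

def refine(task, executor, critic, max_iterations=MAX_ITERATIONS):
    feedback = None
    for _ in range(max_iterations):
        output = executor(task, feedback)     # redo with Critic feedback
        passed, feedback = critic(output)
        if passed:
            return output
    # Cap reached: escalate instead of looping forever.
    raise RuntimeError(f"no convergence after {max_iterations} iterations")
```

Raising on cap-hit is one choice; returning the best attempt so far with a warning is another reasonable design.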
The hierarchical pattern is a configuration in which a top-level Planner manages multiple lower-level agent teams. For large-scale projects, such as developing an entire application, each team (frontend, backend, testing) can be given its own Planner + Executor pair, with an overarching Planner coordinating the whole. Claude Code's Agent Team feature and CrewAI adopt this pattern.
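The delegation structure can be sketched as a top-level planner that maps the goal to per-team sub-goals, with each team running its own planner + executor pair. All names and stand-in lambdas below are hypothetical:

```python
# Minimal sketch of the hierarchical pattern: a top-level planner
# delegates whole sub-goals to team-level planner+executor pairs.
def run_team(sub_goal, team_planner, team_executor):
    return [team_executor(t) for t in team_planner(sub_goal)]

def run_project(goal, top_planner, teams):
    # top_planner maps the goal to (team_name, sub_goal) assignments
    return {
        name: run_team(sub_goal, *teams[name])
        for name, sub_goal in top_planner(goal)
    }

results = run_project(
    "build app",
    top_planner=lambda g: [("frontend", f"{g}: UI"), ("backend", f"{g}: API")],
    teams={
        "frontend": (lambda s: [s], lambda t: f"done {t}"),
        "backend": (lambda s: [s], lambda t: f"done {t}"),
    },
)
```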
The parallel (ensemble) pattern is a configuration in which multiple agents independently generate answers to the same problem, which are then compared and consolidated. It is effective for problems without a single definitive answer, such as strategy formulation or creative tasks. The final answer is determined through methods such as majority vote, scoring, or integration by a separate agent. However, since costs run 2–3 times higher than other patterns, this approach is best avoided unless there is a clear justification for the accuracy gains.
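Of the consolidation methods, majority vote is the simplest to sketch. The agents here are toy stand-ins; real agents would be independent LLM calls:

```python
# Minimal sketch of the parallel (ensemble) pattern with majority vote.
from collections import Counter

def ensemble(question, agents):
    answers = [agent(question) for agent in agents]  # independent runs
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

best = ensemble("pick a strategy", [lambda q: "A", lambda q: "B", lambda q: "A"])
```

Note the cost implication is visible in the code: one task triggers `len(agents)` model calls before consolidation even starts.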

In the previous section, I introduced four design patterns, but regardless of which pattern you choose, deferring the design of inter-agent communication will inevitably cause problems down the line. It's an easily overlooked point, but it's the most critical one.
When agents communicate using free-form text, misinterpretations arise. When a Planner instructs "retrieve the user list," the Executor is left wondering whether to fetch all records or only active users, and which fields to include. This ambiguity leads to inconsistent quality.
In practice, it is recommended to define structured schemas (such as JSON Schema) for inter-agent messages. Google's A2A (Agent-to-Agent) protocol is being developed as a standard specification for communication between agents from different vendors. Anthropic's MCP (Model Context Protocol) standardizes the connection between agents and external tools, and integration with major IDEs and LLM clients is already well underway. A2A, on the other hand, is still at the stage where early adopters are experimenting with it, and production deployments are yet to come.
That said, for smaller-scale systems, there is often no need to introduce such protocols at all — simply defining input and output types with JSON Schema is sufficient in many cases. What matters is making explicit "what each agent expects from the other," not the formality of the tools you use.
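As a sketch of what "defining input and output types" means in practice: instead of the free-form "retrieve the user list," the Planner sends a message that names the action, filters, and fields explicitly. The schema shape and the tiny validator below are illustrative; a real system might validate with a library such as jsonschema or pydantic.

```python
# Minimal sketch of a typed inter-agent message (JSON Schema style).
USER_QUERY_SCHEMA = {
    "type": "object",
    "required": ["action", "filters", "fields"],
    "properties": {
        "action": {"type": "string"},
        "filters": {"type": "object"},
        "fields": {"type": "array"},
    },
}

def validate_message(msg: dict, schema: dict) -> bool:
    """Tiny stand-in validator: checks only that required keys exist."""
    return all(key in msg for key in schema["required"])

# Instead of the ambiguous "retrieve the user list", the Planner sends:
msg = {
    "action": "list_users",
    "filters": {"status": "active"},
    "fields": ["id", "email", "last_login"],
}
```

The point is not the validator but the contract: the Executor no longer has to guess which records or fields were meant.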

Multi-agent systems are powerful, but the first thing I want to convey is this: don't build a full configuration from the start. It's not uncommon to see teams assemble a setup like Planner + Executor × 5 + Critic + Verifier right out of the gate, only to collapse under the weight of maintenance costs. Start by running tasks with a single agent, identify the steps where quality falls short, and only then split those specific steps into separate roles. This is the approach least likely to fail.
With that said, here are three challenges you'll face once you move into production.
The first is cost amplification. The more agents you add, the more LLM API calls you make. Running a Critic-in-the-loop configuration with three agents over five iterations can cost hundreds of yen per task when using GPT-4-class models. Estimates should be based on the worst case (maximum iterations × number of agents), multiplied by the expected monthly task volume to get a realistic sense of your budget.
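The worst-case arithmetic is simple enough to encode directly. All numbers below are illustrative assumptions (including the per-call cost), not measured figures:

```python
# Back-of-the-envelope worst-case cost estimate for a multi-agent setup.
def worst_case_monthly_cost(cost_per_call, agents, max_iterations, tasks_per_month):
    calls_per_task = agents * max_iterations  # worst case: every agent, every round
    return cost_per_call * calls_per_task * tasks_per_month

# e.g. 3 agents, 5 max iterations, an assumed 20 yen per GPT-4-class
# call, 1000 tasks/month:
estimate = worst_case_monthly_cost(20, 3, 5, 1000)  # 300,000 yen/month
```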
The second is the difficulty of debugging. When something goes wrong, it's hard to pinpoint which agent and which step introduced the error. It's essential to record each agent's inputs and outputs as structured logs. In particular, when the Critic has judged something as "OK" but the final output is still wrong, tracing back through the Critic's input logs tends to lead you to the root cause. Observability tools like LangSmith or Braintrust can visualize the entire trace.
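A minimal sketch of what "structured logs" can look like: one JSON record per agent step, capturing both input and output so a wrong "OK" verdict can be traced back to what the Critic actually saw. The field names are assumptions.

```python
# Minimal sketch: one structured JSON log record per agent step.
import json
import time

def log_step(agent, step, input_data, output_data, log=print):
    record = {
        "ts": time.time(),      # when the step ran
        "agent": agent,         # which agent produced this record
        "step": step,           # position in the overall run
        "input": input_data,    # exactly what the agent received
        "output": output_data,  # exactly what the agent returned
    }
    log(json.dumps(record, ensure_ascii=False))
    return record
```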
The third is preventing runaway behavior. There's a risk that the Planner generates subtasks without bound, or that Executors keep handing tasks off to one another in an infinite loop. No matter how small the system, I recommend setting three guardrails at a minimum: timeouts, maximum step counts, and budget caps.
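All three guardrails can live in one small object checked before every agent step, outside the agents' control. The limits and class shape below are illustrative defaults, not recommendations:

```python
# Minimal sketch of the three runaway guardrails: timeout, max step
# count, and budget cap, enforced outside the agents themselves.
import time

class GuardrailExceeded(Exception):
    pass

class Guardrails:
    def __init__(self, timeout_s=300, max_steps=50, budget=1000):
        self.deadline = time.monotonic() + timeout_s
        self.max_steps = max_steps
        self.budget = budget
        self.steps = 0
        self.spent = 0

    def check(self, cost=0):
        """Call before every agent step; raises when any limit is hit."""
        self.steps += 1
        self.spent += cost
        if time.monotonic() > self.deadline:
            raise GuardrailExceeded("timeout")
        if self.steps > self.max_steps:
            raise GuardrailExceeded("max steps")
        if self.spent > self.budget:
            raise GuardrailExceeded("budget cap")
```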

As a business scenario where multi-agent systems prove effective, let's take a deeper look at customer support automation.
A typical configuration follows a four-stage structure: Triage Agent (query classification) → Research Agent (knowledge base search) → Draft Agent (response drafting) → Review Agent (quality check). Human operators handle only the final approval. While this looks clean and straightforward on paper, in practice the first stumbling block is the Review Agent. If its evaluation criteria are too lenient, responses generated by the Draft Agent—including those containing factual errors—will pass through unchecked. Even an obvious rule like "customer-facing responses must not contain factual errors" won't function unless specific checklist items are explicitly written into the Review Agent's prompt.
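Turning the abstract rule into checklist items might look like the sketch below. The items and prompt wording are examples to adapt to your own support domain, not a recommended set:

```python
# Illustrative checklist embedded in the Review Agent's prompt, turning
# "no factual errors" into concrete, checkable items.
REVIEW_CHECKLIST = [
    "Every factual claim is backed by a cited knowledge-base article",
    "No promises about refunds or deadlines beyond documented policy",
    "The tone matches the company style guide",
    "The customer's actual question is answered",
]

REVIEW_PROMPT = (
    "Check the draft reply against EVERY item below. "
    "Reject the draft if any item fails, and name the failing items.\n"
    + "\n".join(f"{i}. {item}" for i, item in enumerate(REVIEW_CHECKLIST, 1))
)
```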
This configuration has also been applied to software development. Following the flow of Planning Agent → Coding Agent → Testing Agent → Review Agent, Claude Code and Devin are practical examples of this pattern. In market research reports as well, a Data Collection → Analysis → Writing → Fact-Check pipeline can compress a process that would take days by hand into just a few hours.
In all of these cases, a fully autonomous multi-agent system is technically feasible to build. However, in situations involving business judgment, the responses an agent determines to be optimal often ignore organizational context, established conventions, and stakeholder sensitivities. "Technically correct" and "appropriate for business" are two different things. That is precisely why embedding human review points into the system remains, at this stage, the most pragmatic operational approach.
Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).
Chi
Majored in Information Science at the National University of Laos, where he contributed to the development of statistical software, building a practical foundation in data analysis and programming. He began his career in web and application development in 2021, and from 2023 onward gained extensive hands-on experience across both frontend and backend domains. At our company, he is responsible for the design and development of AI-powered web services, and is involved in projects that integrate natural language processing (NLP), machine learning, and generative AI and large language models (LLMs) into business systems. He has a voracious appetite for keeping up with the latest technologies and places great value on moving swiftly from technical validation to production implementation.