
Harness engineering is a practical methodology for structurally preventing the recurrence of mistakes made by AI agents, achieved through the design of documentation, tools, and architectural constraints. The concept spread rapidly after writing by Mitchell Hashimoto and case studies from OpenAI's engineering team emerged around the same time, and it was further amplified by Martin Fowler's references to the approach.
This article explains the concepts, components, and implementation steps of harness engineering for engineers and managers who have already adopted AI agents in their operations or are considering doing so. For those facing the challenge of "agents that work but aren't stable," it presents concrete approaches that can be applied starting tomorrow.
Rather than rewriting prompts and hoping for the best, the goal is to build a system that structurally prevents the same mistake from ever happening again — that is the essence of harness engineering.
To borrow OpenAI's phrasing, this corresponds to the mindset of "adding missing capabilities rather than trying harder." When an agent's output differs from what is expected, adding a cautionary note to the prompt is nothing more than a symptomatic fix. Harness engineering aims to change the environment itself to the point where that failure becomes physically impossible to reproduce.
Prompt engineering is the optimization of "what to convey." By refining the structure of input text, the clarity of instructions, and few-shot examples, one can improve the quality of a model's output. Harness engineering, on the other hand, involves designing "the environment itself in which the agent operates." It is a comprehensive approach that encompasses documentation, tools, linters, tests, and CI/CD pipelines—of which the prompt is just one component.
The two are not mutually exclusive but complementary. A good prompt is part of a good harness, but there are failures that a prompt alone cannot prevent. For example, a rule such as "do not execute DROP TABLE against the production DB" is more reliably enforced by mechanically blocking it with a pre-commit hook than by writing it into a prompt.
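As a minimal sketch of such a mechanical block, a pre-commit hook can scan staged files for destructive statements and abort the commit with a non-zero exit code. The patterns and file handling below are illustrative, not any specific team's hook:

```python
import re
import sys

# Patterns this illustrative hook treats as destructive; extend as needed.
FORBIDDEN = [
    re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
    re.compile(r"\bTRUNCATE\b", re.IGNORECASE),
]


def check_staged_text(path: str, text: str) -> list:
    """Return one violation message per forbidden statement found."""
    violations = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in FORBIDDEN:
            if pattern.search(line):
                violations.append(
                    f"{path}:{lineno}: blocked statement: {line.strip()}"
                )
    return violations


if __name__ == "__main__":
    # A real hook would list staged files via `git diff --cached --name-only`.
    failures = []
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            failures.extend(check_staged_text(path, f.read()))
    if failures:
        print("\n".join(failures))
        sys.exit(1)  # non-zero exit aborts the commit
```

Unlike a prompt instruction, this check fires on every commit regardless of what the agent's context contains.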
Context engineering is a technical domain focused on optimizing the context passed to models—reference information, instructions, and tool definitions. Shopify CEO Tobi Lütke describes it as "the art of arranging all the context so that a task becomes solvable for an LLM." Harness engineering encompasses this while extending the design scope beyond context itself—to constraints enforced by linters, validation through tests, and quality gates via CI/CD. If context engineering defines "what the agent sees," harness engineering specifies "what the agent can and cannot do" at the environment level.
One important caveat: a larger context window does not necessarily improve performance. Research from Chroma has confirmed a tendency for model performance to degrade as context length increases. The accumulation of tool definitions and instructions becomes noise, causing agents to fall into a state of "knowing everything but doing nothing well"—the so-called Dumb Zone. In harness design, structure is prioritized over the volume of context, and Progressive Disclosure—revealing necessary information on demand—is considered an effective approach.
The more capable AI agents become, the greater the risk of them "occasionally breaking down." The improvement of capabilities and control mechanisms must evolve simultaneously.
AI agent demos are impressive. Watching a seamless flow of writing code, running tests, and creating PRs makes you want to adopt one immediately. But once you actually start running it with your team, it works as expected 8 out of 10 times — while the remaining 2 times it deletes unexpected files or introduces changes that ignore the existing architecture.
This "80% success, 20% destruction" state is difficult to resolve through prompt improvements alone. This is because agent failure patterns vary from session to session, making it impractical to exhaustively enumerate prohibited actions in a prompt.
Hashimoto drew on his own agent development experience to share the principle of "building a system every time a mistake occurs." Around the same time, OpenAI's engineering team also published insights gained from their internal agent operations. The reason both parties independently reached the same conclusion is that the practical deployment of agents had crossed a certain threshold, making "quality stabilization" a common bottleneck. When Martin Fowler referenced this, awareness spread throughout the broader software engineering community.
The components of harness engineering are categorized differently depending on the commentator, but can typically be organized into three layers.
OpenAI describes it along four axes—Context / Constraints / Feedback / Cleanup—while other practitioners use axes such as Tools / Docs / Feedback loops, meaning no unified framework exists. Here, we organize and explain the components into three layers that are practical and approachable for real-world use: documentation, tools, and constraints.
Describe the repository's invariants and coding conventions in documents that the agent references. For OpenAI, AGENTS.md serves this role; for Claude Code, CLAUDE.md does.
However, as Martin Fowler points out, the goal is not "maintaining large amounts of markdown." In OpenAI's practice, a roughly 100-line index file is kept separate from structured detailed documents (design documents, specifications, and execution plans). The key is to create a structure that allows the agent to reach the information it needs via the shortest possible path.
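As a concrete illustration of that index-plus-details structure, an index file might look like the following. The file names and rules are hypothetical, not OpenAI's actual layout:

```markdown
# AGENTS.md — index (keep under ~100 lines)

## Invariants (always apply)
- Never delete files under src/lib/
- All DB access goes through the shared client module

## Read on demand
- Architecture and layer rules: docs/architecture.md
- API design conventions: docs/api-style.md
- Release and migration procedure: docs/release.md
```

The agent reads the short index every session and follows the links only when a task actually requires the details—an instance of Progressive Disclosure.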
OpenAI engineer Ryan Lopopolo introduces a concept called "Taste Invariants." The idea is that when you feel "something is off" during a code review and can articulate why, you can write that reason down as a rule. For example, the frustration of "helper functions for concurrent processing are defined redundantly in multiple places" can be converted into a custom ESLint rule that "prohibits defining that function anywhere other than its official location." By transforming subjective preferences into definitive constraints, you inject "taste" into the agent.
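Real ESLint rules are written in JavaScript; as a language-neutral sketch of the same taste invariant, the check "this helper may only be defined in its official location" can be expressed as follows. The helper name and paths are hypothetical:

```python
import re

# Hypothetical taste invariant: `runConcurrent` may only be defined in
# its official location; duplicate definitions elsewhere are violations.
OFFICIAL_LOCATION = "src/lib/concurrency.ts"
DEFINITION = re.compile(r"\bfunction\s+runConcurrent\b")


def find_duplicate_definitions(files: dict) -> list:
    """Given {path: source}, list files that redefine the helper."""
    return [
        path
        for path, source in files.items()
        if path != OFFICIAL_LOCATION and DEFINITION.search(source)
    ]
```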
We also operate a CLAUDE.md, and it is not uncommon for adding a single rule to raise the quality of the entire subsequent session. Conversely, we have experienced a decline in the agent's compliance rate once the rules exceeded around 500 lines, which has reinforced our conviction that "structural design" matters more than "the volume of what is written."
Instead of instructing the agent to "visually verify," pass it tools that automate verification. These include screenshot capture, test runners, linters, MCP servers, and similar utilities.
MCP (Model Context Protocol) is a standard protocol for connecting agents to external tools and data sources, and serves as the foundational technology for building the tool layer of a harness. For example, by connecting Supabase MCP, the agent can directly inspect the DB schema before writing queries, which structurally reduces mistakes such as "SELECT-ing a column that doesn't exist."
Three additional mechanisms are effective in the tool layer. Hooks are deterministic control flows inserted at agent lifecycle events (after a tool call, upon task completion, etc.) that produce no output on success (avoiding context pollution) and surface errors only on failure. Sub-Agents delegate individual tasks to isolated child agents, functioning as a "context firewall" that protects the parent agent's context. Back-Pressure is a mechanism that enhances an agent's self-verification capability—by suppressing output when tests pass and displaying results only on failure, it maximizes context efficiency.
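The "silent on success" behavior shared by Hooks and Back-Pressure can be sketched in a few lines. The runner below is illustrative, not a specific product's hook API:

```python
import subprocess
import sys


def post_edit_hook(command: list) -> str:
    """Run a verification command after an agent edit.

    Back-pressure style: return nothing on success so the agent's
    context stays clean; surface output only on failure.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return ""  # success produces no output -> no context pollution
    return (
        f"verification failed ({result.returncode}):\n"
        f"{result.stdout}{result.stderr}"
    )
```

Wired into a lifecycle event such as "after file edit," a hook like this gives the agent feedback only when something actually went wrong.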
Mechanically verify dependency directions between layers using a custom linter, making it physically impossible for agents to commit changes that break the structure. A guiding principle discernible from OpenAI's practices is to prioritize investment in deterministic linters, tests, and pre-commit hooks over non-deterministic prompt instructions.
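A dependency-direction check of this kind can be sketched as follows, assuming a simple ui → service → db layering. The layer names and import convention are illustrative:

```python
import re

# Allowed direction: ui may import service and db; db may import neither.
LAYER_ORDER = {"ui": 0, "service": 1, "db": 2}
IMPORT = re.compile(r"^from\s+src\.(\w+)", re.MULTILINE)


def check_layering(path: str, source: str) -> list:
    """Flag imports that point 'upward' against the layer order."""
    layer = path.split("/")[1]  # e.g. "src/db/client.py" -> "db"
    errors = []
    for target in IMPORT.findall(source):
        if (
            layer in LAYER_ORDER
            and target in LAYER_ORDER
            and LAYER_ORDER[target] < LAYER_ORDER[layer]
        ):
            errors.append(f"{path}: layer '{layer}' must not import '{target}'")
    return errors
```

Run from a pre-commit hook or CI, a check like this makes a structure-breaking change fail before it ever reaches review.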
This also connects to the concept of guardrails. While AI guardrails are "mechanisms that inspect and control model outputs," the constraint layer of a harness is "a mechanism that defines the permissible scope of agent actions in advance." Restricting behavior proactively is less costly than checking outputs after the fact.
How much improvement can be achieved by refining the harness alone, without changing the model? LangChain's evaluation on Terminal Bench 2.0 offers telling results. Using the same model, improvements made solely at the harness layer—including loop detection middleware (prompting the agent to reconsider its approach when edits to the same file exceed N times), pre-completion checklist middleware (requiring a validation pass before the agent terminates), and staged reasoning intensity switching (high reasoning cost during the planning phase, moderate during implementation, and high again during validation)—raised the score from 52.8% to 66.5%. No modifications were made to the model itself.
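The loop-detection idea can be sketched as a small piece of stateful middleware. The threshold and intervention message are illustrative, not LangChain's actual implementation:

```python
from collections import Counter


class LoopDetector:
    """Nudge the agent to reconsider when it keeps editing the same file."""

    def __init__(self, max_edits: int = 3):
        self.max_edits = max_edits
        self.edit_counts = Counter()

    def on_edit(self, path: str):
        """Record an edit; return an intervention message past the threshold."""
        self.edit_counts[path] += 1
        if self.edit_counts[path] > self.max_edits:
            return (
                f"You have edited {path} {self.edit_counts[path]} times. "
                "Step back and reconsider your approach before editing again."
            )
        return None  # below threshold: stay silent
```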
Anthropic has also proposed a different approach. The Generator-Evaluator pattern separates the agent responsible for implementation from the agent responsible for evaluating quality. Inspired by the structure of GANs (Generative Adversarial Networks), this design addresses the bias that arises when an agent self-evaluates its own output—tending to "praise even mediocre results with confidence"—by making evaluation independent, thereby ensuring objectivity. A sprint contract (explicit acceptance criteria) is defined prior to implementation, and the evaluator agent scores the output against those criteria.
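A minimal sketch of the Generator-Evaluator split, with the model calls stubbed out as plain callables. The function names and criteria format are assumptions, not Anthropic's implementation:

```python
from typing import Callable


def generate_then_evaluate(
    task: str,
    criteria: list,
    generator: Callable,
    evaluator: Callable,
    max_rounds: int = 3,
):
    """Separate roles: one agent implements, another scores the output
    against the sprint contract (acceptance criteria fixed up front)."""
    output = generator(task)
    for _ in range(max_rounds):
        failed = [c for c in criteria if not evaluator(output, c)]
        if not failed:
            return output, True  # all acceptance criteria met
        # Feed the independent evaluator's verdict back to the generator,
        # rather than trusting the generator's self-assessment.
        output = generator(f"{task}\nFix these unmet criteria: {failed}")
    return output, False
```

The key design choice is that the pass/fail signal never comes from the agent that produced the output.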
Harness engineering does not need to start with large-scale infrastructure. It can be built up incrementally, one mistake eliminated at a time.
Every time an agent fails, record in a single line what happened, why it happened, and how it could have been prevented. The format doesn't matter — a GitHub Issue, Notion, or a plain text file will do. What's important is having the judgment standard that "if the same mistake is observed twice, it's a problem that should be prevented through a systematic fix."
Select the most frequent failure from the error logs and add it as a rule to CLAUDE.md (or AGENTS.md).
```markdown
# PROHIBITED: Do not delete files under src/lib/ (prevents destruction of core functionality)
```

This single line significantly reduces the probability of the agent repeating that mistake in all subsequent sessions. Rather than trying to design a perfect documentation system upfront, building it up with a "one mistake, one rule" approach is more effective in practice.
Rules written in documentation that can be mechanically verified should be migrated to pre-commit hooks or custom linters. Documentation rules are "guidelines you want people to follow," whereas linters become "walls that physically block violations." This is the same concept as "shift left" in the DevOps context — moving the timing of problem detection to as early a stage as possible.
As agents become more widespread, the work of engineers is shifting from "writing code" to "creating an environment in which agents can operate correctly."
From OpenAI's practices, we can discern a direction in which engineers' work is shifting toward ensuring the stable operation of system infrastructure (CI/CD and telemetry), scaffolding repository structure and documentation, and scaling human effort through AI. This can also be described as a transition from an era where productivity is measured by lines of code to one where it is measured by the quality of output produced by agents.
Birgitta Böckeler, a co-author alongside Martin Fowler, organizes the ways humans interact with agents into three modes. Outside the Loop (vibe coding) means specifying the desired outcome and handing everything off to the agent, at the risk that inefficient solutions accumulate over time. In the Loop means humans review each output one by one, which creates a bottleneck: human review cannot keep pace with the agent's generation speed. The recommended mode is On the Loop: focusing not on the outputs themselves but on improving the harness. Rather than directly fixing an artifact when something feels off, you modify the harness to raise the overall quality of all subsequent outputs. Whether a team can maintain this discipline determines whether harness engineering takes hold.
This also aligns with the concept of HITL (Human-in-the-Loop). Rather than humans executing every step manually, the role shifts to supervising, verifying, and approving the agent's work. Harness engineering is an investment aimed at reducing this cost of oversight.
If left unaddressed, the quality of AI-generated output gradually degrades. OpenAI addresses this by defining Golden Principles (guiding principles to uphold), conducting regular scoring, and generating automated fix PRs. What is particularly noteworthy is that as a result of introducing harness engineering, OpenAI actually increased the frequency of daily standup meetings. The faster agents generate output, the more significantly architectural patterns can shift between check-ins, so a 30-minute daily sync was established to detect directional drift early. Under this structure, more than 1,500 PRs were processed over five months, and per-person productivity is reported to have improved from an initial quarter-engineer equivalent to 3–10x.
A harness is not something you build once and leave alone — the practical approach is to design it with the assumption that it will need to be rebuilt each time agent capabilities improve. Anthropic's own evaluations have confirmed that the underlying assumptions of a harness can shift as models evolve; for example, upgrading a model (Sonnet → Opus) eliminated the need for context resets that had previously been required.
There are also pitfalls in the introduction of harness engineering itself. Excessive documentation and excessive constraints are typical failure patterns.
When CLAUDE.md exceeds 1,000 lines, it strains the agent's context window and causes important rules to become buried. OpenAI's own practice likewise recommends keeping the index to around 100 lines and moving details into separate files. The key design principle is not to "write everything in one file," but to "create a structure that allows the shortest path to the necessary information."
Adding prohibitions without limit will prevent agents from taking even safe actions. Constraints should be narrowed down to "high-frequency, high-impact mistakes," with everything else left as recommended guidance in documentation. The distinction between constraints and guidance is important — not everything needs to be a hard block.
Prompt engineering is a technique for optimizing "inputs to the model," while harness engineering is a practical methodology for designing "the entire environment in which agents operate." Prompts are a component of the harness, and the two are not mutually exclusive but rather complementary.
If anything, small teams find it easier to get started. You can begin by adding a single rule to CLAUDE.md, with no need for large-scale infrastructure setup. Each time a mistake occurs, you add one mechanism, and the harness grows naturally over time.
As agent capabilities improve, some harnesses will likely become unnecessary. However, greater capability also means a broader range of actions, giving rise to new kinds of mistakes. While the content of harnesses may change, it is hard to imagine that the design philosophy of "preventing mistakes through structure" will ever become obsolete.
In Anthropic's evaluation, building the full harness (Generator-Evaluator pattern + sprint contracts + calibrated evaluation) took approximately 6 hours and $200, while a standalone agent without a harness completed the task in 20 minutes for $9. However, there was a significant difference in the quality of the deliverables. The cost judgment is proportional to the importance of "what you entrust to the agent." A full harness is unnecessary for throwaway scripts, but the investment is worthwhile for continuous generation of production code.

Harness engineering is a practical methodology for stabilizing AI agent quality through "structure" rather than "prayer." It involves incrementally building a knowledge foundation through documentation, automated validation through tools (including Hooks, Sub-Agents, and Back-Pressure), and behavioral constraints through architectural restrictions. LangChain's evaluation shows that harness improvements alone raised scores from 52.8% to 66.5%, demonstrating that quality can change significantly through environment design without changing the model.
The first step is small. Simply record one agent mistake and add a single rule to CLAUDE.md. From there, by mechanizing high-frequency failures with linters and pre-commit hooks, the quality of agent operations across the entire team will steadily improve. What matters is maintaining the "On the Loop" stance recommended by Fowler and others: rather than directly fixing outputs, improve the harness to raise the quality of all subsequent outputs.

Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).