
"We implemented AI, but headcount never actually decreased" — this is a common complaint heard from DX promotion managers. In most cases, the root cause lies in the order of the workflow. Rather than having humans do the work and AI check it, the correct approach is to have AI process it and humans check it. Companies that get this sequence wrong find that manpower becomes the bottleneck, and as workload increases, costs balloon exponentially.
This article covers everything from the fundamentals of proper HITL (Human-in-the-Loop) design, to the mechanics of confidence-based routing, industry-specific implementation case studies, and a 5-step framework for embedding it within your organization. AI receives the input, humans make the final call — this shift in mindset is the key to preventing business automation from ending as a one-off PoC and instead taking root across the organization.

HITL is a design pattern that incorporates human judgment into AI processing workflows. The key point lies not in "where" humans are placed, but in "what order" AI and humans are involved.
The correct HITL flow is: Input → AI processes → Confidence assessment → Human review → Output. The AI receives the input first, processes it, and then passes the result to a human. The human only needs to review and correct the AI's output.
Reversing this order—designing the system so that "humans receive the input first and have the AI check it"—means every input must be handled by a human. If there are 100 inquiries, 100 instances of human effort are required; if that grows to 1,000, you either need ten times the staff or accept a drop in quality. In other words, it doesn't scale.
On the other hand, if the system is designed so that the AI receives the input, high-confidence cases (70–85% of the total) are handled automatically and humans review only the remaining 15–30%. Even if the workload increases tenfold, the human workload does not increase tenfold. This is the greatest advantage of HITL design.
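This "AI first, humans second" order can be sketched in a few lines of Python. The model call, the threshold, and the queue are illustrative assumptions, not any specific product's API:

```python
# Minimal sketch of the AI-first order: AI processes every input, and a human
# only ever sees the low-confidence minority. All names here are illustrative.

def handle_item(item, ai_model, human_queue, threshold=0.85):
    """AI receives the input first; humans review only low-confidence results."""
    result, confidence = ai_model(item)   # AI processes every incoming item
    if confidence >= threshold:
        return result                     # auto-approved, zero human effort
    human_queue.append((item, result))    # exception: queued for a human
    return None                           # pending human decision

# Toy stand-in for a real model: short inputs are "confident".
def toy_model(item):
    return item.upper(), (0.95 if len(item) < 10 else 0.40)

queue = []
done = [handle_item(x, toy_model, queue) for x in ["hello", "a very long inquiry"]]
# The first item is handled automatically; the second waits in `queue`.
```

The point of the sketch is structural: however many items arrive, the human workload is bounded by the size of `queue`, not by the total input volume.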
At Unimon Co., Ltd., the most common failure pattern observed while supporting client companies with business automation is the "human → AI" design. At one client, when attempting to automate customer support email responses, the workflow had staff members first read the emails and draft replies, then had AI perform grammar checks and honorific corrections. As a result, the staff members' working hours barely changed.
This design was switched to an "AI → human" model. Incoming emails now trigger AI to automatically generate reply drafts, and staff members simply review the content and hit the send button. The number of cases handled per staff member per day increased from 40 to 120, and response quality (customer satisfaction score) also improved from 4.2 to 4.5.
The decisive difference lies in the "location of the bottleneck." In the human → AI model, humans become the bottleneck, and costs increase in proportion to workload. In the AI → human model, AI handles processing in a scalable manner, allowing humans to focus on exception handling.
There are three models based on the degree of AI and human involvement.
Human-in-the-Loop (HITL) is a model in which humans review and approve the results of AI processing. It is used in areas where mistakes cannot be tolerated, such as final judgments in medical diagnosis, loan application approvals, and legal document reviews. A human intervention rate of around 15–30% serves as a general guideline.
Human-on-the-Loop (HOTL) is a model in which AI autonomously carries out processing while humans serve as monitors watching over a dashboard. Humans only intervene when an alert is raised by anomaly detection. Factory quality control and network monitoring fall into this category.
Human-out-of-the-Loop (HOOTL) refers to full automation with virtually no human involvement. Spam filters and routine data transformation are examples of this. However, the range of tasks to which HOOTL can be applied is limited, and it is more realistic for most tasks to start with either HITL or HOTL.
What all models have in common is that it is the AI that receives the input first. No model includes a design in which humans process the input first.

Even if you understand the concept of HITL, actually embedding it within an organization is a different matter entirely. It's not uncommon for projects to show promising results in a PoC, only to fizzle out in production. The root cause lies in the intake process for inputs and the underlying mindset.
When humans are placed at the "entry point" of a business process, processing capacity scales proportionally with headcount. If the volume of work handled by 10 people doubles, 20 people are needed. When factoring in recruitment costs, training costs, and office costs, the cost of scaling grows not linearly, but exponentially.
Placing AI at the entry point changes this structure. AI processing capacity can be expanded through infrastructure scaling, so costs grow far more gently. GPU and cloud costs do rise with workload volume, but the slope never approaches that of labor costs.
Looking at real numbers: in one financial institution's KYC (Know Your Customer) process, the per-case processing cost was approximately $25 when humans were at the entry point. After switching to a HITL design with AI at the entry point, the cost dropped to $4 per case, and processing time was reduced from 3 business days to 4 hours. Humans now review only the high-risk cases flagged by AI—approximately 12% of the total.
Even if HITL is implemented technically, it won't take hold if the organization's mindset remains "humans should handle everything as a matter of course." What the author has repeatedly witnessed at client companies is a pattern where, out of anxiety about "is it really okay to leave this to AI," employees continue to manually review every single case even after deployment. This only adds to the cost of AI adoption without reducing the workload.
There are three mindset shifts required for successful adoption. First, establishing a shared premise that "it's only natural for AI to receive inputs." Second, redefining the human role as "specialists in exception handling." And third, trusting the feedback loop through which "AI accuracy improves through ongoing operation."
Companies that fail to make this shift will inevitably fall behind in competition with companies where AI receives inputs as standard practice—because gaps will emerge across all dimensions: processing speed, cost, and scalability. Conversely, only companies that successfully achieve this mindset shift will be able to make business automation truly "stick."
The HITL AI market is projected to grow from $5.4 billion in 2025 to $16.4 billion by 2030 (CAGR 24.9%). Gartner forecasts that 86% of enterprises will adopt AI agents by 2027, with the majority expected to incorporate some form of HITL design.
Notably, this growth is being driven not by "improvements in AI accuracy" but by "collaborative design between AI and humans." Even if a standalone AI achieves 95% accuracy, business-critical operations often demand 99.8% or higher. It is human oversight that bridges the remaining 4.8% gap, and multiple cases have been reported in which HITL design has raised overall accuracy to 99.8%.
The U.S. Treasury's AI Risk Management Framework, published in February 2026 and comprising 230 control objectives, also positions HITL as a "mandatory requirement for high-risk AI systems." From a regulatory standpoint as well, HITL is increasingly becoming not a "nice-to-have" but a "must-have" design element.

At the core of HITL design is "confidence routing." Having humans check every AI output is inefficient. A mechanism is needed to route between automated processing and human review based on confidence scores.
Confidence-based routing branches processing based on confidence scores (0–1.0) assigned to AI outputs. A typical threshold design uses three tiers.
A confidence score of 0.85 or above triggers "auto-approval". The AI's output becomes the final result as-is. Examples include reading standard fields on invoices and classifying obvious spam emails. Ideally, 70–85% of all cases should fall into this zone.
A confidence score of 0.50–0.85 triggers "human review". The AI presents candidate results, which a human then verifies and corrects. This covers cases requiring context-dependent judgment, such as extracting clauses from contracts or classifying customer inquiry intent.
A confidence score below 0.50 triggers "human processing from scratch". The AI's output is displayed as reference information, but a human makes the final determination independently. This tier handles complaints and unprecedented inquiries.
Thresholds should be adjusted based on the nature of the work. In domains where the cost of misclassification is high—such as healthcare and finance—raise the auto-approval threshold to 0.95 or above. Conversely, for tasks where errors carry low costs, such as internal document classification, the threshold may be lowered to around 0.75.
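The three tiers above reduce to a small routing function. The default thresholds (0.85 and 0.50) come from the text; the tier names and function shape are illustrative assumptions:

```python
# Three-tier confidence routing as described in the text. Thresholds are
# parameters so they can be tightened for high-stakes domains.

def route(confidence, auto_threshold=0.85, review_threshold=0.50):
    if confidence >= auto_threshold:
        return "auto_approve"        # AI output becomes the final result
    if confidence >= review_threshold:
        return "human_review"        # human verifies and corrects the AI draft
    return "human_from_scratch"      # AI output shown only as reference

# Stricter auto-approval for healthcare/finance, as the text suggests:
assert route(0.96, auto_threshold=0.95) == "auto_approve"
assert route(0.90, auto_threshold=0.95) == "human_review"
assert route(0.40) == "human_from_scratch"
```

Keeping the thresholds as arguments rather than constants makes the per-domain tuning the text describes a one-line change.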
One often-overlooked aspect of confidence-based routing is the fallback mechanism for when AI responses are delayed. In production environments, situations where AI fails to respond within a given timeframe—due to model overload or network failures—will inevitably occur.
Timeout thresholds vary depending on the nature of the task. A few seconds for real-time chat support, a few minutes for batch processing, and so on. The key is to design the system so that once a threshold is exceeded, the request is automatically routed to a human queue. For chatbots, this means switching to a message like "Let me connect you with a representative," while for batch processing, it means adding the item to a manual review queue.
This design prevents a situation where "if the AI goes down, the service goes down too." A structure in which humans always function as a backup is the foundation of robust HITL design. Fallback activation rates should be monitored regularly, and if the frequency is too high, improvements to the model or infrastructure reinforcement should be considered.
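A minimal sketch of the timeout fallback, using Python's standard `concurrent.futures`. The `slow_ai` stand-in, the 0.1-second deadline, and the canned fallback message are illustrative assumptions:

```python
# If the AI does not answer within the deadline, the request is routed to a
# human queue instead of failing. Names and timings are illustrative.
import concurrent.futures
import time

def with_fallback(ai_fn, item, human_queue, deadline_s):
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(ai_fn, item)
        try:
            return future.result(timeout=deadline_s)   # AI answered in time
        except concurrent.futures.TimeoutError:
            future.cancel()
            human_queue.append(item)                   # route to a human
            return "Let me connect you with a representative."

def slow_ai(item):
    time.sleep(1.0)        # simulate model overload or a network stall
    return "ai answer"

queue = []
reply = with_fallback(slow_ai, "ticket-1", queue, deadline_s=0.1)
# `reply` is the fallback message and "ticket-1" is now in the human queue.
```

In a real system the deadline would differ per channel, as the text notes: seconds for chat, minutes for batch jobs.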

HITL design can be applied across industries, but implementation patterns vary by sector. Here we introduce case studies from three domains where an "AI receives the input" design has delivered results: finance, manufacturing, and customer support.
J.P. Morgan's contract analysis system COiN (Contract Intelligence) is a prime example of HITL design. AI reads over 12,000 commercial loan agreements annually and automatically extracts key clauses — a task that previously required lawyers and loan officers to spend 360,000 hours per year.
The defining feature of this system is its "learning-reinforced HITL" approach. Corrections made by humans during review are automatically fed back into the model's training data, improving accuracy in subsequent iterations. In the first year of deployment, the human intervention rate stood at 35%, but dropped to 12% three years later.
What is significant is that the system was never designed to achieve perfect accuracy from the outset. By accepting an initial intervention rate of 35% and building in a feedback loop for continuous improvement, practicing lawyers felt no resistance to "collaborating with AI."
At one automotive parts manufacturer, HITL was introduced for visual inspection. Previously, 8 inspectors manually inspected approximately 5,000 parts per day. The defect miss rate was 0.3%, resulting in approximately 5,500 defective parts being shipped annually.
In the new AI-driven workflow, a camera photographs each part and an image recognition AI classifies it as acceptable or defective. Parts with a confidence score of 0.90 or above are automatically passed, while those below 0.90 are reviewed by an inspector on a monitor. The number of inspectors was reduced from 8 to 2, and the miss rate improved from 0.3% to 0.02%.
The biggest challenge during implementation was not resistance from inspectors, but rather the standardization of lighting conditions. Since the AI's classification accuracy is heavily influenced by lighting, the inspection line lighting was unified to LED continuous light and the camera angle was fixed. Preparing the physical environment took 2 months—more time than resolving any technical issues.
In the customer support department of an EC business, AI handles the initial response to inquiry emails. Upon receiving an email, the AI automatically performs inquiry category classification, searches past response history, and generates a reply draft.
Staff members review the AI-generated drafts, make any necessary revisions, and send them. High-urgency cases (return disputes, personal information-related issues, etc.) are automatically routed to a human queue for priority handling.
Before implementation, a team of 5 staff members processed 200 cases per day; after implementation, the same 5 staff members are now able to handle 600 cases. Productivity has tripled, and the average response time has been reduced from 8 hours to 45 minutes. Staff members have shared feedback such as, "We've been freed from routine inquiries and can now focus on customers who are truly in need of help."

There are cases where introducing HITL does not produce the expected results. Here I will introduce four failure patterns I have observed in the field. All of them can be avoided at the design stage.
The most common failure is the reversed order. As mentioned earlier, a design where humans receive the input and AI checks it cannot benefit from scaling.
The workaround is simple: when you diagram your workflow, just check whether "the first arrow points to AI." For emails, route them directly from the inbox to AI. For application forms, have AI analyze them immediately after the PDF is uploaded. Design the system so that AI begins processing before a human opens the file and reads its contents.
The concern of "but what if AI makes a mistake?" is valid, but that is addressed through confidence-based routing. Low-confidence cases are routed to humans, so AI errors are far less likely to reach the final output.
Even after introducing HITL, if the human intervention rate exceeds 50%, the efficiency gains are barely noticeable. There are two main reasons for a high intervention rate.
The first is when the confidence threshold is set too strictly. Setting the auto-approval threshold to 0.98 "just to be safe" will route most cases to human review. A practical approach is to start at around 0.85 and adjust while monitoring the error rate.
The second is when the quality of training data is poor. If the data used for fine-tuning the AI model is biased or insufficient in volume, the confidence scores themselves will be low. In this case, improving the model comes first — adjusting the threshold alone will not solve the problem.
As a benchmark, if the intervention rate has not dropped below 30% three months after implementation, there is likely an issue with either the model or the threshold.
An unexpected pitfall is "automation bias." When AI judgment results are constantly displayed, humans begin to uncritically accept the AI's decisions. Particularly in the gray zone of confidence scores between 0.70 and 0.85, cases that should be carefully reviewed get waved through with the assumption that "the AI said it's OK, so it must be fine."
An effective countermeasure is a design that initially hides the AI's judgment results on the review screen. The human first enters their own assessment, then compares it against the AI's judgment afterward. If the decisions match, the case is approved; if they differ, it is sent for detailed review. This design suppresses automation bias while maintaining the quality of human judgment.
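The blind-review flow reduces to a tiny comparison rule: the human decides without seeing the AI's answer, then the two judgments are compared. The function name and labels are illustrative assumptions:

```python
# Blind review: the human's judgment is entered first; the AI's judgment is
# revealed only at comparison time. Agreement approves, disagreement escalates.

def blind_review(human_judgment, ai_judgment):
    if human_judgment == ai_judgment:
        return "approved"          # two independent judgments agree
    return "detailed_review"       # mismatch: send for careful re-examination

# Disagreements are exactly the cases worth a calibration-session discussion.
```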
Another countermeasure is regular "calibration sessions." Once a month, reviewers gather to discuss cases where their judgments diverged. This corrects inconsistencies in evaluation criteria and maintains quality across the entire team.
The AI Risk Management Framework published by the U.S. Treasury Department in February 2026 defines 230 control objectives. Among these, three requirements relate directly to HITL: audit log retention, explainability of decision rationale, and bias detection.
Implementing HITL without governance makes it impossible to explain after the fact "why the AI made that judgment." Traceability of the decision-making process is essential not only for regulatory compliance in finance and healthcare, but also for internal business process improvement.
As a minimum governance measure, build in a mechanism at the time of implementation to log all of the following: AI input data, output results, confidence scores, human corrections, and final decisions. Retrofitting logging functionality after the fact incurs high costs in design changes.

Here are the 5 steps for implementing HITL in your organization. The key is not to aim for company-wide deployment from the start, but to begin small and improve through feedback loops.
First, identify the "entry point" of the business process you want to automate. The entry point is where input comes in from the outside — such as receiving emails, uploading application forms, submitting inquiry forms, or acquiring sensor data.
Whether AI can be deployed at this entry point determines the feasibility of HITL implementation. If the data at the entry point is structured (CSV, JSON, standardized forms), implementation is straightforward. Even with unstructured data (free-form emails, images of handwritten documents), there is a growing number of cases where the latest LLMs and image recognition technologies can handle it.
As a selection criterion, it is advisable to start with "tasks that have high volume and relatively standardized decision-making patterns." Tasks that involve more than 100 cases per month and have documented decision-making criteria are ideal.
Design confidence thresholds tailored to the characteristics of each business operation. There are three variables to consider: the business impact of misclassification, the acceptable intervention rate, and the amount of training data available.
For operations where misclassification has a high impact (loan screening, medical diagnosis), set the automatic approval threshold at 0.95 or above. For operations where the impact is low (internal document classification, FAQ responses), around 0.80 is sufficient.
Intervention rules should also be documented in advance. Examples include: "Any case with a confidence score below 0.85 must be reviewed by a human," and "Certain categories (such as those involving personal information) require human review regardless of confidence score." To prevent decisions from becoming dependent on individual judgment, document these rules and share them across the entire team.
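Documented rules of this kind map naturally onto a small lookup: a confidence threshold plus category overrides that force review regardless of score. The category names and threshold here are illustrative assumptions:

```python
# Intervention rules as code: certain categories always go to a human,
# everything else is gated on confidence. Category names are illustrative.

FORCE_REVIEW_CATEGORIES = {"personal_information", "complaint"}

def needs_human(confidence, category, threshold=0.85):
    if category in FORCE_REVIEW_CATEGORIES:
        return True                   # rule overrides confidence entirely
    return confidence < threshold

assert needs_human(0.99, "personal_information") is True   # always reviewed
assert needs_human(0.90, "faq") is False                   # auto-approved
```

Keeping the rule set in one shared constant, rather than scattered conditionals, is what prevents the individual-judgment drift the text warns about.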
Before company-wide rollout, conduct a pilot operation in a single department or a single business process. A period of 2 to 3 months is a reasonable guideline.
There are four metrics to verify during the pilot: processing speed (Before/After), accuracy (misclassification rate), intervention rate (the percentage of cases reviewed by a human), and user satisfaction (both from staff and customers). Measure these metrics on a weekly basis and adjust thresholds and rules accordingly.
In the author's experience, it is normal to adjust thresholds 3 to 5 times during the pilot period. It is rare for the initially configured thresholds to carry over into production as-is; the optimal values are found through validation against real data. The mindset of "not aiming for perfection from the start" is equally important here.
The greatest strength of HITL design is that human review results become training data for the AI. Without intentionally designing this feedback loop, the AI's accuracy will remain frozen at the level it was at during initial deployment.
Specifically, corrections made by humans during review are automatically added to the training dataset, and the model is retrained on a regular basis (monthly or quarterly). In J.P. Morgan's case, this feedback loop reduced the intervention rate from 35% to 12% over three years.
To ensure the quality of feedback, reviewers are required to include comments explaining the reasons for their corrections. Rather than simply noting "the AI's extraction was incorrect," they record specific reasons such as "corrected due to a different interpretation of Article 5, Paragraph 3 of the contract." This makes it easier to identify weaknesses in the model.
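A sketch of how corrections and their reason comments might be collected for the next retraining cycle. The record structure is an illustrative assumption:

```python
# Feedback loop: only cases where the human changed the AI's output are
# informative, and each carries the reviewer's reason. Illustrative structure.

def record_correction(dataset, item, ai_output, human_output, reason):
    if ai_output != human_output:        # unchanged cases add nothing new
        dataset.append({
            "input": item,
            "label": human_output,       # the human decision is ground truth
            "ai_output": ai_output,      # what the model originally said
            "reason": reason,            # reviewer's comment, per the text
        })

training_data = []
record_correction(training_data, "contract-17", "clause A", "clause B",
                  "different interpretation of Article 5, Paragraph 3")
# A monthly or quarterly retraining job would consume `training_data`.
```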
Once the pilot confirms the effectiveness, establish a governance framework for production deployment. The four minimum required elements are as follows.
Audit logs: Record AI inputs and outputs, confidence scores, human corrections, and final decisions. Retention periods should align with industry regulations (7 years or more for finance).
Explainability: Establish a mechanism to explain why the AI made a given decision in terms that non-technical staff can understand. Visualizing feature importance and generating text-based reasoning are effective approaches.
Bias monitoring: Regularly check whether AI decisions exhibit any particular bias. Statistically verify differences in decisions based on attributes such as gender, age, and region.
Incident response: Define in advance the response flow for cases where AI misjudgments have a significant impact. Document the escalation path, procedures for identifying the scope of impact, and the process for developing preventive measures.

Here are answers to frequently asked questions about considering the introduction of HITL.
If designed correctly, the cases that humans need to review can be narrowed down to 15–30% of the total. The remaining 70–85% are processed automatically by AI. In other words, whereas before HITL implementation all 100 cases were handled by humans, after implementation only 15–30 cases require review.
However, as mentioned earlier, if the system is designed in the "human → AI" order, these benefits cannot be realized. The premise is a design in which AI receives the input first.
Start with tasks that have a high volume, relatively standardized judgment patterns, and a moderate impact from misclassification. Specific examples include internal expense reimbursement checks, initial classification of inquiry emails, and invoice data entry.
Conversely, tasks to avoid are those with too low a volume (fewer than 20 cases per month results in insufficient training data for the AI) and tasks where misclassification is critical (medical diagnosis and legal judgment should not be initiated without sufficient accuracy verification).
RAG (Retrieval-Augmented Generation) is a technology that searches and references relevant information from external databases when AI generates responses, and it has a complementary relationship with HITL. In practice, a common combination is using RAG to improve AI response accuracy while using HITL to ensure final quality.
For example, consider an internal chatbot that references an internal knowledge base via RAG: RAG retrieves the appropriate documents and the AI generates a response. If confidence is high, the response is delivered as-is; if low, a staff member reviews it before responding. In one reported case, this two-tiered structure achieved 99.8% response accuracy: RAG reduced hallucinations (the generation of factually incorrect responses) by 96%, and HITL let humans cover the remaining uncertain responses.
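The RAG-plus-HITL combination can be sketched as a retrieval step followed by a confidence gate. The toy word-overlap retrieval, the `toy_generate` stand-in, and the knowledge-base contents are all illustrative assumptions, not any specific stack's API:

```python
# RAG + HITL sketch: retrieve context, generate a reply, then gate on
# confidence. Low-confidence replies wait for staff review. Illustrative only.

def answer(query, kb, generate, human_queue, threshold=0.85):
    # Toy retrieval: keep documents that share a word with the query.
    context = [doc for doc in kb
               if any(w in doc for w in query.lower().split())]
    reply, confidence = generate(query, context)
    if confidence >= threshold:
        return reply                     # high confidence: deliver as-is
    human_queue.append((query, reply))   # low confidence: staff reviews first
    return None

def toy_generate(query, context):
    if context:                          # grounded in a document: confident
        return f"Based on policy: {context[0]}", 0.92
    return "I am not sure.", 0.40        # nothing retrieved: low confidence

kb = ["refunds are processed within 7 days", "shipping takes 2 days"]
queue = []
grounded = answer("when are refunds processed?", kb, toy_generate, queue)
ungrounded = answer("what is the warranty on model X?", kb, toy_generate, queue)
# `grounded` is delivered directly; `ungrounded` sits in `queue` for a human.
```

RAG raises the share of answers that clear the gate; HITL catches whatever falls below it.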

What I've consistently conveyed throughout this article is the importance of the order: AI first, humans second. In a design where humans receive the input, costs grow in proportion to workload, negating the benefits of automation. A HITL design in which AI receives the input and humans check the AI's output is the key to embedding business automation within an organization rather than letting it end as a one-off PoC.
The first step is to identify a single "input entry point" in your own business processes and begin a small PoC by placing AI there. Start with a threshold of 0.85 and a pilot period of two months. There is no need to aim for perfection. Once the feedback loop starts turning, accuracy will improve through continued operation.
Companies where AI receives the input, and companies where humans continue to receive the input. The gap in competitiveness that emerges between these two will only widen over time. The mindset shift can begin today.
Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).