
"AI guardrails" is a collective term for the safety mechanisms that inspect and control the inputs and outputs of LLM applications, reducing risks such as prompt injection and hallucination.
As the deployment of generative AI into production accelerates, there have been reports of applications released without adequate safeguards that led to unauthorized manipulation and the spread of misinformation. This article is aimed at engineers and product managers involved in the design, development, and operation of LLM applications, and explains, from an implementation perspective, everything from defining threat models to input/output guards, evaluation set construction, and multi-tenant support. By the end, you will have a design framework that can be applied immediately to your own services.
To design guardrails correctly, you must first clarify "what you are protecting." Starting implementation without taking stock of the risks on both the input and output sides tends to result in excessive blocking or overlooked threats.
The first thing to tackle is organizing the data your LLM application handles and identifying your expected users. Next, map out the pathways through which threats such as prompt injection and hallucination can occur. Skipping this prerequisite work tends to significantly increase implementation costs in later stages.
There are assets to protect on both the input and output sides, and each has a different nature.
Assets to protect on the input side
Assets to protect on the output side
As a practical approach to this inventory, start by drawing a data flow diagram of your application and identifying every touchpoint in the "user input → LLM → external system" chain. Mapping the threats listed above to each touchpoint will naturally reveal priorities. The OWASP LLM Top 10 is a valuable reference as a starting point for this exercise.
The next section moves on to the procedure for assessing the current state of an existing application.
Before beginning guardrail design, it is essential to accurately understand the "current state" of your existing application. Proceeding with implementation while leaving design blind spots unaddressed tends to cause large-scale rework in later stages. Start by taking stock of the current situation and narrowing down your priority issues.
Key points to verify
How to conduct the inventory
Review the existing codebase and identify the content, update frequency, and owner of the system prompt. Also check whether user input is being concatenated into the prompt without validation. It is not uncommon to find multiple prompt injection entry points at this stage.
If logs exist, sample past interactions to extract problematic output patterns. If logs do not exist, prioritize implementing minimal input/output logging first. Designing guardrails without understanding the current state carries a high risk of overlooking the areas that need protection.
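As a starting point, a minimal logging layer can be as small as the sketch below; the record fields and function name are illustrative assumptions rather than a prescribed schema.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("llm_io")

def log_interaction(user_input: str, model_output: str, model: str) -> None:
    """Record one input/output pair as structured JSON for later sampling."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "input": user_input,
        "output": model_output,
    }
    logger.info(json.dumps(record, ensure_ascii=False))
```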
Guardrail design can only be translated into code once "what to protect" has been determined. This section walks through three steps in sequence, from defining the threat model to implementing guards for both inputs and outputs. While each step may appear independent, keep in mind that this is an iterative process in which the design of later steps prompts a revisiting of earlier ones.
The first step in implementing guardrails is creating a threat model that organizes what needs to be protected and the potential attack vectors. Stacking filters without a design plan carries a high risk of overlooking critical gaps.
Start by identifying all input pathways the application receives. Not only direct user input, but also external documents retrieved via RAG and API responses can serve as attack surfaces; these are the classic pathways for indirect prompt injection.
Next, classify threats based on the OWASP LLM Top 10. For most LLM applications, the highest-priority items are prompt injection, sensitive information disclosure, and misinformation arising from hallucination.
Assign each threat a risk score calculated by multiplying "probability of occurrence" by "impact level," and use this to prioritize remediation. Treating all threats equally tends to inflate implementation costs and makes it difficult to ship even a minimum viable product (MVP).
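To make the scoring concrete, here is a minimal sketch; the threat names and the 1-to-5 scales are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Threat:
    name: str
    probability: int  # 1 (rare) to 5 (frequent); assumed scale
    impact: int       # 1 (minor) to 5 (critical); assumed scale

    @property
    def risk_score(self) -> int:
        # Risk = probability of occurrence x impact level, as described above
        return self.probability * self.impact

threats = [
    Threat("Prompt injection via chat input", probability=4, impact=4),
    Threat("Indirect injection via RAG documents", probability=3, impact=4),
    Threat("Sensitive information disclosure", probability=2, impact=5),
]

# Remediate in descending order of risk score
for t in sorted(threats, key=lambda t: t.risk_score, reverse=True):
    print(f"{t.risk_score:>2}  {t.name}")
```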
Finally, diagram the trust boundaries. Clearly identifying which components are trusted and which zones are untrusted makes the input guardrail design in Step 2 and beyond more concrete.
Input guardrails are a defensive layer that detects and blocks harmful requests before user input reaches the LLM. Since they can stop the risks of prompt injection and sensitive information disclosure upstream, they are a high implementation priority.
The main inspection items can be organized into four categories: prompt injection detection, screening for sensitive information such as PII, filtering of harmful or inappropriate content, and detection of out-of-scope requests.
As an implementation approach, while referencing the OWASP LLM Top 10 classifications, it is practical to first address high-risk patterns with rule-based methods and supplement difficult-to-judge cases with a lightweight SLM or dedicated classification model. Using LLM red-teaming frameworks such as PyRIT or Garak to automate boundary testing allows for continuous verification of gaps in the rules.
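The rules-first, model-second layering might look like the sketch below. The regex patterns are deliberately simplistic examples, and the classifier is assumed to be any callable returning an injection probability between 0 and 1; a real filter needs a curated, continuously tested pattern set.

```python
import re
from typing import Callable

# Rule layer: cheap, high-precision patterns for obvious attacks.
# Illustrative examples only; a production list must be curated and tested.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all|previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (the )?system prompt", re.IGNORECASE),
]

def rule_based_check(text: str) -> bool:
    """Return True if the rule layer says the input should be blocked."""
    return any(p.search(text) for p in INJECTION_PATTERNS)

def guard_input(text: str, classify: Callable[[str], float],
                threshold: float = 0.8) -> bool:
    """Two-stage input guard: rules first, then a lightweight classifier.

    `classify` is assumed to return an injection probability, e.g. from
    a small fine-tuned classification model (SLM).
    """
    if rule_based_check(text):
        return False  # blocked cheaply, no model call needed
    return classify(text) < threshold  # allow only below the threshold
```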
One caveat: aiming for zero false positives in input guardrails tends to lead to over-blocking. It is advisable to adjust thresholds incrementally and design an operational cycle from the outset that feeds false-block logs back into the evaluation set. Step 3, covered next, addresses defenses on the output side, after the LLM has generated a response.
Returning LLM-generated responses as-is is just as dangerous as having no input guardrails. Output guardrails serve as the last line of defense, inspecting "what the model returned" and blocking or transforming any problematic content.
Main Inspection Items
Implementation Notes
Output guardrails directly affect latency. Running multiple inspections in series tends to significantly increase response time, so an effective approach is to run lightweight rule-based checks first and offload heavy ML model evaluations to asynchronous log analysis. Additionally, logging the reasons for block decisions makes it easier to identify false-block trends during the regression testing described later.
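Assuming an async application, the split might look like the sketch below: a cheap synchronous rule check stays on the response path, while heavier analysis is scheduled as a background task. The PII pattern and names are illustrative.

```python
import asyncio
import re

# Illustrative only: a phone-number-like pattern standing in for real PII rules
PII_PATTERN = re.compile(r"\b\d{3}-\d{4}-\d{4}\b")

def quick_output_checks(text: str) -> str:
    """Latency-sensitive path: cheap rule-based transforms only."""
    return PII_PATTERN.sub("<REDACTED>", text)

async def deep_output_analysis(text: str) -> None:
    """Heavier ML evaluation, run off the response path over logged output."""
    await asyncio.sleep(0)  # stands in for a real async model call

async def handle_response(raw_output: str) -> str:
    safe = quick_output_checks(raw_output)
    # Schedule heavy analysis without blocking the user-facing response
    asyncio.create_task(deep_output_analysis(raw_output))
    return safe
```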
Guardrails do not end once implemented; a continuous cycle of evaluation and improvement is what maintains their quality. Model version upgrades and the emergence of new attack patterns mean that safety barriers that worked yesterday may not work today. This section organizes the mechanisms needed during the operational phase, from evaluation set design to building a monitoring dashboard.
Implementation is only the starting point. Because behavior changes with every model update or prompt modification, continuous regression testing is essential.
The evaluation set is best structured around three categories: attack cases, normal (benign) cases, and boundary cases that sit close to the blocking threshold.
The evaluation set does not need to be large, but it is important to include each category in equal proportion. Skewing toward attack cases alone makes the false positive rate for normal cases invisible.
Continuous regression testing should be integrated into the CI/CD pipeline: on every change to the model, prompt, or guardrail rules, re-run the full evaluation set and block the release if the pass rate regresses.
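As an illustration, a pytest-based regression check could look like the following; the JSONL evaluation-set format and the guard_input import path are assumptions, not a fixed convention.

```python
import json
import pathlib

import pytest

from myapp.guards import guard_input  # hypothetical module; True means "allow"

# Assumed format, one case per line:
# {"text": "...", "category": "attack|normal|boundary", "expected_allowed": false}
CASES = [json.loads(line)
         for line in pathlib.Path("eval_set.jsonl").read_text().splitlines()
         if line.strip()]

@pytest.mark.parametrize("case", CASES, ids=lambda c: c["category"])
def test_guardrail_regression(case):
    assert guard_input(case["text"]) == case["expected_allowed"], (
        f"{case['category']} case regressed: {case['text'][:60]!r}"
    )
```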
The evaluation set must be treated as a "living document." Since attack techniques evolve, it is advisable to update it regularly while referencing public benchmarks such as HarmBench.
Note that when evaluating output guardrails that include grounding checks, adding a consistency score with RAG search results as a metric makes it easier to quantitatively track the effectiveness of hallucination suppression.
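As a rough illustration of such a metric, the sketch below computes a token-overlap score between the answer and the retrieved chunks; a production system would more likely use embedding similarity or an NLI-based entailment model.

```python
def consistency_score(answer: str, retrieved_chunks: list[str]) -> float:
    """Fraction of answer tokens that also appear in the retrieved context.

    A deliberately crude proxy for groundedness; swap in embedding
    similarity or an entailment model for real hallucination tracking.
    """
    answer_tokens = set(answer.lower().split())
    context_tokens = set(" ".join(retrieved_chunks).lower().split())
    if not answer_tokens:
        return 1.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```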
To sustain the effectiveness of guardrails, it is essential to have a mechanism that visualizes the operational state and can instantly detect anomalies. Even configurations that pass evaluation sets may encounter unexpected patterns in production traffic. Dashboards and alerts function as the "eyes of operations."
Key Metrics to Monitor
Basic Principles of Alert Design
Rather than static absolute thresholds, a common practice is to set thresholds based on the relative rate of change against a 7-day moving average. This makes it less likely that natural fluctuations by day of the week or time of day will trigger false alerts.
Examples of recommended alerts:
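One such alert, sketched below, flags a spike in the daily block rate against its 7-day moving average; the window length and the 1.5x ratio are assumptions to tune per service.

```python
from statistics import mean

def block_rate_alert(daily_block_rates: list[float],
                     spike_ratio: float = 1.5) -> bool:
    """Alert when today's block rate exceeds 1.5x the 7-day moving average.

    `daily_block_rates` holds one value per day, most recent last.
    """
    if len(daily_block_rates) < 8:
        return False  # not enough history for a 7-day baseline
    baseline = mean(daily_block_rates[-8:-1])  # the previous 7 days
    today = daily_block_rates[-1]
    return baseline > 0 and today / baseline >= spike_ratio
```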
Dashboard Design Considerations
Leverage AI observability-compatible tools such as Grafana or Datadog, and structure dashboards to allow drill-down by tenant and by endpoint. Logs must always include a request ID, tenant ID, and guardrail decision reason to facilitate post-hoc root cause analysis. When personal information is present, masking must be applied to the logs.
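A minimal sketch of such a log record follows; masking only email addresses is intentionally simplistic and stands in for a fuller PII detector.

```python
import json
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask_pii(text: str) -> str:
    """Illustrative masking; production use needs a fuller PII detector."""
    return EMAIL.sub("<EMAIL>", text)

def guardrail_log(request_id: str, tenant_id: str,
                  decision: str, reason: str, text: str) -> str:
    """Build one structured log line with the fields needed for analysis."""
    return json.dumps({
        "request_id": request_id,
        "tenant_id": tenant_id,
        "decision": decision,    # e.g. "allow" / "block"
        "reason": reason,        # decision reason for root cause analysis
        "text": mask_pii(text),  # mask personal information before storage
    }, ensure_ascii=False)
```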
The pitfalls most commonly encountered in guardrail implementation tend to fall into two broad categories: "false positives due to insufficient testing" and "UX degradation due to over-guarding." Both are easy to overlook at the design stage and often only surface after the system goes into production. The H3 sections below examine each failure pattern and its concrete countermeasures in depth.
Cases have been reported where rushing guardrails into production results in the erroneous blocking of legitimate user requests. The cause is most often "insufficient test data" and "omission of boundary testing."
Common Patterns Where False Positives Occur
The Root Cause Is Bias in the Evaluation Set
When an evaluation set created during the PoC (proof of concept) phase is skewed toward attack patterns, coverage of normal-use cases becomes insufficient. A typical example is focusing so heavily on Prompt Injection countermeasures that a sufficient variety of everyday user utterances is never collected.
Prioritized Countermeasures
False positives immediately damage the user experience. This issue and the "over-guarding" problem discussed in the next section are two sides of the same coin, so it is important to manage both together from the KPI design stage onward.
Making guardrails too strict can block legitimate user actions and significantly degrade the usability of an application. Cases have been reported where measures intended to "ensure safety" instead drove users away.
Typical Problems Caused by Over-Guarding
These situations tend to occur when the threshold of a prompt firewall is set too low, or when guardrail rules are deployed to production using generic settings without domain context.
An Approach That Preserves UX While Ensuring Safety
It is important to always include the false positive rate for legitimate requests in the evaluation set and to monitor it alongside the safety score. Safety and usability are not a strict trade-off; both can be achieved through threshold tuning and continuous regression testing. In the multi-tenant deployments discussed next, this balance becomes even more complex, as each tenant has a different acceptable level of risk.
Carrying guardrails designed for a single-tenant environment directly into a multi-tenant environment can result in mixed policies across tenants, making unintended information leakage or excessive restrictions more likely to occur. In SaaS-based LLM applications and internal multi-department deployments, the topics, languages, and output formats that each tenant permits often differ. The H3 sections that follow explain in sequence the concrete structure of per-tenant policy design and the operational considerations that should be added for regulated industries such as finance and healthcare.
In multi-tenant environments, permitted operations, prohibited words, and output formats differ from tenant to tenant, making it essential to design guardrails that are isolated and managed on a per-tenant basis. A hierarchical structure that allows tenant-specific policies to override a common guardrail layer is effective.
Core Design Principles
This hierarchical structure enables flexible operation, for example allowing one tenant to output medical terminology while blocking the same terms for another tenant.
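A minimal sketch of policy layering follows. For brevity it collapses the hierarchy to a common base plus tenant overrides; the policy fields and tenant names are illustrative assumptions.

```python
# Hypothetical layered policy: a common base overridden per tenant
BASE_POLICY = {
    "blocked_terms": {"term-x", "term-y"},
    "max_output_tokens": 1024,
}

TENANT_OVERRIDES = {
    "clinic-a": {"allowed_terms": {"term-x"}},  # a medical tenant may use term-x
    "retail-b": {},                             # inherits the base unchanged
}

def effective_policy(tenant_id: str) -> dict:
    """Merge tenant-specific overrides onto the common guardrail layer."""
    override = TENANT_OVERRIDES.get(tenant_id, {})
    policy = dict(BASE_POLICY)
    # Terms a tenant explicitly allows are lifted from the base block list
    policy["blocked_terms"] = (
        BASE_POLICY["blocked_terms"] - override.get("allowed_terms", set())
    )
    return policy
```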
Implementation Considerations
Easily Overlooked Risks
Cases have been reported where a user from Tenant A exploits prompt injection to elicit operations permitted under Tenant B's policy—a "confused deputy problem." As a countermeasure, it is recommended to incorporate signature verification of the tenant ID before guardrail evaluation.
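A minimal sketch of such verification using HMAC follows; key management and field names are assumptions, and in practice the signature would be issued by the authentication layer when the session is established.

```python
import hashlib
import hmac

SIGNING_KEY = b"server-side-secret"  # assumed to come from a secret store

def sign_tenant_id(tenant_id: str) -> str:
    return hmac.new(SIGNING_KEY, tenant_id.encode(), hashlib.sha256).hexdigest()

def verify_tenant_id(tenant_id: str, signature: str) -> bool:
    """Reject the request before guardrail evaluation if the signature fails."""
    return hmac.compare_digest(sign_tenant_id(tenant_id), signature)
```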
Policy change history should be recorded in an AI observability platform and retained as an audit trail, which streamlines regulatory compliance and incident investigation.
In regulated industries such as finance, healthcare, and law, guardrails are directly tied to compliance requirements, and implementations tend to demand stricter design than in general use cases.
Key Considerations for Regulated Industries
Implementation Considerations
In the healthcare domain, misinformation caused by Hallucination is directly linked to patient safety. Incorporating a Grounding Check into the output pipeline and explicitly citing the source documents used in RAG (Retrieval-Augmented Generation) can suppress unfounded assertions.
In the financial domain, the risk of Sensitive Information Disclosure is particularly high. In addition to data isolation between tenants, it is advisable to adopt an architecture based on the principle of Privacy by Isolation.
Because regulatory requirements are subject to revision, it is recommended to establish a governance framework aligned with ISO/IEC 42001 (AI Management System standard) and to set up a regular review cycle.
Q1. Does introducing guardrails increase latency?
When classification models are inserted for both input and output, additional latency of tens to hundreds of milliseconds tends to occur. The impact can be minimized through a multi-stage configuration—such as adopting a lightweight SLM for guarding purposes and passing requests through a rule-based filter first to reject obvious violations early.
Q2. Can prompt injection attacks be completely prevented?
At present, "complete prevention" is difficult, and defense-in-depth is the standard approach. Combining input sanitization, system prompt hardening, and output grounding checks can significantly reduce the Attack Success Rate (ASR). Continuous verification through regular fuzzing and red-teaming is important.
Q3. Are there open-source guardrail libraries available?
Open-source tools such as Garak and PyRIT are publicly available for evaluating LLM vulnerabilities. However, licenses and supported models are subject to change, so please check the official documentation for the latest information.
Q4. Can guardrails be applied to multimodal AI as well?
When handling images and audio in addition to text, it is necessary to prepare a classification model for each modality. Handling multimodal jailbreaks is still largely in the research stage, and a practical first step at present is to establish a system that individually logs inputs and outputs for each modality and detects anomalies.
Q5. Are guardrails useful for regulatory compliance (EU AI Act, ISO/IEC 42001, etc.)?
Guardrail logs and evaluation records can be leveraged as evidence for AI governance. However, meeting regulatory requirements also necessitates additional steps such as risk classification and documentation. Consulting with a specialist is recommended.
Implementing AI guardrails is not a one-time configuration task. Threats evolve and user behavior patterns change, making a continuous improvement cycle essential.
To summarize the key points covered in this article: start from a threat model, implement layered guards on both the input and output sides, and sustain quality through continuous evaluation, monitoring, and per-tenant policy design.
One point deserving particular attention is the risk that an "as long as it's protected, it's fine" mindset leads to UX degradation through over-guarding. While prompt injection countermeasures and hallucination suppression are important, mistakenly blocking legitimate requests will erode user trust. A realistic approach is to remain conscious of the balance between safety and usability, regularly update evaluation sets, and continuously track both the false positive rate and the attack success rate.
Taking into account regulatory developments such as NIST guidelines and the EU AI Act, it is recommended to advance operational design from the perspective of AI TRiSM in conjunction with organization-wide governance. Guardrails are not only a technical safety barrier—they are also the foundation for building trust in LLM applications.

Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).