
An AI Gateway is a middleware layer that consolidates calls to multiple LLM providers—such as OpenAI, Anthropic, and Google—and implements authentication, cost management, failover, and audit logging in a unified manner. This article organizes the problems an AI Gateway solves, the selection criteria for the major products (Cloudflare AI Gateway / LiteLLM / Portkey / Helicone), and the steps for integrating one into existing systems incrementally—aimed at CTOs and SREs venturing into multi-LLM operations. In our own experience building AI infrastructure for Japanese companies expanding into Thailand, we have seen firsthand how much the presence or absence of a Gateway affects operational burden and cost visibility.
As more companies adopt multiple LLM providers for different use cases, a layer known as the "AI Gateway" is becoming established as a common foundation. This section first clarifies what an AI Gateway is and why it has come to be needed.
An AI Gateway is middleware that aggregates API requests from applications to LLM providers and centralizes authentication, routing, logging, and guardrails. The application side calls only the Gateway's single endpoint, and the Gateway handles routing behind the scenes to OpenAI, Anthropic, Google, various OSS models, and others.
This concept itself is not new. In the world of microservices, the "API Gateway" is already well established, with Kong and AWS API Gateway handling authentication, rate limiting, and logging. An AI Gateway is simply the LLM-specific version of this—the only difference being that the target is "calls to external LLMs."
However, there are two factors unique to LLMs. The first is token-based billing, which shifts the granularity of cost management from the number of API calls to the number of tokens. The second is that performance, cost, and latency vary significantly across models, creating a strong demand for use-case-specific routing. These differences are what make AI Gateway a distinct category.
Until a few years ago, most teams found that a single provider—OpenAI—was sufficient. However, the pattern of using multiple models in combination has spread rapidly, for three main reasons: the best-performing model now differs by use case, dependence on a single provider has become an availability risk, and keeping costs under control requires matching model price to task difficulty.
Implementing calls to multiple models can start simply enough—separating environment variables and routing with conditional statements. However, when you later try to add failover, cost aggregation, and guardrails, the same code ends up scattered across every application. A Gateway is the container that confines that sprawl to a single place. In our experience, once three or more models are being used in combination, the cost of implementing a Gateway is almost always recovered.
The features that make up an AI Gateway can generally be divided into three layers: unified interface and failover, cost visibility and rate limiting, and guardrails and observability logging. The following sections examine which real-world challenges each of these addresses.
The most fundamental role of a Gateway is to abstract away API differences between providers. The dominant implementation places an OpenAI-compatible request format at the entry point and internally converts calls to Anthropic, Google, Bedrock, and others. Application-side code only needs to know the Gateway's endpoint, and model switching is completed entirely through Gateway-side configuration changes.
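As a minimal sketch of what this looks like from the application side, the snippet below uses the standard OpenAI Python SDK and points it at a hypothetical Gateway endpoint; the base URL, the virtual key, and the model alias default-chat are all placeholder values defined on the Gateway side, not real provider settings.

```python
from openai import OpenAI

# The application talks only to the Gateway endpoint; which provider and
# model actually serve the request is decided by Gateway configuration.
client = OpenAI(
    base_url="https://gateway.example.internal/v1",  # hypothetical Gateway URL, not a provider
    api_key="GATEWAY_VIRTUAL_KEY",                   # a key issued by the Gateway, not by OpenAI
)

resp = client.chat.completions.create(
    model="default-chat",  # an alias the Gateway maps to a concrete provider/model
    messages=[{"role": "user", "content": "Summarize this ticket in one sentence."}],
)
print(resp.choices[0].message.content)
```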
On top of that, failover is critical. In production, cases where OpenAI returns 5xx errors do occur in practice. By having the Gateway handle retries and switching to alternative providers, the application side only needs to handle the case where the Gateway itself ultimately fails.
Routing strategies include variations such as: (a) primary + fallback, (b) escalation from the lowest-cost model, and (c) traffic splitting for A/B testing. If you do not decide during design which model is primary and under what conditions a switch is triggered, the Gateway simply becomes another single point of failure.
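For intuition, here is a simplified sketch of the retry-and-fallback loop that strategy (a) implies; it is exactly the kind of logic a Gateway centralizes so that it does not have to be rewritten in every application. PROVIDERS, call_provider(), and RetryableError are hypothetical stand-ins for real provider clients and their transient error types.

```python
import time

PROVIDERS = ["openai/gpt-4o", "anthropic/claude-sonnet"]  # primary first, fallback second


class RetryableError(Exception):
    """Stands in for 5xx responses and timeouts from a provider."""


def call_provider(model: str, prompt: str) -> str:
    raise NotImplementedError("replace with a real provider call")


def complete_with_failover(prompt: str, retries: int = 2) -> str:
    last_error: Exception | None = None
    for model in PROVIDERS:
        for attempt in range(retries):
            try:
                return call_provider(model, prompt)
            except RetryableError as exc:
                last_error = exc
                time.sleep(2 ** attempt)  # back off before retrying the same model
        # this model exhausted its retries; move on to the next provider
    raise RuntimeError("all providers failed") from last_error
```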
One surprisingly difficult challenge in LLM operations is understanding "who used how many tokens, in which application." Each provider's console only shows monthly aggregates, making it impossible to break down usage by internal application or department.
Since an AI Gateway sits in the path of every call, it can tag each request with identifiers such as app_id and user_id, and aggregate token counts and costs. This makes it practical to identify applications causing cost overruns, allocate charges to billing departments, and apply rate limits to free-tier users.
In practice, the value lies in preventing incidents such as "a particular internal tool was consuming ten times the expected tokens" or "a high-cost model that was being evaluated for a sales demo was accidentally left in production." Designing rate limit granularity as a combination of "per-user + per-IP + per-plan" also makes it easier to prevent free-plan users from exhausting the budget.
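A sketch of how this tagging looks from the calling side, again using the OpenAI SDK against a hypothetical Gateway endpoint: the header names x-app-id and x-user-id are illustrative only, since each Gateway product defines its own metadata convention.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.example.internal/v1",  # hypothetical Gateway URL
    api_key="GATEWAY_VIRTUAL_KEY",
)

# Per-request tags let the Gateway aggregate tokens and cost by application
# and end user; header names here are placeholders, check your Gateway's docs.
resp = client.chat.completions.create(
    model="default-chat",
    messages=[{"role": "user", "content": "Draft a reply to this support email."}],
    extra_headers={
        "x-app-id": "support-inbox",  # which internal application made the call
        "x-user-id": "user_1234",     # which end user, for per-user rate limits
    },
)
```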
Because an AI Gateway passes through both requests and responses, it is a logical insertion point for guardrails. Typical examples include masking personally identifiable information (PII) in prompts, detecting signs of prompt injection, and filtering prohibited content out of model outputs.
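As an illustration of the simplest case, the sketch below shows a regex-based PII mask of the kind a Gateway can apply to request text before it leaves the network. Real deployments use far more robust detectors; the patterns here are deliberately naive.

```python
import re

# Deliberately simple patterns for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b0\d{1,4}-\d{1,4}-\d{3,4}\b")  # Japanese-style phone numbers


def mask_pii(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


print(mask_pii("Contact taro@example.co.jp or 03-1234-5678 for details."))
# -> "Contact [EMAIL] or [PHONE] for details."
```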
In addition, the ability to retain all requests and responses as logs is another value of the Gateway. Scenarios where you need to "look back at what was output at that time" will inevitably arise — for replaying incidents, tracking quality degradation, and presenting evidence during internal audits. Centralizing logging on the Gateway side results in fewer gaps than implementing it individually on the application side.
However, since logs may contain personal information, retention periods, access controls, and encryption must be decided before you begin.
Options in the AI Gateway space fall broadly into two categories: fully managed SaaS and OSS self-hosted solutions. The following sections outline the characteristics of representative products in turn.
Cloudflare AI Gateway is a managed AI Gateway that runs on Cloudflare's edge network. Applications simply call the endpoint provided by Cloudflare, and logging, caching, rate limiting, and cost aggregation are available out of the box.
Its strength is the extremely low barrier to entry. With just a Cloudflare account, you can create a route in minutes from the settings screen, and it can proxy major providers directly, including Workers AI, OpenAI, Anthropic, Replicate, and Hugging Face Inference.
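To illustrate the low barrier to entry, the sketch below proxies an ordinary OpenAI call through Cloudflare AI Gateway. ACCOUNT_ID and GATEWAY_ID are placeholders for your own values, and the URL pattern follows Cloudflare's documentation at the time of writing, so verify it against the current docs before use.

```python
from openai import OpenAI

ACCOUNT_ID = "your-account-id"   # placeholder
GATEWAY_ID = "your-gateway-id"   # placeholder

client = OpenAI(
    api_key="sk-...",  # still your OpenAI key; Cloudflare only proxies, logs, and caches
    base_url=f"https://gateway.ai.cloudflare.com/v1/{ACCOUNT_ID}/{GATEWAY_ID}/openai",
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello from behind the gateway"}],
)
```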
On the other hand, because it runs on the edge, there are constraints on arbitrarily inserting custom guardrail processing. If you need complex policies or support for custom providers, it is more practical to combine it with the OSS options discussed later or to consider an alternative SaaS. Before going to production, always verify the latest pricing and the list of supported providers in the official documentation.
LiteLLM is an MIT-licensed OSS library/proxy designed to aggregate over 100 LLM providers into a single OpenAI-compatible interface. It can be embedded directly into an application as a Python library or launched as a standalone proxy server.
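As a sketch of the library-embedded usage, the snippet below calls two providers through the same completion() interface; the model identifiers are examples and change frequently, so verify them against LiteLLM's current provider list. Provider API keys are read from environment variables such as OPENAI_API_KEY and ANTHROPIC_API_KEY.

```python
from litellm import completion

messages = [{"role": "user", "content": "Classify this ticket: 'invoice is wrong'"}]

# Same interface, different providers; only the model string changes.
openai_resp = completion(model="gpt-4o-mini", messages=messages)
claude_resp = completion(model="anthropic/claude-3-5-sonnet-20240620", messages=messages)

print(openai_resp.choices[0].message.content)
```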
When self-hosted, it can be run on Docker/Kubernetes with user management, team management, budget limits, and per-model routing configured. The community is active, and support for new models is added quickly.
For teams that want to operate a Gateway in an on-premises or isolated VPC environment, it tends to be the first candidate considered. On the other hand, unlike SaaS solutions, monitoring, updates, and scaling must be handled in-house, which presupposes a dedicated SRE team. Before adopting it in production, it is recommended to verify the list of supported providers and license terms in the official repository.
Portkey is an AI Gateway product that offers both a managed SaaS and an OSS proxy, providing integrated observability logging, guardrails, prompt template management, and caching. It is characterized by the ease of adoption—requiring only an SDK swap—and its rich management features aimed at enterprise use.
Helicone is an OSS/SaaS solution focused on LLM observability, offering request logging, per-user usage tracking, cost aggregation, and caching. It supports both an SDK-swap approach and a proxy approach that requires only switching the base URL.
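To illustrate the base-URL-swap style, the sketch below points the OpenAI SDK at Helicone's proxy endpoint. The URL and the Helicone-Auth header follow the pattern in Helicone's documentation at the time of writing; treat them as values to confirm in the official docs rather than a guaranteed contract.

```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",  # your OpenAI key as usual
    base_url="https://oai.helicone.ai/v1",  # proxy endpoint per Helicone docs; verify before use
    default_headers={"Helicone-Auth": "Bearer <HELICONE_API_KEY>"},
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
```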
The two products share considerable feature overlap, but Portkey leans toward guardrails and enterprise management, while Helicone leans toward observability logging and cost analysis. For organizations already using multiple providers that want to start with observability, Helicone is the easier choice to justify; for those that want integrated management including guardrails, Portkey is the more natural fit.
When an AI Gateway is proposed, two questions come up frequently: "Isn't it just a reverse proxy?" and "Does adding one automatically reduce costs?" Both are misconceptions that can distort on-the-ground decision-making, so let's address each in turn.
Technically speaking, an AI Gateway is a type of reverse proxy. However, when asked how it differs from putting nginx in the middle, the answer comes down to whether the layer can incorporate processing that understands the LLM context.
A standard reverse proxy simply relays byte streams—it does not count tokens, detect signs of prompt injection, or mask PII. The essential difference is that an AI Gateway understands the JSON structure of LLM APIs and can apply LLM-specific processing to the text of requests and responses.
As a result, when an SRE concludes that "a load balancer is sufficient" and tries to substitute an existing nginx/Envoy setup, cost aggregation and guardrails are left out, and the logic ends up being written back into the application layer anyway. It is better to think of an AI Gateway as a "control plane purpose-built for LLMs"—doing so leads to fewer costly reversals down the line.
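To make the contrast concrete, here is a sketch of one piece of LLM-aware processing that a byte-relaying proxy never performs: parsing the OpenAI-style response body to attribute token usage and estimated cost to the calling application. The rate table is hypothetical, not real pricing.

```python
import json

PRICE_PER_1K_TOKENS = {"gpt-4o-mini": 0.0006}  # hypothetical rates for illustration


def record_usage(app_id: str, response_body: bytes) -> dict:
    data = json.loads(response_body)
    usage = data.get("usage", {})            # OpenAI-style usage block in the response
    model = data.get("model", "unknown")
    total_tokens = usage.get("total_tokens", 0)
    cost = total_tokens / 1000 * PRICE_PER_1K_TOKENS.get(model, 0.0)
    return {
        "app_id": app_id,
        "model": model,
        "total_tokens": total_tokens,
        "estimated_cost_usd": cost,
    }
```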
The claim that "adding a Gateway will reduce costs" is commonly heard in vendor sales contexts. In reality, it is not that straightforward.
A Gateway itself is not a device that "reduces" costs—it is one that makes them "visible." Whether costs actually decrease depends on whether, after achieving that visibility, you implement: (a) routing to cheaper models by use case, (b) enabling prompt caching and semantic caching, and (c) suppressing unnecessary high-frequency calls.
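As one concrete example of lever (b), the sketch below shows exact-match response caching keyed by a prompt hash. A plain in-memory dict stands in for what would be a shared store with a TTL in production, and call_llm() is a hypothetical stub for the actual upstream call through the Gateway.

```python
import hashlib

_cache: dict[str, str] = {}  # production would use a shared store (e.g. Redis) with a TTL


def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with the actual call through the Gateway")


def cached_completion(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in _cache:
        return _cache[key]  # identical repeated prompts are served without a provider call
    result = call_llm(prompt)
    _cache[key] = result
    return result
```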
Immediately after deployment, costs may actually increase by the amount of the Gateway's own operating expenses. This is precisely why it is dangerous to frame "Gateway adoption = cost reduction KPI" from the outset. The realistic approach is to align with stakeholders on the sequence: "cost visualization → optimization → reduction as a result."
An AI Gateway is better suited to gradual adoption than a big-bang rollout. Below, we outline the approach we have taken across multiple projects, broken down into two phases.
The first phase focuses exclusively on "routing existing LLM calls through the Gateway." No new features are added.
Concretely, the steps are: register each provider's API key on the Gateway side, switch the application's endpoint (base URL) from the provider to the Gateway, and confirm that responses, error rates, and latency are unchanged from direct calls.
The discipline of "not adding features" is critical in this phase. Enabling guardrails, cost allocation, caching, and other capabilities simultaneously makes it difficult to isolate the cause when issues arise. The first goal is to establish a state where the Gateway has simply been inserted in the middle and behavior remains identical.
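One way to keep the rollback path trivial during this phase is to make the Gateway switchable by configuration alone, as in the sketch below. LLM_GATEWAY_URL is a hypothetical environment variable; when it is set, traffic goes through the Gateway, and unsetting it restores direct provider calls while behavior is being verified.

```python
import os

from openai import OpenAI

# Phase 1: the only change is where the base URL points.
gateway_url = os.environ.get("LLM_GATEWAY_URL")  # e.g. "https://gateway.example.internal/v1"

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=gateway_url or None,  # None falls back to the SDK's default provider endpoint
)
```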
Once Phase 1 is stable, the inherent value of the Gateway is unlocked incrementally.
The candidates are the capabilities discussed earlier: cost visualization and per-application tagging, failover routing, caching, and guardrails.
Rather than enabling everything at once, a realistic pace is adding one to two features per quarter. This makes both the operational burden on the organization and the returns clearly visible.
Should you build a Gateway in-house? Adopting an existing product is the practical choice. LLM providers change their APIs, models, and pricing on a monthly basis, so the maintenance cost of an in-house Gateway tends to grow to match or exceed the cost of developing application features. Organizations that cannot afford to allocate SRE resources to areas that do not directly contribute to differentiation have all the more reason to make the rational decision of adopting an existing product.
Do you need an AI Gateway even if you already run an API Gateway? In most cases, yes. Existing API Gateways do not understand LLM JSON structures or token-based billing, so cost allocation, PII masking, and prompt-injection countermeasures would need to be implemented separately. A manageable approach is to assign the existing API Gateway to the front layer and the AI Gateway to the back layer, with clearly separated responsibilities.
Can each team or application run its own Gateway? This is not recommended, because it undermines the purpose of centralizing cost allocation, guardrails, and observability logging. The practical approach is to consolidate to a single Gateway per governance unit, such as per organization or per tenant.
Doesn't the Gateway itself become a single point of failure? It can. When choosing a managed SaaS solution, evaluate the SLA and regional configuration carefully; when self-hosting an OSS solution, design in redundancy, health checks, and fallback procedures from the start.
To summarize the key points of AI Gateway adoption: a Gateway centralizes the unified interface and failover, cost visibility and rate limiting, and guardrails and logging that otherwise end up scattered across applications; it makes costs visible rather than automatically lower; and it is best introduced incrementally, starting as a pure pass-through before enabling features one at a time.
Once you have taken the step toward operating multiple LLMs, designing the Gateway in from the start — rather than treating it as something to add later — significantly reduces operational burden and governance costs in the years ahead. In the engagements we support, the presence or absence of a Gateway directly affects the pace at which AI infrastructure matures.

Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).


