
The design approach of combining cloud LLMs with on-device SLMs and switching the processing destination based on the nature of the task is called "hybrid LLM design."
Relying on a single model creates trade-offs across three dimensions: cost, latency, and data protection. Cloud LLMs offer high accuracy but come with communication costs and delays, and can pose problems when sensitive data must be sent externally. On-device SLMs, on the other hand, excel in privacy and response speed but tend to fall short for complex reasoning tasks.
This article is aimed at AI engineers and system architects, and explains concrete strategies for routing tasks from the perspectives of cost, latency, and compliance. By the end, readers will have a clear understanding of how to choose the right hybrid configuration for their own systems and what to watch out for during implementation.
The design philosophy of combining cloud LLMs with on-device SLMs is called hybrid LLM design. The core concept is not to delegate all tasks to a single model, but to route each task to the most appropriate model based on the nature of the processing involved.
Simultaneously satisfying all three requirements—cost, latency, and data protection—is difficult with either a cloud-only or on-device-only approach. This reality is what elevates the hybrid configuration to a practical choice.
The sections that follow will cover, in order, the differences from single-model operation, the axes for routing decisions, and representative implementation patterns.
"Hybrid design," which combines cloud LLMs with on-device SLMs (Small Language Models), is fundamentally different from a simple choice of "which one to use." The biggest difference from operating either in isolation is the presence of a "routing layer" that dynamically directs processing to the appropriate destination based on the nature of the task.
In single-model operation, all requests flow to the same model. In hybrid design, by contrast, the system evaluates the characteristics of a task at the moment it is received and routes it to the optimal model.
The key differences can be summarized as follows:

- Single-model operation: every request flows to the same model, so cost, latency, and data-handling characteristics are fixed by that single choice.
- Hybrid design: a routing layer evaluates each task as it arrives and directs it to the model best suited to the processing involved.
The background to why coexistence becomes "inevitable" lies in the fact that real-world applications are not composed of uniform tasks. For example, in a business app on a smartphone, tasks requiring immediate responsiveness—such as input completion—coexist with tasks that require multi-step reasoning as part of a Compound AI System. Routing everything uniformly to the cloud increases latency and cost, while processing everything uniformly on-device can lead to quality degradation in certain cases.
Hybrid design is a structure that presupposes this non-uniformity, enabling trade-off optimization that single-model operation cannot achieve.
When operating with a cloud LLM alone, issues of cost and latency tend to surface first. Sending large volumes of tokens to the cloud on every request causes API costs to accumulate quickly, and network round-trip delays are unavoidable. In scenarios where response speed directly affects UX—such as mobile apps or terminals on a manufacturing line—this latency can be prohibitive.
Data protection considerations are equally important and cannot be overlooked. In regulated industries such as healthcare, finance, and legal services, the very act of transmitting personal or confidential information to an external server carries the risk of violating the EU AI Act, GDPR, and data localization regulations in various countries. With a cloud-only configuration, meeting these compliance requirements tends to complicate the overall design.
On the other hand, relying solely on an on-device SLM means hitting a capability ceiling: tasks that demand complex, multi-step reasoning quickly exceed what a small model can handle reliably.
In short, cloud and on-device approaches each carry different trade-offs, and attempting to cover all tasks with just one of them will inevitably require compromise somewhere. The routing decision axes explained in the next section are a thinking framework for using both in combination according to task characteristics—built on the premise of this structure where neither option alone is the optimal solution.
"Routing"—deciding which model to send a task to—is the core of hybrid design. A poor routing decision can turn an intended cost reduction into increased latency, or result in a compliance violation when convenience is prioritized.
This section organizes the three axes of cost, latency, and compliance (including data protection). These axes are not independent; in practice, routing design requires considering multiple axes simultaneously. The details of each axis are explored in the subsections that follow.
Cost optimization is one of the most direct motivations for hybrid design. Cloud LLMs carry a high per-token cost, and when large volumes of requests accumulate, monthly expenses tend to spike sharply. On-device SLMs, by contrast, have near-zero inference costs, so assigning token-heavy or high-frequency tasks to an SLM can significantly reduce expenditure.
Core Metrics for Task Routing
Practical Approach
A good starting point is to divide tasks into two categories—"short-form, structured, low-risk" and "long-form, unstructured, high-risk"—and assign the former to the SLM as the default lane. From there, measuring monthly token consumption sent to the cloud and tracking the percentage successfully handled by the SLM as an AI ROI metric makes it easier to drive a continuous improvement cycle.
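As an illustration, here is a minimal sketch of what that tracking could look like, using a hypothetical per-request log record and a made-up cloud price (actual per-token rates depend on the provider):

```python
from dataclasses import dataclass

# Hypothetical per-request log record; field names are illustrative.
@dataclass
class RequestLog:
    route: str          # "slm" or "cloud"
    input_tokens: int
    output_tokens: int

# Assumed cloud price per 1K tokens (USD); replace with your provider's rates.
CLOUD_PRICE_PER_1K_TOKENS = 0.01

def monthly_report(logs: list[RequestLog]) -> dict:
    cloud_tokens = sum(l.input_tokens + l.output_tokens for l in logs if l.route == "cloud")
    slm_share = sum(1 for l in logs if l.route == "slm") / max(len(logs), 1)
    return {
        "cloud_tokens": cloud_tokens,
        "estimated_cloud_cost_usd": cloud_tokens / 1000 * CLOUD_PRICE_PER_1K_TOKENS,
        "slm_handled_ratio": slm_share,  # the share of requests the SLM absorbed
    }
```

Watching `slm_handled_ratio` trend upward month over month is one concrete way to express the AI ROI improvement described above.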
Latency and offline support are the most intuitive axes for deciding how to route between cloud LLMs and on-device SLMs. The user experience can differ markedly between the cloud, where network round-trips alone add several hundred milliseconds of overhead, and an SLM that responds immediately via local inference.
Decision Criteria Based on Latency Requirements
Requirements for Offline and Unstable Connectivity Environments
In environments where network connectivity cannot be guaranteed—such as factory floors, aircraft, or remote field sites—routing to a cloud LLM is simply not feasible. In such scenarios, an architecture where an SLM is deployed as edge AI resident on the device, with a cloud LLM used to validate and supplement results once connectivity is restored, proves effective.
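A minimal sketch of that deferred-validation pattern, with slm_infer and cloud_validate as placeholder calls and a simple in-memory queue standing in for durable storage:

```python
from collections import deque

# Placeholder model calls; in a real system these would wrap the local runtime
# and the cloud API respectively. Names are illustrative.
def slm_infer(prompt: str) -> str:
    return f"[slm draft for: {prompt}]"

def cloud_validate(prompt: str, draft: str) -> None:
    print(f"validated: {prompt!r}")

pending_validation: deque = deque()  # drafts awaiting cloud-side validation

def handle_request(prompt: str, online: bool) -> str:
    draft = slm_infer(prompt)              # local inference keeps working offline
    pending_validation.append((prompt, draft))
    if online:
        flush_validation_queue()           # validate queued drafts when connected
    return draft

def flush_validation_queue() -> None:
    # Once connectivity is restored, let the cloud LLM check queued results.
    while pending_validation:
        prompt, draft = pending_validation.popleft()
        cloud_validate(prompt, draft)
```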
Design Considerations
It is important to evaluate latency requirements using tail percentiles (P95–P99) rather than average values. Since cloud API response times tend to spike during periods of high traffic, incorporating an on-device SLM as a fallback in scenarios with strict SLA requirements helps stabilize service quality.
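For reference, a small sketch of measuring tail latency with only the standard library (the sample values are invented):

```python
import statistics

def tail_latency(samples_ms: list[float]) -> dict:
    # quantiles(n=100) returns the 1st..99th percentile cut points:
    # index 94 is P95, index 98 is P99.
    q = statistics.quantiles(samples_ms, n=100)
    return {"mean": statistics.fmean(samples_ms), "p95": q[94], "p99": q[98]}

# Example: a mostly fast route with occasional slow cloud round-trips.
latencies = [120, 135, 140, 150, 160, 900, 1500] * 20
print(tail_latency(latencies))
```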
Note that routing considerations from a compliance perspective will be covered in detail in the next section.
Data sensitivity and regulatory requirements are often the highest-priority decision axis when routing between cloud LLMs and on-device SLMs.
Categories of Data That Should Not Be Sent to the Cloud

Typical examples include personal information (PII), patient and health records, financial account and transaction data, unpublished confidential business information, and data subject to data localization requirements.

Sending such data to a cloud LLM introduces the risk of that data passing through third-party infrastructure. In situations where compliance requires explicit control over the location of processing, local processing via an on-device SLM becomes the default approach.
Routing Approach Based on Regulations and Policies
| Condition | Recommended Route |
|---|---|
| Contains personal or confidential information | On-device SLM |
| Contains only publicly available general information | Cloud LLM |
| Subject to industry-specific regulations (finance, healthcare, etc.) | On-device SLM preferred |
| Cloud usage approved under internal policy | Cloud LLM acceptable |
It is recommended to adopt a Privacy-by-Isolation approach, designing systems so that sensitive data never leaves the device boundary.
That said, even when using a cloud LLM, it is essential to review the contractual data processing terms (DPA) and confirm the model's training data opt-out settings. Combining this with guardrail configuration and a pipeline that detects and masks PII before transmission makes it possible to safely expand the scope of cloud utilization.
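As a simple illustration of such a pipeline, the sketch below masks a couple of obvious PII patterns before a request leaves the device; real deployments typically rely on a dedicated PII detection library or guardrail service, and the patterns here are deliberately minimal:

```python
import re

# Minimal illustrative patterns; production systems need far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{2,4}-\d{2,4}-\d{3,4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace detected PII with placeholder tokens before cloud transmission."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(mask_pii("Contact taro@example.com or 03-1234-5678 for details."))
# -> "Contact [EMAIL] or [PHONE] for details."
```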
There are multiple routing strategy approaches, each reflecting a different tradeoff between design complexity and operational cost. Broadly speaking, three representative patterns exist: static routing, which pre-defines task types and assigns them to fixed destinations; cascade routing, which dynamically switches based on the model's output confidence; and SLM draft + cloud validation, in which the SLM generates a draft that the cloud LLM then verifies. Since each pattern differs in its applicable use cases and implementation complexity, selecting the pattern that fits your organization's requirements is the starting point for system design.
Static routing is an approach in which task types are defined in advance and processing destinations are fixed according to a rule table. Its greatest strengths are low implementation cost and high predictability of behavior, owing to the simplicity of the decision logic.
Basic Routing Logic
The classification axes most commonly used are a combination of three factors: input token count, task category ID, and user role. For example, a rule can be written such that if the input is 256 tokens or fewer and the category is "FAQ," the request goes to the SLM; otherwise, it goes to the cloud LLM.
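Expressed as code, a rule along those lines might look like the following sketch (the threshold, category name, and role handling are illustrative, not prescriptive):

```python
def route_request(input_tokens: int, category: str, user_role: str) -> str:
    # Rule from the example above: short FAQ-style requests stay on-device.
    if input_tokens <= 256 and category == "FAQ":
        return "on-device SLM"
    # user_role can act as a third axis (e.g., pinning certain roles on-device);
    # anything unmatched falls through to the cloud LLM by default.
    return "cloud LLM"

print(route_request(120, "FAQ", "support_agent"))       # -> on-device SLM
print(route_request(1200, "contract", "support_agent")) # -> cloud LLM
```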
Implementation Considerations
Static routing depends on the accuracy of the classification defined at design time. If task category granularity is too coarse, processing that an SLM could handle adequately will flow to the cloud LLM, causing costs to balloon. Conversely, if classification is too fine-grained, the maintenance burden on the rule table increases.
After going live, it is advisable to use AI observability tools to continuously measure latency and quality scores per route, and to periodically revisit the classification rules.
Static routing is most effective when task types are stable and quality requirements are clearly documented. If requirements are in flux, it is worth considering combining it with the dynamic routing approach introduced next.
Confidence-based dynamic routing (Cascade) is an approach that uses the confidence score of an SLM's output as a threshold to automatically determine whether to escalate to a cloud LLM. Unlike static routing, it uses "how certain that inference is" as its criterion rather than task type, allowing it to handle unexpected inputs flexibly.
How It Works
Benefits and Considerations
Implementation Notes
Rather than using a fixed threshold value, it is advisable to continuously adjust it based on real-world operational logs collected via AI observability tools. In addition, combining the confidence score with an output consistency check (a technique that samples the same input multiple times and examines the variance) can compensate for cases where the score becomes overconfident.
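A minimal sketch of this cascade, assuming a hypothetical slm_generate that returns both text and a confidence score, a hypothetical cloud_generate fallback, and the sampling-based consistency check mentioned above:

```python
import random

# Placeholder on-device call; a real implementation would expose something like
# mean token log-probability as the confidence score. Names are illustrative.
def slm_generate(prompt: str) -> tuple[str, float]:
    return f"[slm answer to: {prompt}]", random.uniform(0.5, 1.0)

def cloud_generate(prompt: str) -> str:
    return f"[cloud answer to: {prompt}]"  # placeholder cloud call

CONFIDENCE_THRESHOLD = 0.8  # tune per task from operational logs

def cascade(prompt: str, samples: int = 3) -> str:
    runs = [slm_generate(prompt) for _ in range(samples)]
    answers = {text for text, _ in runs}
    avg_conf = sum(score for _, score in runs) / samples
    # Escalate when the SLM is unsure or its samples disagree with each other.
    if avg_conf < CONFIDENCE_THRESHOLD or len(answers) > 1:
        return cloud_generate(prompt)
    return runs[0][0]
```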
The "SLM draft → cloud LLM verification" pattern introduced in the next section can be positioned as a further evolution of this Cascade architecture.
This is a two-stage pattern in which an on-device SLM rapidly generates a draft, which a cloud LLM then verifies and supplements. It is particularly effective in situations where latency, cost, and quality need to be optimized simultaneously.
How It Works
Scenarios Where This Pattern Is Suitable
Cost Impact
By pre-filtering requests to the cloud LLM through the SLM, the volume of tokens transmitted tends to decrease. Since simple inquiries and templated responses are handled entirely by the SLM, the number of cloud API calls can be reduced.
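A minimal sketch of the draft-then-verify flow, with slm_draft and cloud_review standing in for the real on-device and cloud calls:

```python
def slm_draft(prompt: str) -> str:
    return f"[draft for: {prompt}]"    # placeholder on-device call

def cloud_review(prompt: str, draft: str) -> str:
    return f"[reviewed: {draft}]"      # placeholder cloud call

def draft_then_verify(prompt: str, needs_review: bool) -> str:
    draft = slm_draft(prompt)          # fast local draft, available immediately
    if not needs_review:
        return draft                   # simple/templated requests never reach the cloud
    # Only the prompt and the draft are transmitted, not the full working
    # context, which is what keeps the token volume sent to the cloud down.
    return cloud_review(prompt, draft)
```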
Design Considerations
By combining this with the implementation details in the next section and comparing the operational costs of each pattern, you can assess how well each fits your own environment.
To not only "design" a routing strategy but also "keep it running," it is necessary to understand the implementation complexity and operational overhead in advance. The static, dynamic, and hybrid patterns each differ significantly in initial build cost and ongoing maintenance effort. Identifying which pattern is compatible with your team's size and technology stack is the starting point for a sustainable design.
Static routing is a method that classifies task types in advance using rules and fixes the processing destination (SLM or cloud LLM). Its greatest strength lies in its simple branching logic, which results in low implementation cost and high predictability of behavior.
Basic Implementation Structure
Each request is first assigned a task label (e.g., `task_type`: summarize / translate / reason), and the router branches on that label:

```python
def static_route(task_type: str) -> str:
    # Fixed rule table: the task label alone determines the destination.
    if task_type in ["faq", "translate_short"]:
        return "on-device SLM"
    elif task_type in ["legal_review", "multimodal"]:
        return "cloud LLM"
    return "cloud LLM"  # fallback for labels that fit neither rule
```

Suitable Scenarios
Static routing excels in cases where task types are determined in advance within a business workflow.
Considerations
When inputs with ambiguous task boundaries are introduced, misrouting tends to occur. It is essential to always define a fallback destination for cases that "don't fit either category." Additionally, designs that rely on user input for labeling carry the risk of unintended manipulation, so it is recommended to implement a mechanism that automatically assigns labels on the system side. With an eye toward migrating to the dynamic routing covered in the next section, recording task labels and actual processing results in logs will be useful for later accuracy evaluation.
Dynamic routing is a mechanism that evaluates an SLM's confidence score in real time and escalates to the cloud LLM only when the score does not exceed the threshold. The core of the implementation lies in two elements: the "confidence determination logic" and the "fallback trigger."
Basic Implementation Flow
Note that the specific threshold value varies significantly depending on system configuration and task characteristics, so validation in your own environment is necessary. In practice, it is realistic to set thresholds separately by task type. Tasks where the impact of errors is small, such as summarization or classification, should have a lower threshold, while tasks requiring high accuracy, such as contract review or medical summarization, should have a higher one.
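In code, per-task thresholds reduce to a small lookup table; the numbers below are placeholders and would need to be validated against your own logs:

```python
# Placeholder thresholds per task type; tune these against operational data.
ESCALATION_THRESHOLDS = {
    "summarize": 0.6,        # low impact of errors -> lower threshold, escalate less often
    "classify": 0.6,
    "contract_review": 0.9,  # high-stakes tasks escalate more aggressively
    "medical_summary": 0.9,
}
DEFAULT_THRESHOLD = 0.8

def should_escalate(task_type: str, confidence: float) -> bool:
    return confidence < ESCALATION_THRESHOLDS.get(task_type, DEFAULT_THRESHOLD)
```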
Key Metrics to Monitor
It is advisable to introduce an AI observability tool and design it to automatically trigger an alert when the escalation rate exceeds a baseline value over a given period. The baseline value must be configured and validated according to your own use case. Periodically recalibrating the threshold itself and keeping it aligned with model version upgrades and changes in domain data is the key to sustained operation.
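The escalation-rate check itself can be a few lines of aggregation over routing logs; the baseline value here is a placeholder:

```python
ESCALATION_RATE_BASELINE = 0.3  # placeholder; derive from your own traffic

def escalation_rate_exceeded(routing_log: list[str]) -> bool:
    """routing_log holds one entry per request in the period: 'slm' or 'escalated'."""
    if not routing_log:
        return False
    rate = routing_log.count("escalated") / len(routing_log)
    return rate > ESCALATION_RATE_BASELINE  # True -> raise an alert
```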
Before actually implementing a hybrid design, organizing several decision criteria in advance can reduce rework in later stages. Selecting a routing policy that simultaneously satisfies cost estimates, latency targets, and data protection requirements is what determines design quality. The next subsection dives deeper into a critical yet often overlooked perspective: alignment with existing guardrails and governance policies.
When introducing a hybrid design, failing to verify alignment with existing guardrails and governance frameworks in advance tends to cause serious compliance issues after deployment. Because the scope of applicable policies differs between cloud LLMs and on-device SLMs, a centralized management mechanism is essential.
Key Points to Verify
The earlier this alignment work is done in the design phase, the less rework it causes later. Documenting the routing logic while referencing existing AI TRiSM frameworks and internal security policies will significantly simplify future audit responses.
When actually considering and implementing a hybrid design, questions arise one after another—"How do you build the router?" and "What do you do when the SLM makes a wrong judgment?" This section addresses questions that are particularly common in design and operational settings, providing concise answers. Clarifying these questions before getting into detailed implementation helps streamline subsequent decision-making.
Router implementation approaches fall broadly into two categories: "rule-based" and "LLM-based." Since each is suited to different situations, there is no need to decide on one universally.
Rule-Based Router
LLM-Based Router
In practice, a two-stage approach—first performing a rough sort with rule-based routing, then passing only difficult cases to an SLM—is considered effective. For example, inputs that satisfy conditions such as "token count below a certain threshold and no confidentiality flag" are sent immediately to the on-device SLM, while all others are evaluated by an SLM-based classifier for intent before being forwarded to the cloud LLM.
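A sketch of this two-stage router, where classify_intent stands in for the lightweight SLM-based classifier and the conditions are illustrative:

```python
def classify_intent(text: str) -> str:
    # Stand-in for a lightweight SLM classifier; a real one would return an
    # intent label (e.g. "faq", "reasoning") from a small local model.
    return "reasoning" if len(text.split()) > 50 else "faq"

def two_stage_route(text: str, token_count: int, confidential: bool) -> str:
    # Stage 1: cheap rule-based sort handles the obviously local cases.
    if token_count <= 256 and not confidential:
        return "on-device SLM"
    # Stage 2: the SLM classifier triages the rest before the cloud LLM sees it.
    intent = classify_intent(text)
    return f"cloud LLM (intent: {intent})"
```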
The following table summarizes selection guidelines.
| Condition | Recommended Router |
|---|---|
| Few, stable task types | Rule-based |
| Diverse tasks, frequent changes | LLM-based (lightweight SLM) |
| Low latency is the top priority | Rule-based |
| Accuracy is the top priority | LLM-based |
It is also important to include the router itself as a monitoring target within AI observability. Since accumulated misclassifications can lead to increased costs and quality degradation, incorporating regular reviews of routing logs into the operational workflow is advisable.
A hybrid design combining cloud LLMs and on-device SLMs is a practical approach for simultaneously optimizing across three axes: cost, latency, and compliance. A configuration that relies on only one or the other will inevitably require compromise on at least one of these axes.
Looking back at the routing strategies covered in this article, the options can be broadly organized into three categories.
As a starting point for design, it is recommended to first identify which tasks handle sensitive data. Once compliance requirements are clearly defined, the scope of on-device processing naturally becomes apparent, and the volume of tokens sent to the cloud can be reduced accordingly.
Next, incorporate guardrails and AI governance policies upstream of the router. Since the routing decisions themselves can become a new risk surface, a continuous monitoring framework using AI observability to track decision logs is indispensable.
A hybrid configuration is not something you build once and consider finished. Building a continuous review cycle for routing rules into the design phase from the outset—one that adapts to model updates and evolving business requirements—is what drives long-term improvement in AI ROI.

Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).