
Edge AI is an approach in which AI models are executed on the device side (smartphones, PCs, cameras, industrial equipment, etc.) rather than in the cloud, and on-device LLM is the practice of running large language models (LLMs) on those devices. For business operations constrained by the three challenges that cloud LLMs struggle to address—low latency requirements, prohibition on data transfer outside the organization, and unstable connectivity—Edge AI is becoming the practical first choice.
This article organizes the fundamental concepts of Edge AI and on-device LLMs, the differences from cloud LLMs, a decision framework for applying them to business operations, and the implementation process, targeting IT departments and DX promotion personnel considering AI for business use. The goal is that by the end of the article, readers will be able to determine whether their own business operations should favor the edge, whether cloud is sufficient, or whether a hybrid approach is appropriate.
Edge AI is a design philosophy of "running AI right where data is generated." Compared to AI that assumes data will be sent to the cloud, it behaves fundamentally differently in three areas: latency, communication costs, and privacy.
Here, we first clarify the relationship between the two terms—Edge AI and on-device LLM—and establish their positioning relative to cloud LLMs. Having a conceptual map before diving into technical details will make it easier to understand the decision criteria for choosing between them that appear in later sections.
Edge AI is a concept that combines edge computing (a design in which processing is performed close to where data is generated) with AI, executing inference on devices, on-site equipment, gateway devices, and similar hardware rather than in the cloud. Facial recognition on smartphones, inspection cameras in factories, and dangerous behavior detection in dashcams are all typical examples of Edge AI.
Among these, running large language models (LLMs) on the device side is what constitutes on-device LLM. LLMs were originally designed to run on massive GPU clusters in the cloud, but a combination of model compression technology (quantization), the evolution of smaller models (SLM: Small Language Model), and the increasing performance of NPUs (Neural Processing Units) built into smartphones and laptops has made it possible to run models with several billion parameters on personal devices.
"Local LLM" and "on-device LLM" are concepts with significant overlap, but this article treats them as distinct. Local LLM carries the broader meaning of "running an LLM on infrastructure managed by the organization (on-premises servers or in-house GPU machines)," while on-device LLM refers specifically to cases where the model runs on the devices or on-site equipment that users physically handle. A detailed comparison of server-side local LLMs is covered in Local LLM / SLM Implementation Comparison, so this article focuses on design decisions closer to the edge.
Cloud LLMs and Edge AI (on-device LLMs) differ fundamentally in how data flows and where the infrastructure resides, even when the goal of "using a language model" is the same.
| Perspective | Cloud LLM | Edge AI / On-Device LLM |
|---|---|---|
| Where inference is executed | Provider's data center | Device, on-site equipment, or gateway |
| Input data transmission | Sent to provider's servers | Remains on the device |
| Network required | Yes (stops if connection is lost) | Not required (offline operation possible) |
| Model scale | Hundreds of billions of parameters possible | Hundreds of millions to several billion (post-quantization) |
| Billing model | Per-token usage fees | Upfront investment in device/model + electricity costs |
| Model updates | Automatic and immediate | Requires manual distribution or OTA updates |
The greatest strength of cloud LLMs is immediate access to the latest large-scale models, and they are incomparably faster to get up and running. Edge AI, on the other hand, has characteristics that cloud solutions cannot structurally replicate: data never leaves the device, performance is unaffected by network conditions, and heavier usage incurs no additional per-use cost.
The key point is that the two are not mutually exclusive—the optimal combination varies depending on business requirements. In the framework presented in the latter half of this article, business operations that meet any one of three conditions will be categorized as edge-first; those that do not will favor cloud LLMs; and cases where the decision is split will call for a hybrid configuration.
The discussion around Edge AI is not new, but over the past one to two years the situation has shifted to one where "on-device LLMs have become a practical solution for business use." Behind this shift are two structural changes.
The first is that the widespread adoption of cloud LLMs has led more companies to reassess "cost, latency, and privacy" from a business perspective. The second is that advances in quantization and smaller models have made it possible to run processes that would have required high-performance servers just a few years ago on commercially available laptops and smartphones.
Companies that have integrated cloud LLMs into their operations often run into unexpected obstacles after deployment. In most cases, these involve cost, latency, privacy, or some combination of the three.
Cost: Internal chatbots and summarization batch jobs may cost only a few tens of thousands of yen per month during the PoC phase, but when deployed to a workforce of 1,000 employees with full logs fed into the model every time, token-based billing can balloon rapidly. From the perspective of the accounting team that wants to treat this as a fixed cost, an API bill that spikes fivefold in some months is a headache for budget management. Comprehensive optimization strategies for LLM costs are covered in detail in the LLM Cost Optimization Guide.
Latency: Cloud API calls carry the overhead of network round trips, queue wait times, and server-side batch processing. This is fine for use cases where second-level response times are acceptable, but for applications such as "detecting a person the moment they appear on camera" or "real-time detection of abnormal sounds on a manufacturing line," delays of even a few hundred milliseconds can be operationally fatal.
Privacy / Data Sovereignty: There is a wide range of data—patient medical records, loan documents, engineering drawings, footage of active construction sites—whose external transmission is restricted by terms of service, contracts, or regulations. In some cases, a cloud provider's encryption and regional data agreements alone are not enough to satisfy an organization's internal data governance officer.
Edge AI delivers the greatest benefit precisely in operations where constraints along these three axes are most severe. Conversely, for operations with no clear constraints on any of the three axes, moving forward with a cloud LLM will yield a faster start.
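To make the cost axis above concrete, the sketch below runs the arithmetic with purely hypothetical figures—user counts, request volumes, and per-token prices are placeholders, not any vendor's pricing. The point is only how quickly token-based billing grows once a pilot becomes a company-wide rollout.

```python
# Rough monthly-cost arithmetic for token-based billing.
# Every figure below is a hypothetical placeholder, not any vendor's pricing.

def monthly_token_cost(users: int, requests_per_user_per_day: int,
                       tokens_per_request: int, price_per_1k_tokens_jpy: float,
                       working_days: int = 20) -> float:
    """Estimated monthly spend when billing is proportional to tokens used."""
    total_tokens = users * requests_per_user_per_day * tokens_per_request * working_days
    return total_tokens / 1000 * price_per_1k_tokens_jpy

# PoC phase: 30 pilot users, modest prompts
print(f"PoC:     {monthly_token_cost(30, 10, 4000, 1.0):>12,.0f} JPY/month")
# Company-wide rollout: 1,000 users, full logs attached to every request
print(f"Rollout: {monthly_token_cost(1000, 10, 8000, 1.0):>12,.0f} JPY/month")
```

With these placeholder numbers, the same workload goes from roughly twenty-four thousand yen a month in the PoC to well over a million yen a month after rollout—exactly the kind of jump that makes budget owners nervous.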
Two advances on the model side are what make on-device LLMs viable for real-world business use.
The first is quantization. This is a technique that compresses model weights from 16-bit to 8-bit or 4-bit precision, dramatically reducing the required VRAM and memory bandwidth. As a general trend, community benchmarks report that Q8 (8-bit) shows almost no detectable accuracy degradation compared to FP16, and that Q4 (4-bit) can remain within a practical range depending on the task (since results are task- and data-dependent, re-validation on your own business workloads is essential). This has allowed models that previously required a GPU to run on CPUs as well, extending their reach to laptops, smartphones, and embedded devices.
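As a rough illustration of why this matters for device memory, the back-of-the-envelope calculation below estimates the weight footprint of a 3-billion-parameter model at different precisions. Actual memory use is higher once the KV cache and runtime overhead are included, so treat the figures as lower bounds.

```python
# Back-of-the-envelope weight footprint of a language model at different precisions.
# Actual memory use is higher once the KV cache and runtime overhead are added,
# so treat these as lower bounds, not exact requirements.

def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1024**3

for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"3B-parameter model at {label}: {weight_footprint_gb(3, bits):.1f} GB")
# FP16 ~5.6 GB, Q8 ~2.8 GB, Q4 ~1.4 GB -- roughly the difference between
# "needs a discrete GPU" and "fits in a laptop's or smartphone's memory".
```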
The second is the rapid improvement in quality of SLMs (Small Language Models). Microsoft's Phi series, Google's Gemma, Meta's Llama 3.2 1B / 3B, Apple's On-Device Foundation Models, and others—major players are releasing models one after another that aim for "practical quality on specific tasks even at a few billion parameters." Many of these are distributed as open-weight models, and fine-tuning for specific business applications is also possible.
In practice, smartphones have already entered a phase where OS-native on-device LLMs—such as Apple Intelligence and Google's Gemini Nano—are up and running. The assumption that "LLMs can only run in the cloud" is rapidly becoming a thing of the past.
The previous section examined why edge AI is attracting attention through the three axes of cost, latency, and privacy. Here, we go one step further and make the differences concrete from two perspectives: "cost structure and latency" and "operations and offline functionality."
In business adoption decisions, it is often a clear understanding of these operational differences—rather than conceptual distinctions—that proves decisive.
Looking at cost structure, cloud LLMs and edge AI represent polar opposites: "zero upfront investment / pay only for what you use" versus "upfront investment required / additional usage costs only electricity."
| Item | Cloud LLM | Edge AI (On-Device LLM) |
|---|---|---|
| Upfront investment | ¥0 | Devices, model development, distribution infrastructure |
| Per-use cost | Proportional to input + output tokens | Device electricity cost (essentially negligible) |
| Behavior at scale | Increases linearly with usage volume | Proportional to number of devices (the same model can be rolled out horizontally) |
| Cost predictability | Difficult to forecast monthly costs due to usage fluctuations | Close to a fixed cost; easier to budget |
When usage volume is unpredictable, cloud pay-as-you-go billing is overwhelmingly advantageous. However, once usage stabilizes and volume grows, there comes a point where the edge side overtakes the cloud in TCO (Total Cost of Ownership). The break-even point on the server side is covered in detail in Local LLM / SLM Adoption Comparison.
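A minimal sketch of that break-even comparison follows. Every figure—device cost, setup cost, token volume, prices—is a hypothetical placeholder to be replaced with your own quotes; the point is simply that cumulative pay-per-token spend eventually crosses a mostly fixed edge investment.

```python
# Simplified TCO comparison: cumulative pay-per-token spend vs. a mostly fixed
# edge investment. All numbers are hypothetical placeholders -- substitute your
# own quotes and measured usage before drawing conclusions.

def cloud_tco(monthly_tokens_millions: float, price_per_1k_tokens_jpy: float,
              months: int) -> float:
    return monthly_tokens_millions * 1e6 / 1000 * price_per_1k_tokens_jpy * months

def edge_tco(devices: int, device_cost_jpy: int, setup_cost_jpy: int,
             monthly_ops_jpy: int, months: int) -> float:
    return devices * device_cost_jpy + setup_cost_jpy + monthly_ops_jpy * months

for months in (6, 12, 24, 36):
    cloud = cloud_tco(monthly_tokens_millions=1600, price_per_1k_tokens_jpy=1.0,
                      months=months)
    edge = edge_tco(devices=1000, device_cost_jpy=30_000, setup_cost_jpy=5_000_000,
                    monthly_ops_jpy=300_000, months=months)
    print(f"{months:>2} months  cloud {cloud:>12,.0f} JPY   edge {edge:>12,.0f} JPY")
```

With these placeholder numbers the cloud is cheaper for the first couple of years and the edge configuration overtakes it around the three-year mark; the crossover point in your own case depends entirely on usage volume and device count.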
On latency, the edge side has a straightforward advantage in raw response speed since no network round trip is required. Cloud API calls can carry overhead of several hundred milliseconds even via the Tokyo region, which becomes a decisive factor for operations where real-time responsiveness is a requirement. Conversely, for operations where a response time of several seconds is acceptable—such as overnight batch summarization of meeting minutes—the latency difference is not a factor in the decision.
On the operational side, the greatest advantage of cloud LLMs is that "model updates are delivered automatically." When a provider swaps out the backend model, end users gain access to improved performance without doing anything. For use cases that require always using the latest model, this is a powerful characteristic.
With edge AI, by contrast, new model weights must be redistributed to every deployed device. This makes MLOps design for distribution a necessity—whether that means updates via the App Store / Google Play for smartphone apps, an OTA (Over-The-Air) update mechanism for industrial IoT devices, or an MDM (Mobile Device Management) rollout for company-issued laptops.
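As a concrete picture of the device-side half of that distribution design, here is a minimal sketch of a model update check. The manifest URL, JSON fields, and file layout are hypothetical, and a real deployment would go through the App Store / Google Play, an OTA service, or MDM rather than a hand-rolled downloader—but the version check, integrity verification, and rollback-safe swap are the same steps regardless.

```python
# Minimal sketch of a device-side model update check. The manifest endpoint,
# JSON fields, and file layout are hypothetical; a real deployment would use
# an app store, OTA service, or MDM tool, but the steps are the same.
import hashlib
import json
import urllib.request
from pathlib import Path

MANIFEST_URL = "https://models.example.com/manifest.json"   # hypothetical endpoint
MODEL_DIR = Path("/opt/edge-app/models")

def current_version() -> str:
    version_file = MODEL_DIR / "VERSION"
    return version_file.read_text().strip() if version_file.exists() else "none"

def check_and_update() -> None:
    with urllib.request.urlopen(MANIFEST_URL, timeout=10) as resp:
        manifest = json.load(resp)   # e.g. {"version": ..., "url": ..., "sha256": ...}
    if manifest["version"] == current_version():
        return                                             # already up to date
    candidate = MODEL_DIR / "candidate.bin"
    urllib.request.urlretrieve(manifest["url"], candidate)
    digest = hashlib.sha256(candidate.read_bytes()).hexdigest()
    if digest != manifest["sha256"]:
        candidate.unlink()                                  # keep the old model (rollback-safe)
        raise RuntimeError("checksum mismatch; update aborted")
    candidate.replace(MODEL_DIR / "model.bin")              # atomic swap on the same filesystem
    (MODEL_DIR / "VERSION").write_text(manifest["version"])
```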
Finally, offline operation is worth highlighting as a strength of edge AI that cloud simply cannot match.
With cloud LLMs, an outage on the API vendor's side carries the risk of halting the entire operation. Even with an SLA in place, user business losses are often not compensated, and for operations where "never going down" has intrinsic value, edge-side redundancy is the practical answer.
The decision to adopt edge AI should be worked backward from "business requirements," not from technical capabilities or trends. When "wanting to try it" comes first, it's easy to fall into the pattern of investing in server installation and model development, only to realize later that a cloud LLM would have been sufficient.
Here, we outline the three conditions that make edge AI the first choice, along with a framework for distinguishing cases where a cloud LLM is adequate. At the very first stage of evaluating adoption, always examine your own operations from both sides.
A practical way to determine whether a business operation qualifies as a primary candidate for edge AI is to assess whether at least one of the following three conditions is clearly met. Operations that meet none of these conditions have little economic rationale for being pushed toward the edge.
| Condition | Description | Business Examples |
|---|---|---|
| Low Latency | A response within a few hundred milliseconds is operationally required | Manufacturing line anomaly detection, driver assistance, in-store voice operation |
| Data Cannot Leave the Premises | External transmission is prohibited or sensitive due to regulations, contracts, or laws | Medical records, financial documents, design drawings, security footage |
| Unstable Connectivity | Operations must function on the assumption that communication may be interrupted | Inside logistics vehicles, construction sites, overseas locations, outdoor events |
For example, consider an operation where a small in-store camera must issue a voice alert the moment a child approaches a shelf—the entire pipeline from image capture to speech synthesis must complete within 1 second. Routing through the cloud may not be fast enough depending on connection quality, and sending a child's facial image to an external service is itself likely to raise privacy concerns. Since two of the three conditions—low latency and data that cannot leave the premises—apply simultaneously, this can be clearly identified as a primary candidate for edge AI.
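To show how quickly the 1-second requirement gets consumed, here is a toy latency budget for that pipeline; the per-stage figures are hypothetical and would be replaced by measurements on the actual device during a PoC.

```python
# Toy latency budget for the in-store alert pipeline above. The per-stage
# figures are hypothetical placeholders, not measured values.
BUDGET_MS = 1000   # end-to-end requirement: alert within 1 second

stage_budget_ms = {
    "camera capture + preprocessing": 100,
    "on-device detection model":      350,
    "alert text generation (SLM)":    250,
    "speech synthesis + playback":    200,
}

total = sum(stage_budget_ms.values())
print(f"planned total: {total} ms, margin: {BUDGET_MS - total} ms")
assert total <= BUDGET_MS, "pipeline does not fit the 1-second budget"
# A single cloud round trip of several hundred milliseconds would consume most
# of this margin, which is why the whole pipeline stays on the device.
```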
Conversely, consider an operation that runs a nightly batch job to summarize the previous day's internal meeting recordings. Latency is irrelevant, and the task can be completed simply by sending data from an internal data center to a cloud API and receiving the result. There is little economic justification for investing in edge AI here.
Using the three conditions as a checklist reduces inconsistency in decision-making. At the project proposal stage, being able to write in a single line which condition applies—and, if multiple apply, what the priority order is—keeps subsequent decisions consistent.
As the inverse assessment, operations with no clear constraint on any of the three axes—second-level latency is acceptable, the data may be sent externally, and connectivity is stable—are often better served by a cloud LLM, especially when they depend on the reasoning power of the latest large-scale models or when usage volume is still small and unpredictable.
An important point here is that there is room to design a solution that does not treat "meets the three conditions" versus "cloud-suited" as a binary choice. In a hybrid configuration, sensitive preprocessing—such as masking personal information, summarizing internal documents, or filtering search targets—is handled on the edge side, and the output is then passed to a cloud LLM for advanced reasoning. This allows you to leverage the intelligence of the cloud while keeping raw data from leaving the premises.
Expressed as a simple decision flow:
1. Does the operation clearly meet at least one of the three conditions (low latency, data that cannot leave the premises, unstable connectivity)? If not, start with a cloud LLM.
2. If it does, can the task be handled by a model that fits on the target device? If so, design edge-first.
3. If an edge condition applies but the task also depends on the reasoning power of large cloud models, split it into a hybrid configuration: sensitive or latency-critical processing on the edge, advanced reasoning in the cloud.
Following this sequence makes it easier to avoid both over-investment and under-investment failures.
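For teams that want the checklist in executable form, here is a literal transcription of the flow above into a small helper; the inputs are yes/no judgments made by the business owner, and the function merely encodes the ordering described in this article.

```python
# The decision flow above, transcribed into a small helper for use as a
# checklist at the proposal stage. Inputs are yes/no judgments made by the
# business owner; the function only encodes the ordering described in the text.

def deployment_direction(low_latency_required: bool,
                         data_must_stay_on_premises: bool,
                         connectivity_unstable: bool,
                         needs_latest_large_model: bool) -> str:
    edge_conditions = sum([low_latency_required,
                           data_must_stay_on_premises,
                           connectivity_unstable])
    if edge_conditions == 0:
        return "cloud LLM"
    if needs_latest_large_model:
        return "hybrid: sensitive / latency-critical processing on edge, reasoning in cloud"
    return "edge-first"

print(deployment_direction(True, True, False, False))    # in-store camera example -> edge-first
print(deployment_direction(False, False, False, True))   # nightly meeting summaries -> cloud LLM
```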
Up to this point, we have organized the rationale for choosing edge AI and the criteria for assessing business fit. When actually moving forward with adoption, it is safer to proceed in stages—from use case selection through model, hardware, and operational design.
Aiming for production deployment from the outset makes investment decisions too large, so we recommend a PoC (Proof of Concept) approach: narrow the focus to a single business operation and verify both the expected impact and the operational burden.
The first step in use case selection is to score the candidate operations raised internally across three axes—"degree of fit with the three conditions," "whether the impact can be measured quantitatively," and "whether the PoC can be completed within 2–3 months"—and then narrow down to the single top candidate. Attempting to run multiple operations in parallel tends to scatter edge-side operational know-how, leaving all of them half-finished.
For PoC design, documenting the following items upfront keeps subsequent decisions aligned.
| Item | What to Decide |
|---|---|
| Success Criteria | Quantitative business targets (e.g., detection accuracy ≥ 90%, response time ≤ 500ms) |
| Comparison Baseline | Metrics for comparison against cloud LLM / existing methods |
| Target Device | Fixed to a single model (horizontal rollout can be considered later) |
| Evaluation Data | 50–200 representative samples from your own business operations |
| Exit Criteria for Failure | The threshold below which production deployment will not proceed |
At the PoC stage, it is recommended not to fine-tune on proprietary data right away. Start by deploying a publicly available model as-is on the device, and measure accuracy, speed, and device-side resource consumption using your own business samples. If the results reach roughly 80% of the target, parameter-efficient fine-tuning (PEFT) techniques such as LoRA can then be used to push accuracy higher.
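A minimal sketch of that measurement loop is shown below. The `run_on_device` callable stands in for whatever inference runtime the PoC actually uses, the exact-match scoring is only one example of a success criterion, and the thresholds mirror the sample targets from the table above.

```python
# Minimal sketch of the PoC measurement loop: run the off-the-shelf model over
# 50-200 business samples and compare against the success criteria. The
# run_on_device callable, the exact-match scoring, and the thresholds are all
# placeholders to be replaced by your own runtime and evaluation rules.
import time

TARGET_ACCURACY = 0.90   # e.g. detection accuracy >= 90%
TARGET_P95_MS = 500      # e.g. response time <= 500 ms

def evaluate(samples: list[dict], run_on_device) -> dict:
    correct, latencies = 0, []
    for sample in samples:
        start = time.perf_counter()
        output = run_on_device(sample["input"])
        latencies.append((time.perf_counter() - start) * 1000)
        correct += int(output.strip() == sample["expected"].strip())
    accuracy = correct / len(samples)
    p95_ms = sorted(latencies)[max(0, int(len(latencies) * 0.95) - 1)]
    return {"accuracy": accuracy, "p95_ms": p95_ms,
            "meets_criteria": accuracy >= TARGET_ACCURACY and p95_ms <= TARGET_P95_MS}

# Usage: result = evaluate(load_samples("poc_samples.json"), run_on_device=my_model)
```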
Conversely, a PoC may reveal that a cloud LLM delivers dramatically higher accuracy and that the three-condition constraints were not as strong as initially assumed. A PoC that enables a well-informed decision to withdraw is itself a valuable PoC.
Once a PoC has revealed a viable path forward, the next step is selecting the three elements needed for production deployment: model, hardware, and MLOps.
Model: Determine whether a task-specific SLM is sufficient for the business task at hand, or whether a general-purpose LLM is required. For narrowly scoped tasks such as document summarization, classification, extraction, and named entity recognition, SLMs in the class of Phi-4, Gemma 3, and Llama 3.2 1B/3B are realistic candidates. Multi-step reasoning or extended conversations will require a larger model. Note that model licenses vary significantly from model to model (Apache 2.0, Llama Community License, Gemma Terms of Use, etc.), so always verify the official license directly before any commercial use.
Hardware: Device-side options differ considerably depending on the use case.
| Category | Target Devices | Suitable Use Cases |
|---|---|---|
| Smartphones / Tablets | iPhone (A-series) / Android (NPU-equipped devices) | Fieldwork, retail, on-site inspection |
| Laptops | Apple Silicon, Copilot+ PCs | Knowledge workers at their desks, internal business support |
| Industrial IoT / Embedded | NVIDIA Jetson, Raspberry Pi 5 + AI HAT | Manufacturing lines, in-vehicle systems, outdoor installations |
| Edge Servers | Compact GPU servers installed on-premises | Aggregated processing of data within a facility |
MLOps: Edge deployments require operational design considerations that are unnecessary with cloud LLMs. These include model delivery pipelines (A/B delivery, staged rollouts, rollbacks), inference log collection and monitoring, and periodic model update evaluation. For organizations that find it difficult to maintain dedicated ML engineers, the practical approach is to narrow the operational scope and design for minimal operational overhead. For workflows involving document retrieval, there is also an approach of building RAG locally on the device.
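As an illustration of what "RAG locally on the device" can look like at its simplest, the toy sketch below keeps both the document index and the retrieval step on the device. The bag-of-words scoring is only there to keep the example self-contained; a real implementation would use a small local embedding model instead.

```python
# Toy sketch of fully local retrieval for on-device RAG: the document index and
# the retrieval step both stay on the device. The bag-of-words scoring is only
# a self-contained stand-in for a small local embedding model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = [
    "Expense claims must be submitted by the fifth business day of the month.",
    "VPN access requires prior approval from the information security team.",
    "Meeting rooms are booked through the internal portal.",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query: str, k: int = 2) -> list[str]:
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# The retrieved passages are prepended to the on-device LLM's prompt, so
# neither the documents nor the query ever leave the device.
print(retrieve("When is the deadline for expense claims?"))
```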
Here we address two frequently asked questions about edge AI and on-device LLM adoption that the main text could not fully cover.
The first common question is whether on-device LLMs are already practical enough for business use. When tasks are kept narrowly scoped, there are already many cases where they are entering practical territory. OS-level integration of on-device LLMs—such as Apple Intelligence and Google's Gemini Nano—is advancing, and designs in which lightweight tasks like text proofreading, summarization, short-form translation, and notification priority judgment are handled entirely on the device are becoming increasingly common.
However, if you require knowledge and reasoning capabilities on par with the latest large cloud-based models, a gap still exists at this point in time. The realistic approach is to scope the task clearly, then supplement accuracy with fine-tuning or RAG where necessary. Thinking of on-device LLMs as a tool scoped to specific business tasks—rather than as a replacement for general-purpose intelligence—makes it easier to avoid misjudgments.
The second common question is whether edge AI and cloud LLMs can be combined. They can—and in practice, a hybrid configuration is often the most workable solution for many business workflows. The representative pattern is a division of roles: "sensitive processing on the edge, intelligent processing in the cloud."
For example, in a clinical workflow that generates consultation summaries, an on-device LLM masks the personally identifiable information in a patient's speech and produces an initial summary, and only that anonymized text is sent to a cloud LLM for more advanced summarization and differential diagnosis suggestions. Raw data remains within the facility while still leveraging the cloud's inference capabilities. The same architecture can be applied to credit assessment support in finance and to enterprise chatbots.
In a hybrid configuration, the key design decision is where to draw the boundary between edge and cloud. If the boundary is too shallow, the result is "almost everything ends up being sent to the cloud anyway." If it is too deep, the result is "too much inference capability is demanded from the device, and performance suffers." A practical guideline is to identify which of the three conditions is most strongly applicable, and then draw the shallowest boundary that satisfies that constraint.
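The edge-side half of this pattern can be as small as the sketch below. The regex rules are purely illustrative stand-ins; a production masker would use a dedicated PII detection or NER model tuned to the domain (patient names, IDs, addresses) before any text is allowed off the device.

```python
# Sketch of the edge-side masking step in the hybrid pattern. The regex rules
# below are illustrative stand-ins only; a production masker would use a
# dedicated PII detection / NER model tuned to the domain.
import re

MASK_RULES = [
    (re.compile(r"\b\d{2,4}-\d{2,4}-\d{3,4}\b"), "[PHONE]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{4}/\d{1,2}/\d{1,2}\b"), "[DATE]"),
]

def mask_pii(text: str) -> str:
    for pattern, replacement in MASK_RULES:
        text = pattern.sub(replacement, text)
    return text

raw = "Patient reachable at 090-1234-5678, follow-up on 2025/04/01, contact taro@example.com"
masked = mask_pii(raw)
print(masked)   # only this masked text would be sent to the cloud LLM
```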
Edge AI and on-device LLMs are an approach that provides design flexibility against three constraints that are structurally difficult to resolve with cloud LLMs: low latency, prohibition on data transfer outside the organization, and unstable connectivity. The technology is still evolving, but advances in quantization and SLMs, along with the proliferation of on-device NPUs, have brought us to a stage where practical solutions for business applications are expanding.
Four key points for adoption decisions are summarized below.
1. Edge AI becomes the first choice when at least one of the three conditions—low latency, data that cannot leave the organization, or unstable connectivity—clearly applies; if none apply, a cloud LLM gets you started faster.
2. The cost structures are opposites: cloud is pay-as-you-go with no upfront investment, while edge requires upfront investment but runs at a near-fixed cost, so TCO can reverse once usage stabilizes and grows.
3. Edge adoption requires its own operational design—model distribution (app store, OTA, MDM), monitoring, and periodic update evaluation—that cloud providers otherwise handle automatically.
4. Proceed in stages: narrow the scope to a single operation, run a PoC with quantitative success criteria, and consider a hybrid configuration that keeps sensitive processing on the edge while using the cloud for advanced reasoning.
For detailed comparisons of server-side local LLMs/SLMs, GPU requirements, and break-even points, refer to Local LLM / SLM Adoption Comparison. For token cost optimization during the operational phase, refer to LLM Cost Optimization Guide.
In the world of business AI, between "sending everything to the cloud" and "handling everything in-house" lies a rich design space that includes edge AI. Use this article as a resource for thinking about where in that space it is optimal to land, starting from your own organization's business constraints.

Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).