
Open-weight language models, led by GPT OSS, which OpenAI released under the Apache 2.0 license, are beginning to record benchmark scores comparable to cloud APIs on specific business tasks. In particular, SLMs (Small Language Models) are closing the accuracy gap with cloud APIs on routine tasks such as classification, summarization, and structured data extraction, and are attracting attention as an option that achieves both data sovereignty and cost optimization. This article is aimed at IT and security professionals in the finance, healthcare, and manufacturing sectors, where restrictions on sending data to the cloud apply, and compares the trade-offs between local LLMs/SLMs (including GPT OSS) and cloud APIs across three axes: GPU requirements, per-task accuracy, and TCO. By the end, you should have what you need to determine the optimal architecture and model combination for your organization.

There are three main ways to integrate AI language models into business operations: cloud APIs, local LLMs, and SLMs. These three can be organized along two axes: "where the model runs" and "the scale of the model."
Cloud APIs are an approach that sends HTTP requests to hosted models provided by OpenAI, Anthropic, and Google. While infrastructure management is unnecessary, input data passes through the provider's servers.
Local LLMs refer to the approach of running large language models on your own servers or workstations. Large models such as Llama 4 Maverick (400B total parameters, 17B active) and GPT OSS 120B (117B total parameters, 5.1B active) fall into this category. Recent large models adopt a MoE (Mixture of Experts) architecture, which significantly reduces active parameters relative to total parameter count, enabling operation with fewer GPUs than previously required.
SLMs (Small Language Models) are lightweight models with parameter counts kept to a few billion or fewer. Representative examples include GPT OSS 20B (21B total parameters, 3.6B active), Phi-4 (14B), Gemma 3 (4B / 12B / 27B), Qwen 3 (7B), and Llama 4 Scout (109B total parameters, 17B active). A key characteristic is that "small" does not mean "low performance"—through MoE and training on high-quality data, these models can achieve accuracy on specific tasks approaching that of larger models.
GPT OSS in particular is the first open-weight model family released by OpenAI under the Apache 2.0 license, surpassing 9 million downloads on HuggingFace within weeks of its release. It shares the same tokenizer as the cloud API GPT series and supports tool calling and Chain-of-Thought reasoning, resulting in a low barrier to migration from existing GPT-based workflows.
The primary reason cloud APIs are passed over in financial and medical settings is the issue of Data Sovereignty. It is not uncommon for internal policies or industry guidelines to explicitly prohibit sending patient medical records or customer transaction data to the cloud.
| Item | Cloud API | Local LLM | SLM (Local) |
|---|---|---|---|
| Data location | Provider's servers | On-premises servers | On-premises servers / Edge |
| Network requirements | Internet required | Intranet possible | Offline capable |
| Regulatory compliance | Ensured via contracts / DPA | Fully self-managed | Fully self-managed |
| Audit trail | Provider-dependent | Full in-house control | Full in-house control |
Both local LLMs and SLMs ensure that data never leaves the organization's own environment. The difference lies in model size and required resources. The next section outlines the criteria for deciding which option to choose.

Local execution is not a silver bullet. In order to objectively assess the return on investment, it is worth distinguishing between cases where adoption is strongly recommended and cases where the cloud is sufficient.
The impact of deploying local LLMs/SLMs is most visible in the following three patterns. Since specific figures vary greatly depending on the organization and task, this section focuses on the structure of each pattern.
Pattern 1: Classification and Summarization of Confidential Data (Finance). Organizations may want to use AI to summarize loan screening documents, but cannot send customer financial information to the cloud. By deploying an SLM on an internal server, the summarization of screening documents can be automated without data leaving the premises. The degree of time savings compared to manual processing depends on document complexity and model accuracy tuning.
Pattern 2: Manufacturing Line Log Analysis (Manufacturing). An SLM deployed on a factory edge server analyzes equipment logs to detect anomaly patterns. Routing requests through a cloud API introduces network round-trip latency, whereas local execution eliminates that overhead. However, since SLM inference itself takes anywhere from tens to hundreds of milliseconds, the definition of "real-time" must be validated against operational requirements.
Pattern 3: Structuring Electronic Medical Records (Healthcare). One approach involves using an SLM to convert free-text clinical notes into SOAP format. Because patient data never leaves the internal hospital network, compliance with the Act on the Protection of Personal Information and medical information guidelines is easier to maintain. However, since the healthcare domain falls under YMYL (Your Money or Your Life), a human review process for model outputs is essential.
On the other hand, cloud APIs are the more rational choice when there are no restrictions on sending data externally, when monthly token volume is low, or when tasks require the complex long-context reasoning where large cloud models still lead. Conversely, when there is data that cannot be sent to the cloud and monthly token volume is substantial to some degree, the return on investment for local execution starts to make sense.

The first stumbling block in local execution is figuring out "which models can run on which hardware." With the widespread adoption of MoE architectures, total parameter count alone is no longer sufficient to determine memory requirements. The goal here is to work backwards from active parameter count and quantization level to enable model selection that fits within your budget.
The table below organizes the major open-weight models by GPU memory requirements. Note that MoE models require less memory despite their large total parameter counts, because fewer parameters are active at any given time.
| Model | Total Params | Active | Architecture | Q4 VRAM Est. | CPU Only | RTX 5090 (32GB) | H200 (141GB) | B200 (192GB) |
|---|---|---|---|---|---|---|---|---|
| Phi-4-mini | 3.8B | 3.8B | Dense | 2.5 GB | ✅ Fast | ✅ | ✅ | ✅ |
| Gemma 3 4B | 4B | 4B | Dense | 3 GB | ✅ Practical | ✅ | ✅ | ✅ |
| Gemma 3n E4B | 4B | 2B | MatFormer | 3 GB | ✅ Mobile-ready | ✅ | ✅ | ✅ |
| Qwen 3 7B | 7B | 7B | Dense | 5 GB | △ Slow | ✅ Comfortable | ✅ | ✅ |
| Phi-4 | 14B | 14B | Dense | 8 GB | △ Slow | ✅ Comfortable | ✅ | ✅ |
| GPT OSS 20B | 21B | 3.6B | MoE | 13 GB | △ | ✅ Comfortable | ✅ | ✅ |
| Gemma 3 27B | 27B | 27B | Dense | 16 GB | ✗ | ✅ Comfortable | ✅ | ✅ |
| Llama 4 Scout | 109B | 17B | MoE (16E/2A) | 55 GB | ✗ | ✗ | ✅ | ✅ |
| GPT OSS 120B | 117B | 5.1B | MoE | 61 GB | ✗ | ✗ | ✅ (MXFP4) | ✅ |
| Llama 4 Maverick | 400B | 17B | MoE (128E) | 200 GB+ | ✗ | ✗ | ✗ | ✅ (Q4) |
GPU Spec Comparison (NVIDIA official): The RTX 5090 features 32GB GDDR7 with 1,792 GB/s bandwidth (Blackwell architecture). The H200 features 141GB HBM3e with 4.8 TB/s bandwidth. The B200 features 192GB HBM3e with 8 TB/s bandwidth. Memory bandwidth is a particularly critical metric for memory-bound LLM inference.
How to read this table: The "Q4 VRAM Est." for MoE models includes the weights of all experts. Even though fewer parameters are active at once, all expert weights must be loaded into memory.
The efficiency of GPT OSS 20B is worth highlighting. Of its 21B total parameters, only 3.6B are active. With MXFP4 quantization, it fits within approximately 13GB, allowing it to run comfortably within the RTX 5090's 32GB (and even the RTX 4090's 24GB). Despite this, it achieves scores on official benchmarks that rival OpenAI o3-mini.
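The rule of thumb behind the "Q4 VRAM Est." column can be sketched in a few lines. Note that the 20% overhead factor for runtime buffers is an assumption, and actual usage varies by framework and format (MXFP4's selective application to expert weights, for example, yields a lower footprint than a uniform 4-bit estimate would suggest).

```python
# Rough VRAM estimator behind the "Q4 VRAM Est." column.
# The 1.2x overhead factor is an assumption for runtime buffers
# (dequantization scratch space, activations); real usage varies by framework.

def vram_estimate_gb(total_params_b: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Estimate VRAM for a model with `total_params_b` billion parameters
    quantized to `bits` bits per weight. For MoE models, use TOTAL
    parameters, not active ones: every expert's weights must be loaded."""
    weight_gb = total_params_b * 1e9 * (bits / 8) / 1e9  # GB of weight data
    return weight_gb * overhead

# GPT OSS 20B: 21B total params at ~4 bits
print(round(vram_estimate_gb(21), 1))   # ≈ 12.6 GB, close to the 13 GB in the table
# Gemma 3 27B (dense)
print(round(vram_estimate_gb(27), 1))   # ≈ 16.2 GB, close to the 16 GB in the table
```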
For maximum accuracy on a single RTX 5090, GPT OSS 20B (MXFP4) is the top candidate. With a 78% improvement in bandwidth over the RTX 4090, meaningful gains in inference throughput can also be expected. The 32GB VRAM also leaves enough headroom to run Gemma 3 27B (Q4, 16GB) comfortably.
On a single H200, both GPT OSS 120B and Llama 4 Scout become viable options. GPT OSS 120B requires approximately 61GB with MXFP4, while Llama 4 Scout requires approximately 55GB with int4, plus KV cache. With 141GB of VRAM, there is ample room for KV cache even with long contexts (128K tokens). For text-focused inference, GPT OSS 120B is the better choice; for multimodal use cases requiring image understanding, Llama 4 Scout is preferable.
On a B200, even Llama 4 Maverick (400B total parameters) comes into range. Its 192GB of HBM3e and 8 TB/s bandwidth enable single-GPU inference with Q4 quantization. NVIDIA officially claims up to 15× inference performance improvement over the H100.
Quantization is a technique that compresses model weights to low-bit formats (4-bit / 8-bit), reducing the required VRAM to roughly one half (8-bit) or one quarter (4-bit) of the FP16 baseline. With the emergence of GPT OSS, MXFP4 has been added as a new option alongside conventional quantization formats.
Regarding the relationship between quantization levels and accuracy, community validation reports have been accumulating. As a general trend, Q8 (8-bit) shows almost no detectable accuracy degradation compared to FP16, Q4 (4-bit) exhibits minor degradation depending on the task, and Q2 (2-bit) has been reported to cause clear quality loss. However, since the degree of degradation varies by model, task, and language, validation against your own use case is essential.
The practical recommendation is Q4 or higher. If the model fits in memory at Q4, running at Q4 is the efficient approach—only upgrading to Q8 if quality issues arise.
| Quantization Format | Characteristics | Recommended Use |
|---|---|---|
| GGUF | Designed for llama.cpp. Supports mixed CPU / GPU inference | CPU only to single GPU |
| GPTQ | GPU-optimized. Strong for batch inference | Servers handling multiple requests |
| AWQ | Tends to be faster than GPTQ with less accuracy degradation | GPU servers where speed is a priority |
| MXFP4 | 4-bit format applied only to MoE weights. Optimized for Blackwell / Hopper GPUs | GPT OSS / H200, B200, RTX 5090 environments |
MXFP4 is the quantization format adopted by OpenAI for GPT OSS, selectively applied to the expert weights of MoE. It delivers optimal throughput on Blackwell architecture (B200 / RTX 5090) and Hopper architecture (H100 / H200).
The benefits of quantization extend beyond VRAM reduction. The memory bandwidths of 1,792 GB/s for the RTX 5090, 4.8 TB/s for the H200, and 8 TB/s for the B200 create a synergistic effect with the reduction in weight size achieved through quantization. Even with the same model, switching to Q4 halves the amount of weight data transferred, alleviating bandwidth bottlenecks and improving token generation speed.
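The bandwidth argument can be made concrete with a back-of-envelope calculation: in single-stream decoding, each generated token requires streaming the active weights through the GPU at least once, so bandwidth divided by active weight size gives a theoretical ceiling on tokens per second. This ignores KV-cache reads, compute, and kernel overhead, so treat it strictly as an upper bound, not a predicted throughput.

```python
# Back-of-envelope upper bound on single-stream decode speed for a
# memory-bound model: every output token requires reading the active
# weights once. Ignores KV-cache traffic, compute, and kernel overhead,
# so real throughput will be meaningfully lower.

def decode_tokens_per_sec(active_params_b: float, bits: int, bandwidth_gb_s: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bits / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# GPT OSS 20B (3.6B active, 4-bit) on an RTX 5090 (1,792 GB/s)
print(round(decode_tokens_per_sec(3.6, 4, 1792)))   # ceiling ≈ 996 tok/s
# Dense Gemma 3 27B (27B active, 4-bit) on the same card
print(round(decode_tokens_per_sec(27, 4, 1792)))    # ceiling ≈ 133 tok/s
```

The gap between the two ceilings illustrates why MoE models with few active parameters decode so much faster than dense models of similar total size on the same hardware.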
Simply downloading a model's weight files is not enough to run inference (text generation). An inference framework is responsible for the entire pipeline: loading the model onto the GPU, applying the tokenizer, managing the KV cache, and sampling output tokens. The choice of framework does not affect model accuracy, but it directly impacts inference speed, concurrency, and operational ease.
llama.cpp is a C/C++ inference engine that supports a wide range of environments, from CPU-only to GPU. Its strengths are its ability to load GGUF-format models directly and its minimal dependencies: no Python environment is required, and CPU-only inference runs without CUDA drivers. A GGUF-converted version of GPT OSS is also available and can be launched with `llama-server -hf ggml-org/gpt-oss-120b-GGUF`. It is well-suited for environments where setting up a Python environment is difficult, such as Windows PCs or edge devices.
vLLM is characterized by its highly efficient memory management via PagedAttention. In standard inference, the KV cache allocates contiguous memory per request, causing VRAM to become constrained as the number of concurrent requests grows. vLLM manages the cache in page-sized units, preventing memory fragmentation and enabling more requests to be processed in parallel with the same VRAM. GPT OSS officially supports vLLM and can be served with `vllm serve openai/gpt-oss-120b --tensor-parallel-size 2`. It is the top candidate when providing an internal API server to multiple departments.
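As a sketch of what an internal client for such a vLLM server could look like, the following builds an OpenAI-compatible chat request using only the standard library. The localhost URL and port are assumptions based on vLLM's defaults; adjust them to your deployment.

```python
# Minimal client sketch for a vLLM server started with
# `vllm serve openai/gpt-oss-120b`. vLLM exposes an OpenAI-compatible
# endpoint (by default at http://localhost:8000/v1).
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,  # low temperature suits deterministic business tasks
    }

def post_chat(base_url: str, body: dict) -> dict:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

body = build_chat_request("openai/gpt-oss-120b", "Summarize this loan application: ...")
# post_chat("http://localhost:8000", body)  # uncomment against a running server
```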
Ollama is a CLI tool that wraps llama.cpp, completing everything from model download to launch with a single line: `ollama run phi4`. Model management (version switching, deletion) is also handled entirely through the CLI, making it ideal for the "just get it running" stage of PoC and prototyping. However, since it has no built-in request queuing or monitoring mechanisms, migration to vLLM or a similar solution will be necessary for production environments.
| Aspect | llama.cpp | vLLM | Ollama |
|---|---|---|---|
| Primary use case | Edge / PC / CPU inference | Internal API server | PoC / Prototyping |
| Concurrent requests | Limited | Highly efficient via PagedAttention | Suited for single requests |
| Setup | Compile or binary | pip install | One command |
| GPU requirement | None (CPU supported) | CUDA required | None (CPU supported) |
| Production use | △ (monitoring is DIY) | ✅ (OpenAI-compatible API) | ✗ (migration recommended) |

"I understand it runs locally, but is the accuracy acceptable?" — This is the most common question that comes up when considering adoption. The release of GPT OSS has significantly changed the answer to this question.
The following is a comparison of publicly available scores for local models and cloud APIs across major benchmarks. MMLU, HumanEval, and MATH scores are based on official figures from each model's release or third-party reproduction studies.
| Task (Benchmark) | GPT OSS 20B | Phi-4 14B | GPT OSS 120B | Cloud API (Large) | Source |
|---|---|---|---|---|---|
| MMLU (General Knowledge) | — | 78–82 | 87.2 | 88–92 | OpenAI Model Card / Official sources |
| Code Generation (HumanEval) | — | 80–86 | 89.4 (balanced) / 92.1 (deep) | 88–94 | OpenAI Model Card / Official sources |
| Math (MATH) | — | 68–75 | 78.6 (balanced) / 84.3 (deep) | 82–92 | OpenAI Model Card / Official sources |
| Tool Calling (TauBench) | o3-mini equivalent | — | Exceeds o3-mini / o4-mini equivalent | — | OpenAI Official Blog |
Note: Individual benchmark scores for GPT OSS 20B are summarized in the official model card as "comparable to o3-mini," but task-specific detailed scores are partially undisclosed. The "o3-mini equivalent" designation above is based on OpenAI's official comparison.
For text classification, short-form summarization, structured data extraction, and multilingual translation, no standardized public benchmarks have been established, making in-house validation on proprietary data essential for comparing model accuracy. As a general trend, tasks with short, formulaic inputs and outputs (such as classification and extraction) show smaller performance gaps between SLMs and cloud APIs, while tasks requiring long contexts or complex reasoning tend to favor cloud APIs.
The emergence of GPT OSS has expanded the areas where local models approach cloud APIs on public benchmarks.
Benchmark areas where local models are strong now include code generation (HumanEval) and tool calling (TauBench). GPT OSS 120B surpasses o3-mini on TauBench and records a score comparable to o4-mini (OpenAI official blog). GPT OSS 20B is also considered on par with o3-mini.
Areas where cloud APIs still hold an advantage are complex reasoning and mathematics (MATH benchmark) and multilingual tasks. Even with GPT OSS 120B's "deep mode" (a mode that takes more time for reasoning), a gap with large cloud models remains.
However, benchmark scores and real-world task accuracy do not necessarily align. Benchmarks are measured using standardized prompts and evaluation criteria, whereas real-world tasks differ in input data quality, domain-specific vocabulary, and output format requirements. While benchmarks should serve as a reference for adoption decisions, PoC validation using your own data is essential.
If general-purpose accuracy is insufficient, it can be improved by specializing the model on your own data using PEFT (Parameter-Efficient Fine-Tuning).
From a licensing perspective, GPT OSS uses Apache 2.0, which places no restrictions on fine-tuning or commercial use. Llama 4 Scout, on the other hand, uses the Llama 4 Community License, which requires a separate license for companies with more than 700 million monthly active users. For most organizations this is not a practical constraint, but it is worth noting that the terms differ from Apache 2.0.
Among PEFT methods, LoRA and QLoRA are the most widely used in practice.
LoRA is a technique that adds low-rank update matrices to the model's weight matrices, training only approximately 0.1–1% of the total parameters as additional parameters. It enables LoRA fine-tuning of Phi-4 14B on a single RTX 4090.
QLoRA combines LoRA with quantization to further halve the required VRAM. Since it allows fine-tuning of 14B models even on 16 GB-class GPUs, it is a strong option when looking to minimize upfront investment.
A typical customization workflow is as follows: (1) validate the base model's out-of-the-box accuracy on the target task; (2) prepare instruction–response pairs from in-house data; (3) fine-tune with QLoRA; (4) evaluate against a held-out test set; and (5) serve the tuned model with vLLM.
The degree of accuracy improvement achieved through domain-specific fine-tuning depends heavily on the base model's general-purpose performance, the quality and quantity of training data, and the nature of the task. It is recommended to first confirm the effect through a small-scale pilot (100–200 samples) before proceeding with full-scale data preparation.
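The "0.1–1% of parameters" figure for LoRA can be sanity-checked with simple arithmetic: a rank-r adapter on a d×k weight matrix adds two small matrices, A (r×k) and B (d×r), i.e. r·(d+k) trainable parameters. The layer dimensions below are illustrative, not any specific model's actual shapes.

```python
# Sanity check on the "~0.1-1% of parameters trained" claim for LoRA.
# A rank-r adapter on a (d x k) weight matrix adds A (r x k) and
# B (d x r), i.e. r * (d + k) trainable parameters per adapted matrix.
# The hidden size below is hypothetical, not Phi-4's actual shape.

def lora_params(d: int, k: int, r: int) -> int:
    return r * (d + k)

d = k = 5120          # hypothetical hidden size of a 14B-class model
r = 16                # commonly used LoRA rank
full = d * k          # parameters of the frozen base matrix
added = lora_params(d, k, r)
print(f"{added / full:.2%} of the base matrix's parameters")  # well under 1%
```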

Even if local execution proves viable in terms of accuracy, the business case won't hold up if the costs don't make sense. Here we estimate the break-even point from both the initial investment and monthly operating cost perspectives.
| Cost Item | Cloud API | SLM (RTX 5090) | Local LLM (H200) |
|---|---|---|---|
| Initial Investment | ¥0 | ¥600K–¥1M (reference) | ¥5M–¥8M (reference) |
| Monthly API / Electricity | Proportional to token volume | Electricity ¥10K–¥30K | Electricity ¥50K–¥100K |
| Operational Labor | Nearly zero | ML engineer (can be part-time) | ML engineer (dedicated recommended) |
| Model Updates | Automatic | Manual (approx. once per quarter) | Manual |
| Operable Models | — | GPT OSS 20B, Phi-4, Gemma 3 27B | GPT OSS 120B, Llama 4 Scout |
Note: Hardware prices are subject to significant fluctuation depending on market conditions. The RTX 5090 has been reported trading well above its MSRP ($1,999) due to DRAM shortages. Be sure to check the latest market prices when considering adoption.
The cost-effectiveness of the RTX 5090 is drawing attention. The fact that GPT OSS 20B (equivalent to o3-mini according to official benchmarks) can run on a 32GB consumer GPU is a factor lowering the break-even point for local AI.
Cloud API pricing varies significantly by provider, model, and contract type, and is subject to frequent revision. Always check the latest pricing pages from each provider when considering adoption. As a general trend, the higher the monthly token volume, the more advantageous the fixed-cost model of local execution becomes.
For an RTX 5090 server (assuming an initial investment of ¥800,000–¥1,000,000 and monthly operating costs of ¥30,000–¥50,000; prices are for reference only), the break-even point against cloud APIs depends on monthly token volume. Compared to the RTX 4090, the RTX 5090 increases VRAM from 24GB to 32GB and improves memory bandwidth by 78%, enabling smoother operation of MoE models such as GPT OSS 20B.
| Monthly Token Volume | Conditions Favoring Local Deployment |
|---|---|
| Under 1 million tokens | Cloud API pay-as-you-go pricing is often cheaper |
| 5 million–20 million tokens | Monthly cloud API costs begin to exceed local fixed costs. Payback period ranges from a few months to 2 years, depending on API pricing |
| Over 20 million tokens | The local fixed-cost model holds a clear advantage. The cost per token decreases as processing volume increases |
A key assumption underlying this estimate is that the monthly operating cost of a local SLM is largely independent of processing volume. Because a GPU server follows a fixed-cost model, the cost per token decreases as processing volume increases. Conversely, for light usage with low monthly token volumes, cloud API pay-as-you-go pricing will typically be cheaper.
While the RTX 5090 requires a higher initial investment than the RTX 4090, its 32GB of VRAM allows Gemma 3 27B (Q4, 16GB) to run with headroom to spare, broadening the range of available models. Whether the additional investment is justified should be evaluated based on the required model size and processing volume.
The specific break-even point varies significantly depending on the cloud API provider's model, pricing structure, and contract terms. When evaluating adoption, it is recommended to measure your organization's monthly token volume over one to two weeks and then compare cloud API quotes against the fixed costs of local execution.
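That comparison can be packaged as a small payback calculator. All figures below are illustrative placeholders; substitute your own hardware quote, measured electricity and labor costs, and actual cloud bill.

```python
# Break-even sketch comparing cloud pay-as-you-go against a local
# fixed-cost GPU server. All prices are illustrative placeholders.

def payback_months(hw_cost: float, local_monthly: float,
                   cloud_monthly: float) -> float:
    """Months until cumulative cloud spend exceeds the local server's
    upfront cost plus its running costs. Returns inf if cloud is cheaper."""
    saving = cloud_monthly - local_monthly
    if saving <= 0:
        return float("inf")
    return hw_cost / saving

# Hypothetical: a ¥900,000 RTX 5090 server costing ¥40,000/month to run,
# vs a cloud bill of ¥200,000/month at the current token volume.
print(round(payback_months(900_000, 40_000, 200_000), 1))  # → 5.6 months
# At a light ¥30,000/month cloud bill, local never pays back.
print(payback_months(900_000, 40_000, 30_000))             # → inf
```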

Here are three failure patterns I've repeatedly observed in local LLM / SLM deployments. Each is technically addressable, but knowing them in advance will help you avoid unnecessary detours.
Failure pattern 1: Excessive quantization. Attempting to force a 70B model onto undersized hardware can degrade quality through aggressive quantization (Q2). In many cases, running a 14B model at Q4 yields higher accuracy than running a 70B model at Q2. Using a model sized to fit your hardware at Q4 or higher ultimately optimizes both accuracy and cost.
Referring back to the model selection table by GPU requirements mentioned earlier, the practical answer is to choose the largest model that can run at Q4 or higher within your own GPU memory.
Failure pattern 2: Underestimating KV cache memory. Another often-overlooked factor is memory consumption from the KV cache during inference. Even if the model itself is 8GB, processing long prompts (exceeding 4,000 tokens) can consume an additional 2–4GB for the KV cache. Even on a 24GB RTX 4090, a 14B model (Q4, 8GB) plus KV cache plus the OS's share of VRAM can effectively exhaust 18–20GB.
There are two workarounds. The first is to limit the context length to the minimum required for your use case (if 4,096 is sufficient, don't set it to 8,192). The second is to use vLLM's PagedAttention to manage the KV cache efficiently.
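The KV cache figure cited above can be estimated from the model's architecture: two tensors (K and V) per layer, each n_kv_heads × head_dim per token, at the cache's dtype precision. The dimensions below describe a hypothetical 14B-class dense model without grouped-query attention; GQA models divide the result by the head-group factor.

```python
# KV-cache size estimate: 2 tensors (K and V) per layer, each
# [n_kv_heads x head_dim] per token, at `dtype_bytes` precision.
# Dimensions below are a hypothetical 14B-class dense model WITHOUT
# grouped-query attention; GQA shrinks this by the head-group factor.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, dtype_bytes: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes / 1e9

# 40 layers, 40 KV heads, head_dim 128, 4,096-token context, fp16 cache
print(round(kv_cache_gb(40, 40, 128, 4096), 2))  # ≈ 3.36 GB per sequence
```

The result lands squarely in the 2–4GB range mentioned above, and it scales linearly with context length, which is why capping the context is the first workaround.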
Failure pattern 3: Neglecting updates after the PoC. Even when the initial PoC succeeds, systems are sometimes left unattended during the operational phase. Open-weight models receive new versions every few months. As seen with Phi-3 → Phi-4 and Gemma 2 → Gemma 3, accuracy improves significantly with each new generation, even at the same parameter count.
As a minimum operational framework, it is recommended to incorporate quarterly model update evaluations along with regular monitoring of inference speed and error rates. Even without a dedicated ML engineer, this is work that an IT administrator can automate with scripts.

If data can be sent externally, cloud APIs are the more convenient option. GPT OSS should be chosen when there are data sovereignty constraints, when API costs need to be converted to fixed expenses, or when operation in an offline environment is required. According to OpenAI's official benchmarks, GPT OSS 20B achieves scores comparable to o3-mini on MMLU, HumanEval, TauBench, and other benchmarks, while the 120B model achieves scores comparable to o4-mini. However, since benchmark scores do not guarantee accuracy in real-world business tasks, PoC validation on your own use cases is essential.
Can RAG over internal documents be built with an SLM? Yes. RAG (Retrieval-Augmented Generation) is a technique that embeds documents retrieved via search into prompts, so it is not dependent on model size. By combining GPT OSS 20B or Phi-4 with a vector DB (such as Qdrant or pgvector), you can build a pipeline for internal document search and answer generation. Since GPT OSS supports a context length of 128K tokens and Llama 4 Scout supports up to 10M tokens, there is plenty of flexibility when designing chunk sizes for RAG.
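As a minimal illustration of the chunking step in such a pipeline, the following character-based overlapping chunker is a simplified sketch: the sizes are arbitrary, and production pipelines usually split on token or sentence boundaries instead of raw characters.

```python
# Minimal overlapping chunker for RAG preprocessing. Splits by characters
# for simplicity; chunk_size and overlap are illustrative values to tune
# against your embedding model and the LLM's context window.

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    # max(..., 1) ensures short texts still yield a single chunk
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("A" * 2000, chunk_size=800, overlap=100)
print(len(chunks), [len(c) for c in chunks])  # → 3 [800, 800, 600]
```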
Depending on the hardware, local deployment can sometimes achieve lower latency for single requests, as cloud APIs incur additional overhead from network round trips and queue wait times. For GPT OSS 20B on an RTX 4090, token generation may begin faster than with a Dense 14B model, since MoE active parameters are as few as 3.6B. However, as the number of concurrent requests increases, the GPU becomes a bottleneck, making it necessary to design for throughput using batch inference frameworks such as vLLM. Since actual latency varies significantly depending on the model, quantization, prompt length, and batch size, measurement in your own environment is recommended rather than relying on benchmark values.
Production deployment experience is growing, as evidenced by case reports from the open-source community and cloud providers. There are two main points to keep in mind when deploying a QLoRA fine-tuned SLM into production. The first is training data quality control — training on noisy data can cause "catastrophic forgetting," where general performance degrades. The second is version management of LoRA adapters, as adapter retraining becomes necessary when the base model is updated. Technical details on fine-tuning are covered in the introductory article on PEFT.

The decision to adopt a local LLM / SLM comes down to three factors: "data sovereignty requirements," "monthly token volume," and "task complexity." The emergence of GPT OSS has made this decision even simpler. If there are constraints preventing data from being sent to the cloud and you need to process 5 million or more tokens per month, GPT OSS 20B (RTX 4090) or GPT OSS 120B (H100) can deliver accuracy equivalent to the cloud API's o3-mini / o4-mini locally.
Start small by validating with Ollama + GPT OSS 20B, and once you are satisfied with the accuracy, fine-tune it on your proprietary data using QLoRA, then productionize it with vLLM. Following these steps allows you to minimize initial investment while achieving both data sovereignty and cost optimization.
Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).