
Open-weight language models, led by GPT OSS, which OpenAI released under the Apache 2.0 license, are beginning to record benchmark scores comparable to cloud APIs on specific business tasks. In particular, SLMs (Small Language Models) are closing the accuracy gap with cloud APIs on routine tasks such as classification, summarization, and structured data extraction, and are attracting attention as an option that achieves both data sovereignty and cost optimization. This article is aimed at IT and security professionals in the finance, healthcare, and manufacturing sectors, where restrictions on sending data to the cloud apply, and compares the trade-offs between local LLMs/SLMs (including GPT OSS) and cloud APIs across three axes: GPU requirements, per-task accuracy, and TCO. By the end, you should have what you need to determine the optimal architecture and model combination for your organization.

There are three main ways to integrate AI language models into business operations: cloud APIs, local LLMs, and SLMs. These three can be organized along two axes: "where the model runs" and "the scale of the model."
Cloud APIs are an approach that sends HTTP requests to hosted models provided by OpenAI, Anthropic, and Google. While infrastructure management is unnecessary, input data passes through the provider's servers.
Local LLMs refer to the approach of running large language models on your own servers or workstations. Large models such as Llama 4 Maverick (400B total parameters, 17B active) and GPT OSS 120B (117B total parameters, 5.1B active) fall into this category. Recent large models adopt a MoE (Mixture of Experts) architecture, which significantly reduces active parameters relative to total parameter count, enabling operation with fewer GPUs than previously required.
SLMs (Small Language Models) are lightweight models with parameter counts kept to a few billion or fewer. Representative examples include GPT OSS 20B (21B total parameters, 3.6B active), Phi-4 (14B), Gemma 3 (4B / 12B / 27B), Qwen 3 (7B), and Llama 4 Scout (109B total parameters, 17B active). A key characteristic is that "small" does not mean "low performance"—through MoE and training on high-quality data, these models can achieve accuracy on specific tasks approaching that of larger models.
GPT OSS in particular is the first open-weight model family released by OpenAI under the Apache 2.0 license, surpassing 9 million downloads on HuggingFace within weeks of its release. It shares the same tokenizer as the cloud API GPT series and supports tool calling and Chain-of-Thought reasoning, resulting in a low barrier to migration from existing GPT-based workflows.
The primary reason cloud APIs are passed over in financial and medical settings is the issue of Data Sovereignty. It is not uncommon for internal policies or industry guidelines to explicitly prohibit sending patient medical records or customer transaction data to the cloud.
| Item | Cloud API | Local LLM | SLM (Local) |
|---|---|---|---|
| Data location | Provider's servers | On-premises servers | On-premises servers / Edge |
| Network requirements | Internet required | Intranet possible | Offline capable |
| Regulatory compliance | Ensured via contracts / DPA | Fully self-managed | Fully self-managed |
| Audit trail | Provider-dependent | Full in-house control | Full in-house control |
Both local LLMs and SLMs ensure that data never leaves the organization's own environment. The difference lies in model size and required resources. The next section outlines the criteria for deciding which option to choose.

Local execution is not a silver bullet. In order to objectively assess the return on investment, it is worth distinguishing between cases where adoption is strongly recommended and cases where the cloud is sufficient.
The impact of deploying local LLMs/SLMs is most visible in the following three patterns. Since specific figures vary greatly depending on the organization and task, this section focuses on the structure of each pattern.
Pattern 1: Classification and Summarization of Confidential Data (Finance). Organizations may want to use AI to summarize loan screening documents, but cannot send customer financial information to the cloud. By deploying an SLM on an internal server, the summarization of screening documents can be automated without data leaving the premises. The degree of time savings compared to manual processing depends on document complexity and model accuracy tuning.
Pattern 2: Manufacturing Line Log Analysis (Manufacturing). An SLM deployed on a factory edge server analyzes equipment logs to detect anomaly patterns. Routing requests through a cloud API introduces network round-trip latency, whereas local execution eliminates that overhead. However, since SLM inference itself takes anywhere from tens to hundreds of milliseconds, the definition of "real-time" must be validated against operational requirements.
Pattern 3: Structuring Electronic Medical Records (Healthcare). One approach involves using an SLM to convert free-text clinical notes into SOAP format. Because patient data never leaves the internal hospital network, compliance with the Act on the Protection of Personal Information and medical information guidelines is easier to maintain. However, since the healthcare domain falls under YMYL (Your Money or Your Life), a human review process for model outputs is essential.
On the other hand, cloud APIs are the more rational choice when there are no restrictions on sending data externally, when monthly token volume is low, or when tasks require the complex long-context reasoning where large cloud models still lead. Conversely, when there is data that cannot be sent to the cloud and monthly token volume is substantial to some degree, the return on investment for local execution starts to make sense.

The first stumbling block in local execution is figuring out "which models can run on which hardware." With the widespread adoption of MoE architectures, total parameter count alone is no longer sufficient to determine memory requirements. The goal here is to work backwards from active parameter count and quantization level to enable model selection that fits within your budget.
The table below organizes the major open-weight models by GPU memory requirements. Note that MoE models require less memory despite their large total parameter counts, because fewer parameters are active at any given time.
| Model | Total Params | Active | Architecture | Q4 VRAM Est. | CPU Only | RTX 5090 (32GB) | H200 (141GB) | B200 (192GB) |
|---|---|---|---|---|---|---|---|---|
| Phi-4-mini | 3.8B | 3.8B | Dense | 2.5 GB | ✅ Fast | ✅ | ✅ | ✅ |
| Gemma 3 4B | 4B | 4B | Dense | 3 GB | ✅ Practical | ✅ | ✅ | ✅ |
| Gemma 3n E4B | 4B | 2B | MatFormer | 3 GB | ✅ Mobile-ready | ✅ | ✅ | ✅ |
| Qwen 3 7B | 7B | 7B | Dense | 5 GB | △ Slow | ✅ Comfortable | ✅ | ✅ |
| Phi-4 | 14B | 14B | Dense | 8 GB | △ Slow | ✅ Comfortable | ✅ | ✅ |
| GPT OSS 20B | 21B | 3.6B | MoE | 13 GB | △ | ✅ Comfortable | ✅ | ✅ |
| Gemma 3 27B | 27B | 27B | Dense | 16 GB | ✗ | ✅ Comfortable | ✅ | ✅ |
| Llama 4 Scout | 109B | 17B | MoE (16E/2A) | 55 GB | ✗ | ✗ | ✅ | ✅ |
| GPT OSS 120B | 117B | 5.1B | MoE | 61 GB | ✗ | ✗ | ✅ (MXFP4) | ✅ |
| Llama 4 Maverick | 400B | 17B | MoE (128E) | 200 GB+ | ✗ | ✗ | ✗ | ✅ (Q4) |
GPU Spec Comparison (NVIDIA official): The RTX 5090 features 32GB GDDR7 with 1,792 GB/s bandwidth (Blackwell architecture). The H200 features 141GB HBM3e with 4.8 TB/s bandwidth. The B200 features 192GB HBM3e with 8 TB/s bandwidth. Memory bandwidth is a particularly critical metric for memory-bound LLM inference.
How to read this table: The "Q4 VRAM Est." for MoE models includes the weights of all experts. Even though fewer parameters are active at once, all expert weights must be loaded into memory.
The efficiency of GPT OSS 20B is worth highlighting. Of its 21B total parameters, only 3.6B are active. With MXFP4 quantization, it fits within approximately 13GB, allowing it to run comfortably within the RTX 5090's 32GB (and even the RTX 4090's 24GB). Despite this, it achieves scores on official benchmarks that rival OpenAI o3-mini.
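The rule of thumb behind the "Q4 VRAM Est." column can be sketched in a few lines. Note that the 20% overhead factor for runtime buffers is an assumption, and actual usage varies by framework and format (MXFP4's selective application to expert weights, for example, yields a lower footprint than a uniform 4-bit estimate would suggest).

```python
# Rough VRAM estimator behind the "Q4 VRAM Est." column.
# The 1.2x overhead factor is an assumption for runtime buffers
# (dequantization scratch space, activations); real usage varies by framework.

def vram_estimate_gb(total_params_b: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Estimate VRAM for a model with `total_params_b` billion parameters
    quantized to `bits` bits per weight. For MoE models, use TOTAL
    parameters, not active ones: every expert's weights must be loaded."""
    weight_gb = total_params_b * 1e9 * (bits / 8) / 1e9  # GB of weight data
    return weight_gb * overhead

# GPT OSS 20B: 21B total params at ~4 bits
print(round(vram_estimate_gb(21), 1))   # ≈ 12.6 GB, close to the 13 GB in the table
# Gemma 3 27B (dense)
print(round(vram_estimate_gb(27), 1))   # ≈ 16.2 GB, close to the 16 GB in the table
```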
For maximum accuracy on a single RTX 5090, GPT OSS 20B (MXFP4) is the top candidate. With a 78% improvement in bandwidth over the RTX 4090, meaningful gains in inference throughput can also be expected. The 32GB VRAM also leaves enough headroom to run Gemma 3 27B (Q4, 16GB) comfortably.
On a single H200, both GPT OSS 120B and Llama 4 Scout become viable options. GPT OSS 120B requires approximately 61GB with MXFP4, while Llama 4 Scout requires approximately 55GB with int4, plus KV cache. With 141GB of VRAM, there is ample room for KV cache even with long contexts (128K tokens). For text-focused inference, GPT OSS 120B is the better choice; for multimodal use cases requiring image understanding, Llama 4 Scout is preferable.
On a B200, even Llama 4 Maverick (400B total parameters) comes into range. Its 192GB of HBM3e and 8 TB/s bandwidth enable single-GPU inference with Q4 quantization. NVIDIA officially claims up to 15× inference performance improvement over the H100.
Quantization is a technique that compresses model weights to low-bit formats (4-bit / 8-bit), reducing the required VRAM to roughly one half (8-bit) or one quarter (4-bit) of the FP16 baseline. With the emergence of GPT OSS, MXFP4 has been added as a new option alongside conventional quantization formats.
Regarding the relationship between quantization levels and accuracy, community validation reports have been accumulating. As a general trend, Q8 (8-bit) shows almost no detectable accuracy degradation compared to FP16, Q4 (4-bit) exhibits minor degradation depending on the task, and Q2 (2-bit) has been reported to cause clear quality loss. However, since the degree of degradation varies by model, task, and language, validation against your own use case is essential.
The practical recommendation is Q4 or higher. If the model fits in memory at Q4, running at Q4 is the efficient approach—only upgrading to Q8 if quality issues arise.
| Quantization Format | Characteristics | Recommended Use |
|---|---|---|
| GGUF | Designed for llama.cpp. Supports mixed CPU / GPU inference | CPU only to single GPU |
| GPTQ | GPU-optimized. Strong for batch inference | Servers handling multiple requests |
| AWQ | Tends to be faster than GPTQ with less accuracy degradation | GPU servers where speed is a priority |
| MXFP4 | 4-bit format applied only to MoE weights. Optimized for Blackwell / Hopper GPUs | GPT OSS / H200, B200, RTX 5090 environments |
MXFP4 is the quantization format adopted by OpenAI for GPT OSS, selectively applied to the expert weights of MoE. It delivers optimal throughput on Blackwell architecture (B200 / RTX 5090) and Hopper architecture (H100 / H200).
The benefits of quantization extend beyond VRAM reduction. The memory bandwidths of 1,792 GB/s for the RTX 5090, 4.8 TB/s for the H200, and 8 TB/s for the B200 create a synergistic effect with the reduction in weight size achieved through quantization. Even with the same model, switching to Q4 halves the amount of weight data transferred, alleviating bandwidth bottlenecks and improving token generation speed.
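The bandwidth argument can be made concrete with a back-of-envelope calculation: in single-stream decoding, each generated token requires streaming the active weights through the GPU at least once, so bandwidth divided by active weight size gives a theoretical ceiling on tokens per second. This ignores KV-cache reads, compute, and kernel overhead, so treat it strictly as an upper bound, not a predicted throughput.

```python
# Back-of-envelope upper bound on single-stream decode speed for a
# memory-bound model: every output token requires reading the active
# weights once. Ignores KV-cache traffic, compute, and kernel overhead,
# so real throughput will be meaningfully lower.

def decode_tokens_per_sec(active_params_b: float, bits: int, bandwidth_gb_s: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bits / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# GPT OSS 20B (3.6B active, 4-bit) on an RTX 5090 (1,792 GB/s)
print(round(decode_tokens_per_sec(3.6, 4, 1792)))   # ceiling ≈ 996 tok/s
# Dense Gemma 3 27B (27B active, 4-bit) on the same card
print(round(decode_tokens_per_sec(27, 4, 1792)))    # ceiling ≈ 133 tok/s
```

The gap between the two ceilings illustrates why MoE models with few active parameters decode so much faster than dense models of similar total size on the same hardware.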
Simply downloading a model's weight files is not enough to run inference (text generation). An inference framework is responsible for the entire pipeline: loading the model onto the GPU, applying the tokenizer, managing the KV cache, and sampling output tokens. The choice of framework does not affect model accuracy, but it directly impacts inference speed, concurrency, and operational ease.
llama.cpp is a C/C++ inference engine that supports a wide range of environments, from CPU-only to GPU. Its strengths are its ability to load GGUF-format models directly and its minimal dependencies: no Python environment is required, and CPU-only inference runs without CUDA drivers. A GGUF-converted version of GPT OSS is also available and can be launched with `llama-server -hf ggml-org/gpt-oss-120b-GGUF`. It is well-suited for environments where setting up a Python environment is difficult, such as Windows PCs or edge devices.
vLLM is characterized by its highly efficient memory management via PagedAttention. In standard inference, the KV cache allocates contiguous memory per request, causing VRAM to become constrained as the number of concurrent requests grows. vLLM manages the cache in page-sized units, preventing memory fragmentation and enabling more requests to be processed in parallel with the same VRAM. GPT OSS officially supports vLLM and can be served with `vllm serve openai/gpt-oss-120b --tensor-parallel-size 2`. It is the top candidate when providing an internal API server to multiple departments.
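As a sketch of what an internal client for such a vLLM server could look like, the following builds an OpenAI-compatible chat request using only the standard library. The localhost URL and port are assumptions based on vLLM's defaults; adjust them to your deployment.

```python
# Minimal client sketch for a vLLM server started with
# `vllm serve openai/gpt-oss-120b`. vLLM exposes an OpenAI-compatible
# endpoint (by default at http://localhost:8000/v1).
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,  # low temperature suits deterministic business tasks
    }

def post_chat(base_url: str, body: dict) -> dict:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

body = build_chat_request("openai/gpt-oss-120b", "Summarize this loan application: ...")
# post_chat("http://localhost:8000", body)  # uncomment against a running server
```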
Ollama is a CLI tool that wraps llama.cpp, completing everything from model download to launch with a single line: `ollama run phi4`. Model management (version switching, deletion) is also handled entirely through the CLI, making it ideal for the "just get it running" stage of PoC and prototyping. However, since it has no built-in request queuing or monitoring mechanisms, migration to vLLM or a similar solution will be necessary for production environments.
| Aspect | llama.cpp | vLLM | Ollama |
|---|---|---|---|
| Primary use case | Edge / PC / CPU inference | Internal API server | PoC / Prototyping |
| Concurrent requests | Limited | Highly efficient via PagedAttention | Suited for single requests |
| Setup | Compile or binary | pip install | One command |
| GPU requirement | None (CPU supported) | CUDA required | None (CPU supported) |
| Production use | △ (monitoring is DIY) | ✅ (OpenAI-compatible API) | ✗ (migration recommended) |

"I understand it runs locally, but is the accuracy acceptable?" — This is the most common question that comes up when considering adoption. The release of GPT OSS has significantly changed the answer to this question.
The following is a comparison of publicly available scores for local models and cloud APIs across major benchmarks. MMLU, HumanEval, and MATH scores are based on official figures from each model's release or third-party reproduction studies.
| Task (Benchmark) | GPT OSS 20B | Phi-4 14B | GPT OSS 120B | Cloud API (Large) | Source |
|---|---|---|---|---|---|
| MMLU (General Knowledge) | — | 78–82 | 87.2 | 88–92 | OpenAI Model Card / Official sources |
| Code Generation (HumanEval) | — | 80–86 | 89.4 (balanced) / 92.1 (deep) | 88–94 | OpenAI Model Card / Official sources |
| Math (MATH) | — | 68–75 | 78.6 (balanced) / 84.3 (deep) | 82–92 | OpenAI Model Card / Official sources |
| Tool Calling (TauBench) | o3-mini equivalent | — | Exceeds o3-mini / o4-mini equivalent | — | OpenAI Official Blog |
Note: Individual benchmark scores for GPT OSS 20B are summarized in the official model card as "comparable to o3-mini," but task-specific detailed scores are partially undisclosed. The "o3-mini equivalent" designation above is based on OpenAI's official comparison.
For text classification, short-form summarization, structured data extraction, and multilingual translation, no standardized public benchmarks have been established, making in-house validation on proprietary data essential for comparing model accuracy. As a general trend, tasks with short, formulaic inputs and outputs (such as classification and extraction) show smaller performance gaps between SLMs and cloud APIs, while tasks requiring long contexts or complex reasoning tend to favor cloud APIs.
The emergence of GPT OSS has expanded the areas where local models approach cloud APIs on public benchmarks.
Benchmark areas where local models are strong now include code generation (HumanEval) and tool calling (TauBench). GPT OSS 120B surpasses o3-mini on TauBench and records a score comparable to o4-mini (OpenAI official blog). GPT OSS 20B is also considered on par with o3-mini.
Areas where cloud APIs still hold an advantage are complex reasoning and mathematics (MATH benchmark) and multilingual tasks. Even with GPT OSS 120B's "deep mode" (a mode that takes more time for reasoning), a gap with large cloud models remains.
However, benchmark scores and real-world task accuracy do not necessarily align. Benchmarks are measured using standardized prompts and evaluation criteria, whereas real-world tasks differ in input data quality, domain-specific vocabulary, and output format requirements. While benchmarks should serve as a reference for adoption decisions, PoC validation using your own data is essential.
If general-purpose accuracy is insufficient, it can be improved by specializing the model on your own data using PEFT (Parameter-Efficient Fine-Tuning).
From a licensing perspective, GPT OSS uses Apache 2.0, which places no restrictions on fine-tuning or commercial use. Llama 4 Scout, on the other hand, uses the Llama 4 Community License, which requires a separate license for companies with more than 700 million monthly active users. For most organizations this is not a practical constraint, but it is worth noting that the terms differ from Apache 2.0.
Among PEFT methods, LoRA and QLoRA are the most widely used in practice.
LoRA is a technique that adds low-rank update matrices to the model's weight matrices, training only approximately 0.1–1% of the total parameters as additional parameters. It enables LoRA fine-tuning of Phi-4 14B on a single RTX 4090.
QLoRA combines LoRA with quantization to further halve the required VRAM. Since it allows fine-tuning of 14B models even on 16 GB-class GPUs, it is a strong option when looking to minimize upfront investment.
A typical customization workflow is as follows: (1) validate the base model's out-of-the-box accuracy on the target task; (2) prepare instruction–response pairs from in-house data; (3) fine-tune with QLoRA; (4) evaluate against a held-out test set; and (5) serve the tuned model with vLLM.
The degree of accuracy improvement achieved through domain-specific fine-tuning depends heavily on the base model's general-purpose performance, the quality and quantity of training data, and the nature of the task. It is recommended to first confirm the effect through a small-scale pilot (100–200 samples) before proceeding with full-scale data preparation.
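The "0.1–1% of parameters" figure for LoRA can be sanity-checked with simple arithmetic: a rank-r adapter on a d×k weight matrix adds two small matrices, A (r×k) and B (d×r), i.e. r·(d+k) trainable parameters. The layer dimensions below are illustrative, not any specific model's actual shapes.

```python
# Sanity check on the "~0.1-1% of parameters trained" claim for LoRA.
# A rank-r adapter on a (d x k) weight matrix adds A (r x k) and
# B (d x r), i.e. r * (d + k) trainable parameters per adapted matrix.
# The hidden size below is hypothetical, not Phi-4's actual shape.

def lora_params(d: int, k: int, r: int) -> int:
    return r * (d + k)

d = k = 5120          # hypothetical hidden size of a 14B-class model
r = 16                # commonly used LoRA rank
full = d * k          # parameters of the frozen base matrix
added = lora_params(d, k, r)
print(f"{added / full:.2%} of the base matrix's parameters")  # well under 1%
```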

Even if local execution proves viable in terms of accuracy, the business case won't hold up if the costs don't make sense. Here we estimate the break-even point from both the initial investment and monthly operating cost perspectives.
| Cost Item | Cloud API | SLM (RTX 5090) | Local LLM (H200) |
|---|---|---|---|
| Initial Investment | ¥0 | ¥600K–¥1M (reference) | ¥5M–¥8M (reference) |
| Monthly API / Electricity | Proportional to token volume | Electricity ¥10K–¥30K | Electricity ¥50K–¥100K |
| Operational Labor | Nearly zero | ML engineer (can be part-time) | ML engineer (dedicated recommended) |
| Model Updates | Automatic | Manual (approx. once per quarter) | Manual |
| Operable Models | — | GPT OSS 20B, Phi-4, Gemma 3 27B | GPT OSS 120B, Llama 4 Scout |
Note: Hardware prices are subject to significant fluctuation depending on market conditions. The RTX 5090 has been reported trading well above its MSRP ($1,999) due to DRAM shortages. Be sure to check the latest market prices when considering adoption.
The cost-effectiveness of the RTX 5090 is drawing attention. The fact that GPT OSS 20B (equivalent to o3-mini according to official benchmarks) can run on a 32GB consumer GPU is a factor lowering the break-even point for local AI.
Cloud API pricing varies significantly by provider, model, and contract type, and is subject to frequent revision. Always check the latest pricing pages from each provider when considering adoption. As a general trend, the higher the monthly token volume, the more advantageous the fixed-cost model of local execution becomes.
For an RTX 5090 server (assuming an initial investment of ¥800,000–¥1,000,000 and monthly operating costs of ¥30,000–¥50,000; prices are for reference only), the break-even point against cloud APIs depends on monthly token volume. Compared to the RTX 4090, the RTX 5090 increases VRAM from 24GB to 32GB and improves memory bandwidth by 78%, enabling smoother operation of MoE models such as GPT OSS 20B.
| Monthly Token Volume | Conditions Favoring Local Deployment |
|---|---|
| Under 1 million tokens | Cloud API pay-as-you-go pricing is often cheaper |
| 5 million–20 million tokens | Monthly cloud API costs begin to exceed local fixed costs. Payback period ranges from a few months to 2 years, depending on API pricing |
| Over 20 million tokens | The local fixed-cost model holds a clear advantage. The cost per token decreases as processing volume increases |
A key assumption underlying this estimate is that the monthly operating cost of a local SLM is largely independent of processing volume. Because a GPU server follows a fixed-cost model, the cost per token decreases as processing volume increases. Conversely, for light usage with low monthly token volumes, cloud API pay-as-you-go pricing will typically be cheaper.
While the RTX 5090 requires a higher initial investment than the RTX 4090, its 32GB of VRAM allows Gemma 3 27B (Q4, 16GB) to run with headroom to spare, broadening the range of available models. Whether the additional investment is justified should be evaluated based on the required model size and processing volume.
The specific break-even point varies significantly depending on the cloud API provider's model, pricing structure, and contract terms. When evaluating adoption, it is recommended to measure your organization's monthly token volume over one to two weeks and then compare cloud API quotes against the fixed costs of local execution.
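That comparison can be packaged as a small payback calculator. All figures below are illustrative placeholders; substitute your own hardware quote, measured electricity and labor costs, and actual cloud bill.

```python
# Break-even sketch comparing cloud pay-as-you-go against a local
# fixed-cost GPU server. All prices are illustrative placeholders.

def payback_months(hw_cost: float, local_monthly: float,
                   cloud_monthly: float) -> float:
    """Months until cumulative cloud spend exceeds the local server's
    upfront cost plus its running costs. Returns inf if cloud is cheaper."""
    saving = cloud_monthly - local_monthly
    if saving <= 0:
        return float("inf")
    return hw_cost / saving

# Hypothetical: a ¥900,000 RTX 5090 server costing ¥40,000/month to run,
# vs a cloud bill of ¥200,000/month at the current token volume.
print(round(payback_months(900_000, 40_000, 200_000), 1))  # → 5.6 months
# At a light ¥30,000/month cloud bill, local never pays back.
print(payback_months(900_000, 40_000, 30_000))             # → inf
```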

Here are three failure patterns I've repeatedly observed in local LLM / SLM deployments. Each is technically addressable, but knowing them in advance will help you avoid unnecessary detours.
Failure pattern 1: Excessive quantization. Attempting to force a 70B model onto undersized hardware can degrade quality through aggressive quantization (Q2). In many cases, running a 14B model at Q4 yields higher accuracy than running a 70B model at Q2. Using a model sized to fit your hardware at Q4 or higher ultimately optimizes both accuracy and cost.
Referring back to the model selection table by GPU requirements mentioned earlier, the practical answer is to choose the largest model that can run at Q4 or higher within your own GPU memory.
Failure pattern 2: Underestimating KV cache memory. Another often-overlooked factor is memory consumption from the KV cache during inference. Even if the model itself is 8GB, processing long prompts (exceeding 4,000 tokens) can consume an additional 2–4GB for the KV cache. Even on a 24GB RTX 4090, a 14B model (Q4, 8GB) plus KV cache plus the OS's share of VRAM can effectively exhaust 18–20GB.
There are two workarounds. The first is to limit the context length to the minimum required for your use case (if 4,096 is sufficient, don't set it to 8,192). The second is to use vLLM's PagedAttention to manage the KV cache efficiently.
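The KV cache figure cited above can be estimated from the model's architecture: two tensors (K and V) per layer, each n_kv_heads × head_dim per token, at the cache's dtype precision. The dimensions below describe a hypothetical 14B-class dense model without grouped-query attention; GQA models divide the result by the head-group factor.

```python
# KV-cache size estimate: 2 tensors (K and V) per layer, each
# [n_kv_heads x head_dim] per token, at `dtype_bytes` precision.
# Dimensions below are a hypothetical 14B-class dense model WITHOUT
# grouped-query attention; GQA shrinks this by the head-group factor.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, dtype_bytes: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes / 1e9

# 40 layers, 40 KV heads, head_dim 128, 4,096-token context, fp16 cache
print(round(kv_cache_gb(40, 40, 128, 4096), 2))  # ≈ 3.36 GB per sequence
```

The result lands squarely in the 2–4GB range mentioned above, and it scales linearly with context length, which is why capping the context is the first workaround.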
Failure pattern 3: Neglecting updates after the PoC. Even when the initial PoC succeeds, systems are sometimes left unattended during the operational phase. Open-weight models receive new versions every few months. As seen with Phi-3 → Phi-4 and Gemma 2 → Gemma 3, accuracy improves significantly with each new generation, even at the same parameter count.
As a minimum operational framework, it is recommended to incorporate quarterly model update evaluations along with regular monitoring of inference speed and error rates. Even without a dedicated ML engineer, this is work that an IT administrator can automate with scripts.

If data can be sent externally, cloud APIs are the more convenient option. GPT OSS should be chosen when there are data sovereignty constraints, when API costs need to be converted to fixed expenses, or when operation in an offline environment is required. According to OpenAI's official benchmarks, GPT OSS 20B achieves scores comparable to o3-mini on MMLU, HumanEval, TauBench, and other benchmarks, while the 120B model achieves scores comparable to o4-mini. However, since benchmark scores do not guarantee accuracy in real-world business tasks, PoC validation on your own use cases is essential.
Can RAG over internal documents be built with an SLM? Yes. RAG (Retrieval-Augmented Generation) is a technique that embeds documents retrieved via search into prompts, so it is not dependent on model size. By combining GPT OSS 20B or Phi-4 with a vector DB (such as Qdrant or pgvector), you can build a pipeline for internal document search and answer generation. Since GPT OSS supports a context length of 128K tokens and Llama 4 Scout supports up to 10M tokens, there is plenty of flexibility when designing chunk sizes for RAG.
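As a minimal illustration of the chunking step in such a pipeline, the following character-based overlapping chunker is a simplified sketch: the sizes are arbitrary, and production pipelines usually split on token or sentence boundaries instead of raw characters.

```python
# Minimal overlapping chunker for RAG preprocessing. Splits by characters
# for simplicity; chunk_size and overlap are illustrative values to tune
# against your embedding model and the LLM's context window.

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    # max(..., 1) ensures short texts still yield a single chunk
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("A" * 2000, chunk_size=800, overlap=100)
print(len(chunks), [len(c) for c in chunks])  # → 3 [800, 800, 600]
```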
Depending on the hardware, local deployment can sometimes achieve lower latency for single requests, as cloud APIs incur additional overhead from network round trips and queue wait times. For GPT OSS 20B on an RTX 4090, token generation may begin faster than with a Dense 14B model, since MoE active parameters are as few as 3.6B. However, as the number of concurrent requests increases, the GPU becomes a bottleneck, making it necessary to design for throughput using batch inference frameworks such as vLLM. Since actual latency varies significantly depending on the model, quantization, prompt length, and batch size, measurement in your own environment is recommended rather than relying on benchmark values.
Production deployment experience is growing, as evidenced by case reports from the open-source community and cloud providers. There are two main points to keep in mind when deploying a QLoRA fine-tuned SLM into production. The first is training data quality control — training on noisy data can cause "catastrophic forgetting," where general performance degrades. The second is version management of LoRA adapters, as adapter retraining becomes necessary when the base model is updated. Technical details on fine-tuning are covered in the introductory article on PEFT.

The decision to adopt a local LLM / SLM comes down to three factors: "data sovereignty requirements," "monthly token volume," and "task complexity." The emergence of GPT OSS has made this decision even simpler. If there are constraints preventing data from being sent to the cloud and you need to process 5 million or more tokens per month, GPT OSS 20B (RTX 4090) or GPT OSS 120B (H100) can deliver accuracy equivalent to the cloud API's o3-mini / o4-mini locally.
Start small by validating with Ollama + GPT OSS 20B, and once you are satisfied with the accuracy, fine-tune it on your proprietary data using QLoRA, then productionize it with vLLM. Following these steps allows you to minimize initial investment while achieving both data sovereignty and cost optimization.
Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).