
Fine-tuning is a technique that further trains the weight parameters of an existing large language model (LLM) to optimize it for specific tasks or domains. This article sets out criteria for deciding whether a B2B company should pursue custom model development, and positions fine-tuning relative to prompt engineering, RAG, and PEFT.
Drawing on our on-the-ground experience supporting B2B AI adoption in Thailand, we provide an in-depth look at cost estimates and operational pitfalls. We hope this serves as a starting point for internal evaluation.
Many companies stop considering fine-tuning because of its image as "advanced and expensive." PEFT and LoRA have since opened up options that cost an order of magnitude less to start, but hasty adoption can still waste development budget on problems that RAG alone would have solved.
Fine-tuning is a leading technique for adapting general-purpose LLMs to a company's unique business context. We begin by clarifying its definition and its relationship to adjacent technologies—prompt engineering and RAG—before examining the structural reasons it is attracting attention among B2B companies.
Fine-tuning is a technique that performs additional training on a pre-trained language model using a company's own data, adjusting the output style and handling of specialized terminology to suit a given purpose. If pre-training is the process of creating a model with "general knowledge," then fine-tuning is the equivalent of "specialized professional training."
Technically, the model's weights are updated in the direction that minimizes error against the input data. Depending on the scope of parameters updated, this is broadly divided into Full Fine-tuning, which targets all layers, and PEFT (Parameter-Efficient Fine-Tuning), which trains only a subset.
The data used for training primarily consists of supervised data pairing inputs with ideal outputs. The concept is to transfer into the model the operational knowledge a company has accumulated—such as "respond to this inquiry in this way" or "produce this kind of summary from this contract."
In Thailand's B2B environment, fine-tuning is often considered for the purpose of reproducing business writing styles used in day-to-day communication with customers across three languages: Japanese, English, and Thai. It is attracting attention as a means of producing output with "expressions that hold up in practical local use"—something that general-purpose LLMs struggle to deliver.
There are broadly three approaches to equipping an LLM with specialized knowledge: prompt engineering, RAG (Retrieval-Augmented Generation), and fine-tuning. These three are not competing alternatives but complementary techniques that serve different purposes.
Prompt engineering is a method that controls model behavior solely through crafting instruction text, without modifying the model's weights. Implementation costs are virtually zero and results can be validated the same day, but sending lengthy instructions with every request increases token costs and latency.
RAG is an approach that retrieves relevant information from an external document database or vector store and inserts that content into the prompt to guide generation. It is well-suited for providing up-to-date information or internal company knowledge as "facts," making it the first choice in domains that deal with factual knowledge.
Fine-tuning rewrites the behavior of the model itself. It is effective when what is needed is not knowledge but the internalization of a skill—such as a specific writing style, JSON formatting, or the interpretation of industry-specific jargon.
A useful starting point for decision-making is: "Use RAG for a lack of facts, fine-tuning for a lack of skills, and prompting if simple instructions suffice." For detailed selection criteria, refer to How to Choose Between Fine-Tuning and RAG (slug: fine-tuning-vs-rag-comparison).
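The RAG flow described above (retrieve relevant text, then insert it into the prompt) can be sketched in a few lines. This is a minimal illustration only: the bag-of-words similarity stands in for a real embedding model and vector store, and the documents, function names, and prompt wording are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k documents most similar to the query."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: cosine(q, tokenize(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, documents: list[str]) -> str:
    """Insert the retrieved text into the prompt as context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical internal documents, invented for this example.
DOCS = [
    "Warranty for model X100 is 24 months from delivery.",
    "Office hours are 9:00 to 18:00 on weekdays.",
]
prompt = build_prompt("How long is the warranty for X100?", DOCS)
print(prompt)
```

Note that the model's weights are never touched: updating a fact means editing `DOCS`, which is why RAG suits knowledge that changes often.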
Three structural shifts underlie the growing viability of fine-tuning as a practical option in the B2B space.
The first is the dramatic reduction in training costs brought about by the emergence of PEFT/LoRA. Whereas traditional full fine-tuning once required costs on the order of tens of millions of yen, using LoRA has made it increasingly possible to complete the work with cloud GPU usage fees ranging from tens of thousands to over a hundred thousand yen, bringing it within reach of proof-of-concept budgets.
The second is that general-purpose LLMs have begun to show their limitations—struggling with industry-specific terminology and failing to reproduce a company's own tone. In operations that demand both specialization and strict formatting, such as manufacturing inspection reports or financial credit documents, accuracy hits a ceiling with prompt tuning alone.
The third is the context of data sovereignty. Under Thailand's PDPA and Japan's amended Act on the Protection of Personal Information, sending sensitive data to external APIs is becoming an audit concern, and custom models running on-premises or within a VPC are emerging as a viable alternative.
However, the fact that fine-tuning is attracting attention is a separate question from whether a given company should pursue it right now. The following sections examine the criteria for making that determination.
Fine-tuning is not a single technique but a collection of approaches that differ in the range of parameters targeted for training and the type of training signal used. Understanding this taxonomy is the starting point for designing the right balance of deployment scale, cost, and accuracy.
Full Fine-tuning is the traditional approach in which all of a model's parameters are subject to training. While it offers the highest expressive capacity, the GPU memory, computation time, and energy costs required for training are enormous.
As a practical benchmark, full fine-tuning a 7B-class open model requires multiple high-end GPUs running for tens of hours. At cloud GPU rates, a single training run can easily fall in the range of hundreds of thousands to several million yen.
Full fine-tuning also carries the risk of "catastrophic forgetting" — a phenomenon in which the model loses general capabilities acquired during pre-training in exchange for learning a new task. Maintaining generality requires careful mixed-dataset design and precise learning rate control, which drives up engineering costs further.
In typical B2B projects, full fine-tuning is chosen only when building a dedicated model for a niche domain with strong differentiation potential, or when complete weight ownership is required for licensing or data sovereignty reasons. The practical approach is to first validate PEFT, and only move to full fine-tuning once its limitations become apparent.
PEFT (Parameter-Efficient Fine-Tuning) is an umbrella term for parameter-efficient methods that train only a few percent or less of a model's total parameters. These methods keep training and operational costs one to two orders of magnitude lower than full fine-tuning, while still delivering practically sufficient accuracy for many business tasks.
The most representative techniques are LoRA (Low-Rank Adaptation) and its quantized variant, QLoRA. LoRA keeps the base model frozen and instead adds small low-rank matrices to each layer, training only those additions. A key advantage is that even a single consumer-grade GPU can be used to fine-tune a 7B-class model.
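To make the "small low-rank matrices" concrete, the sketch below counts trainable parameters for a single weight matrix under LoRA versus full fine-tuning. The 4096 x 4096 layer size and rank 8 are illustrative assumptions, not values taken from any particular model.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for LoRA: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes (assumptions): one 4096 x 4096 projection, rank 8.
full = full_finetune_params(4096, 4096)
lora = lora_params(4096, 4096, rank=8)
ratio = lora / full
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {ratio:.2%}")
# prints "full: 16,777,216  LoRA: 65,536  ratio: 0.39%"
```

Under these assumed sizes the adapter trains well under one percent of the layer's parameters, which is where the memory savings during training come from.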
QLoRA quantizes the base model to 4-bit or similar precision before applying LoRA, cutting memory consumption roughly in half. It is widely used as a practical solution in on-premises environments in Thailand where GPU budgets are constrained.
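As a rough back-of-the-envelope check on the effect of quantization, the sketch below estimates the memory needed to hold the base weights alone at 16-bit and 4-bit precision. It is a weights-only approximation under stated assumptions: it ignores activations, adapter weights, optimizer state, and quantization metadata, so total training memory shrinks by less than this figure suggests.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # a 7B-class model, as discussed in the text

fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)
print(f"16-bit weights: {fp16_gb:.1f} GB, 4-bit weights: {int4_gb:.1f} GB")
# prints "16-bit weights: 14.0 GB, 4-bit weights: 3.5 GB"
```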
LoRA training results can be saved as independent adapter files of tens to hundreds of megabytes, and managed as a "collection of files" that can be swapped by use case (e.g., one for sales documents, another for technical translation). For more details, see Introduction to PEFT (slug: peft-introduction).
Fine-tuning also changes in character depending on how the training signal is provided. The primary distinction is between supervised fine-tuning (SFT) and preference-based approaches such as RLHF (reinforcement learning from human feedback) and DPO (Direct Preference Optimization).
SFT is a straightforward method that trains on paired data of the form "this input maps to this correct output." It is the workhorse of practical business applications. The training process is intuitive and failures are relatively easy to diagnose, making SFT the standard starting point for leveraging proprietary data in B2B contexts.
RLHF and DPO involve having humans comparatively evaluate multiple candidate outputs and steering the model toward producing "more preferred" responses. These approaches are powerful when there is no single correct answer but tone or safety needs to be refined — however, the difficulty of collecting evaluation data and the operational overhead are both significantly higher than SFT, making them unsuitable for initial B2B deployments.
DPO is gaining traction as an alternative that enables preference learning from "preferred / not preferred" output pairs without requiring the complex reinforcement learning loop of RLHF. It is easiest to think of it as a finishing step applied after SFT.
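For readers who want to see what "preference learning from pairs" means numerically, here is a minimal sketch of the DPO loss for a single preferred/not-preferred pair. It assumes the sequence log-probabilities under the model being trained and under a frozen reference model are already available; the `beta` value and the example numbers are placeholders, and a real training setup would compute this over batches with an ML framework.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # placeholder strength of the preference signal
) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * margin difference))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # Numerically stable -log(sigmoid(x)) = log(1 + exp(-x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Policy identical to the reference: no preference learned yet.
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the preferred output.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(neutral, improved)
```

The loss falls as the model raises the likelihood of the preferred output relative to the reference, with no separate reward model or reinforcement learning loop involved.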
A realistic roadmap for selection is: first use SFT to capture 80% of the quality target, then apply DPO or RLHF as needed to close the remaining 20%.
The effectiveness of fine-tuning varies significantly depending on the type of problem being addressed. Whether the investment pays off is often determined less by the soundness of the technology selection and more by whether the task at hand is one that fine-tuning is actually well-suited to solve.
Fine-tuning delivers the greatest impact in domains where a general-purpose LLM "knows the words but misreads the context." In Thai B2B operations, the return on investment is highest for the following three types of challenges.
The first is interpreting industry terminology and company-specific expressions. There are many terms—shipping documents in logistics, inspection standards in manufacturing, drug names in healthcare—that a model may know in isolation yet handle incorrectly within the broader context of a document. Training on representative in-house examples via SFT significantly reduces the rate of misinterpretation.
The second is establishing consistent output style. This includes proposals written to internal templates, recurring reports in a fixed format, and customer communications that adhere to brand tone. When inconsistencies persist no matter how carefully the prompt is crafted, embedding the "pattern" through fine-tuning is the most efficient solution.
The third is stabilizing structured output. When JSON or YAML must be returned in a defined schema every time, relying on prompts alone will produce omissions or deviations at a non-trivial rate. Training on several hundred to several thousand correct examples via SFT dramatically improves format compliance.
The common thread is that the goal is not "correctness of the answer" but rather "stability of behavior and format."
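Format compliance for structured output is straightforward to measure before and after fine-tuning. The sketch below checks model outputs against a small required-field schema and reports the share that comply. The schema, field names, and sample outputs are hypothetical and exist only to illustrate the measurement.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "customer_id": str,
    "amount": (int, float),
    "currency": str,
}


def is_compliant(raw: str) -> bool:
    """True if raw parses as a JSON object with every required field and type."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    for key, expected in REQUIRED_FIELDS.items():
        value = obj.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return True


def compliance_rate(outputs: list[str]) -> float:
    """Fraction of outputs that satisfy the schema (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(is_compliant(o) for o in outputs) / len(outputs)


samples = [
    '{"customer_id": "C001", "amount": 1200, "currency": "THB"}',
    '{"customer_id": "C002", "amount": "1200"}',          # wrong type, missing field
    'Sure! Here is the JSON: {"customer_id": "C003"}',    # not pure JSON
    '{"customer_id": "C004", "amount": 99.5, "currency": "JPY"}',
]
print(compliance_rate(samples))  # prints 0.5
```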
A prime example of where fine-tuning is ill-suited is a domain where knowledge changes frequently. Because knowledge learned through fine-tuning is baked into the model's weights, updating that information requires retraining. When the retraining lead time cannot keep pace with the business, this becomes a critical drawback.
When data that changes daily or weekly—such as the latest product specifications, current month's campaign details, or up-to-date customer contact records—is handled via fine-tuning, the operation ends up running training every week. For managing the same facts, RAG with references to an external database is far easier to maintain.
Fine-tuning is also unsuitable when the goal is simply to load a large body of factual knowledge. The essence of fine-tuning is "learning behavior," not memorizing a dictionary. For something like 100,000 FAQ entries, a retrieval-based RAG approach is superior in both accuracy and cost to having the model memorize them.
It should also be avoided when data volume is insufficient. Attempting fine-tuning at a stage where only around 100 training pairs have been collected carries a high risk of overfitting and degrading existing capabilities. A stable outcome generally requires a minimum of several hundred to several thousand examples.
RAG and fine-tuning are not competing technologies; the modern approach is to design a roadmap that assumes they will be used in combination. The decision can be organized along three axes.
The first axis is knowledge versus behavior. If factual information needs to be retrieved, use RAG; if the goal is to stabilize writing style, format, or the handling of specialized terminology, use fine-tuning. When both are needed, the standard approach is to use fine-tuning to align the model with the industry's style, then supply up-to-date information via RAG.
The second axis is update frequency. If content changes daily or weekly, use RAG; if updates on a semi-annual or annual basis are sufficient, fine-tuning becomes a viable option. For RAG-based infrastructure design, refer to Introduction to Vector Databases (slug: vector-database-guide).
The third axis is latency and cost sensitivity. Because RAG triggers a search at inference time, response latency increases and token costs accumulate. At high usage frequency, lightweight inference on a fine-tuned model wins on TCO. For low-frequency, high-accuracy use cases, the flexibility of RAG is advantageous.
In practice, implementations often settle on a hybrid design of "using fine-tuning to lock in behavior and RAG to supply facts." For specific routing approaches, refer to the Hybrid LLM × SLM Design Guide (slug: hybrid-llm-slm-routing-design-guide).
From PoC through to production operation, the essence lies not in designing the technical training cycle, but in building the mechanisms for data, evaluation metrics, and operational feedback. We will examine this across three phases.
The outcome of fine-tuning is determined almost entirely by the quality of the dataset. The majority of PoCs that conclude "it didn't work as well as expected" trace the root cause to the data side, not to model selection.
The first thing to decide is the data format. For supervised learning, JSON Lines pairing inputs with ideal outputs is the standard, and stratified sampling should be used to ensure that variation across pairs—customer type, inquiry category, language—is not skewed.
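A minimal sketch of the two ideas in this paragraph, JSON Lines pairs and stratified sampling, might look like the following. The field names (`input`, `output`, `language`) are assumptions rather than a required format, and the sampler uses equal allocation per stratum; proportional allocation is an equally valid choice depending on the use case.

```python
import json
import random
from collections import defaultdict


def stratified_sample(records: list[dict], key: str, per_stratum: int, seed: int = 0) -> list[dict]:
    """Draw up to per_stratum records from each group defined by records[key]."""
    rng = random.Random(seed)
    groups: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    sample: list[dict] = []
    for name in sorted(groups):  # sorted for reproducible ordering
        items = groups[name]
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample


def to_jsonl(records: list[dict]) -> str:
    """Serialize input/output pairs as JSON Lines, one object per line."""
    return "\n".join(
        json.dumps({"input": r["input"], "output": r["output"]}, ensure_ascii=False)
        for r in records
    )


# Hypothetical raw pairs, deliberately skewed toward English.
raw = (
    [{"input": f"en question {i}", "output": f"en answer {i}", "language": "en"} for i in range(6)]
    + [{"input": f"th question {i}", "output": f"th answer {i}", "language": "th"} for i in range(2)]
    + [{"input": "ja question 0", "output": "ja answer 0", "language": "ja"}]
)
balanced = stratified_sample(raw, key="language", per_stratum=2)
jsonl_text = to_jsonl(balanced)
print(jsonl_text)
```

`ensure_ascii=False` keeps Thai and Japanese text readable in the file instead of escaping it.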
Next, establish a minimum quality threshold. Even a few percent of pairs containing typos or incoherent context is enough to destabilize training results. In Thai trilingual operations (Japanese, English, and Thai), inconsistent translations easily become noise, so having local staff conduct a final review is the practical approach.
As a rough guide for data volume, aim for 500–1,000 pairs for stable style acquisition and 3,000 or more for learning complex behaviors. A strategy that prioritizes quality over quantity is more directly tied to results.
The handling of sensitive information must not be overlooked. PII such as customer names, business partner names, and contract terms should be masked before training. Neglecting this creates a risk that the model will reproduce parts of the training data verbatim at inference time, and it also poses issues for PDPA compliance.
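As an illustration of masking before training, the sketch below replaces email addresses, phone-like numbers, and known customer names with placeholder tokens. The regular expressions are deliberately simple and will both miss cases and over-match (for example, the phone pattern also matches dates written as digits and hyphens), so treat this as a starting point under those assumptions rather than a complete PII filter. The sample name and contact details are invented.

```python
import re

# Simplified patterns (assumptions); real data needs locale-specific rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def mask_pii(text: str, name_to_id: dict[str, str]) -> str:
    """Replace emails, phone-like numbers, and known names with tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    # Replace longer names first so partial names do not break full ones.
    for name in sorted(name_to_id, key=len, reverse=True):
        text = text.replace(name, f"[{name_to_id[name]}]")
    return text


# Fictional example data.
names = {"Somchai Jaidee": "CUST_001"}
original = "Contact Somchai Jaidee at somchai@example.com or +66 2 123 4567."
masked = mask_pii(original, names)
print(masked)
# prints "Contact [CUST_001] at [EMAIL] or [PHONE]."
```

Using stable anonymous IDs instead of a single generic token preserves the fact that two records refer to the same customer without exposing who that customer is.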
Base model selection is determined by combining three criteria: license, size, and deployment environment.
On the licensing front, always verify commercial use terms, restrictions on distributing derivative models, and ownership of training outputs. Open models may come with conditions such as academic-use-only restrictions or requirements for separate contracts with large enterprises—conditions that can become a dead end for future business expansion.
For size, use "the minimum capability threshold required for the task" as your benchmark. For relatively straightforward tasks such as summarization or classification, a 3B–7B class model is often sufficient and keeps inference costs down. For advanced multilingual dialogue or complex reasoning, the 13B–70B class comes into consideration. Larger models tend to yield higher accuracy, but inference latency and GPU memory requirements rise sharply.
Deciding on the deployment environment upfront (cloud API-based vs. on-premises self-hosted) narrows down the options. In Thai manufacturing settings, there is a growing trend toward selecting open models for on-premises deployment due to requirements that sensitive data not leave the premises.
For evaluation metrics, prioritize your own business KPIs over general-purpose benchmarks (such as MMLU). For automated proposal generation, continuously measure metrics such as "template compliance rate," "industry terminology accuracy," and "number of rejections during manual review."
Training should not be treated as a one-time run, but designed as a cycle that incorporates evaluation and feedback. Run multiple loops of: data preparation → training → automated evaluation → manual review → supplementing data for gaps → retraining.
For the initial training run, start with conservative hyperparameter values (low learning rate, few epochs). Correcting overfitting after the fact tends to waste both time and GPU costs, so starting from the safe side and tuning incrementally is ultimately faster.
In the evaluation phase, combine automated evaluation directly tied to business KPIs with manual review. Automated evaluation checks template compliance rate and format error rate, while manual review assesses "naturalness of subtle phrasing" and "whether anything feels out of place in the industry context." For Thai operations, qualitative evaluation by local staff serves as the final line of quality assurance.
Even after production deployment, it is important to have a mechanism for accumulating operation logs as a source for retraining data. Convert user edit histories, complaint cases, and text rewritten by support staff into training data, and perform incremental training on a quarterly basis so the model can keep pace with changes in business operations.
For cost reduction techniques, refer to the LLM Cost Optimization Guide (slug: llm-cost-optimization-guide).
Looking only at the initial training cost when evaluating fine-tuning expenses gives an incomplete picture. Assessing from a TCO perspective—including post-deployment operations, updates, and risk management—is also important for obtaining internal approval.
The starting point for cost estimation is the combination of three factors: base model size, training method, and volume of training data. For training a 7B-class model on 5,000 pairs using LoRA, a standard estimate is 6–12 hours on 1–2 high-end GPUs, with a single training run typically falling in the range of tens of thousands to hundreds of thousands of yen.
By contrast, full fine-tuning on a model of 13B or larger requires more GPUs and longer training times, resulting in costs one to two orders of magnitude higher. Since the cost difference is larger than the accuracy difference, it is a cardinal rule to first verify whether PEFT can meet your requirements.
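The GPU portion of a training estimate is simple arithmetic: GPUs times hours times an hourly rate. The sketch below applies it to the LoRA scenario described above (1–2 GPUs for 6–12 hours). The hourly rate is a hypothetical placeholder, not a quoted price, and the result covers GPU rental for one run only; data preparation, repeated or failed runs, evaluation, and engineering time are outside this figure, so a full estimate will be higher.

```python
def training_run_cost_yen(gpu_count: int, hours: float, yen_per_gpu_hour: float) -> float:
    """GPU rental cost for a single training run."""
    return gpu_count * hours * yen_per_gpu_hour


# Hypothetical placeholder rate; substitute your provider's actual price.
YEN_PER_GPU_HOUR = 1_000

low = training_run_cost_yen(gpu_count=1, hours=6, yen_per_gpu_hour=YEN_PER_GPU_HOUR)
high = training_run_cost_yen(gpu_count=2, hours=12, yen_per_gpu_hour=YEN_PER_GPU_HOUR)
print(f"GPU rental per run: {low:,.0f} to {high:,.0f} yen")
```

Multiplying the per-run figure by the number of training loops planned for the PoC gives a more realistic GPU budget line.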
Even more easily overlooked than training costs are inference costs. Whether you maintain GPU availability around the clock or use serverless inference with on-demand startup makes a significant difference in monthly expenses. For low-frequency batch inference, serverless is practical; for high-frequency use, keeping a container running continuously is the realistic choice.
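The choice between always-on and serverless inference can be framed as a break-even comparison on monthly request volume. The sketch below is a simplified model with hypothetical prices; it ignores cold-start latency, minimum billing increments, and autoscaling, all of which affect the real answer.

```python
def always_on_monthly_yen(yen_per_hour: float, hours_per_month: float = 720) -> float:
    """Cost of keeping one GPU instance running all month (30 days x 24 h)."""
    return yen_per_hour * hours_per_month


def serverless_monthly_yen(requests: int, seconds_per_request: float, yen_per_second: float) -> float:
    """Cost when billed only for the seconds spent serving requests."""
    return requests * seconds_per_request * yen_per_second


def cheaper_option(requests: int, seconds_per_request: float,
                   yen_per_second: float, yen_per_hour: float) -> str:
    """Return which deployment style is cheaper at this request volume."""
    serverless = serverless_monthly_yen(requests, seconds_per_request, yen_per_second)
    always_on = always_on_monthly_yen(yen_per_hour)
    return "serverless" if serverless < always_on else "always-on"


# Hypothetical prices (assumptions): 300 yen/hour always-on,
# 0.2 yen per second serverless, 5 seconds of GPU time per request.
low_volume = cheaper_option(10_000, 5, 0.2, 300)
high_volume = cheaper_option(500_000, 5, 0.2, 300)
print(low_volume, high_volume)  # prints "serverless always-on"
```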
Designing a multi-model strategy with inference routing can minimize the uptime required for fine-tuned models. The approach to combining models is covered in the Hybrid LLM × SLM Design Guide (slug: hybrid-llm-slm-routing-design-guide).
Fine-tuning is not a "build once and done" endeavor. Regular retraining will occur in response to changes in business operations, regulatory updates, and the addition of new services. Failing to include this update cycle in your estimates will lead to budget overruns during the operational phase.
Design the retraining frequency to match the pace of change in your business domain. For domains that remain stable over years—such as manufacturing inspection standards—every six months to once a year is appropriate. For domains that shift significantly each quarter—such as marketing or customer support—every three months is a reasonable target.
It is not necessary to retrain from a full dataset each time. Incorporating "continual learning," where only incremental data is used for additional training, can compress computational costs to roughly one-third. However, continual learning increases the risk of catastrophic forgetting, so it is important to mix in a portion of past data during training to mitigate this.
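Mixing in past data during incremental training can be as simple as sampling a replay set from earlier examples. The following sketch shows one way to build such a mixed dataset; the 20% replay ratio is an arbitrary placeholder that would need tuning against your own regression tests.

```python
import random


def build_incremental_set(
    new_examples: list,
    past_examples: list,
    replay_ratio: float = 0.2,  # placeholder: past examples per new example
    seed: int = 0,
) -> list:
    """Combine all new examples with a random sample of past examples."""
    rng = random.Random(seed)
    n_replay = min(len(past_examples), round(len(new_examples) * replay_ratio))
    mixed = list(new_examples) + rng.sample(list(past_examples), n_replay)
    rng.shuffle(mixed)  # avoid presenting all new data before all replay data
    return mixed


# Hypothetical data: 100 new pairs this quarter, 1,000 pairs from earlier runs.
new_data = [f"new-{i}" for i in range(100)]
past_data = [f"past-{i}" for i in range(1000)]
train_set = build_incremental_set(new_data, past_data)
print(len(train_set))  # prints 120
```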
Having an evaluation and rollback mechanism in place is also critical. Before deploying to production, compare against the previous version using a fixed evaluation set to confirm there is no regression. Set up version control and automated adapter switching in advance so that, if degradation is detected, you can immediately revert to the previous version.
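A regression gate and rollback can be sketched with very little machinery. In the example below, a candidate adapter is deployed only if no metric on the fixed evaluation set drops by more than a tolerance, and the registry can revert to the previous version. The metric names, scores, tolerance, and version labels are hypothetical, and the check assumes higher scores are better.

```python
class AdapterRegistry:
    """Keeps an ordered history of deployed adapter versions."""

    def __init__(self, initial_version: str) -> None:
        self._history = [initial_version]

    @property
    def active(self) -> str:
        return self._history[-1]

    def deploy(self, version: str) -> None:
        self._history.append(version)

    def rollback(self) -> str:
        """Revert to the previous version if one exists; return the active one."""
        if len(self._history) > 1:
            self._history.pop()
        return self.active


def has_regression(candidate: dict, baseline: dict, tolerance: float = 0.01) -> bool:
    """True if any baseline metric drops by more than tolerance (or is missing)."""
    return any(
        candidate.get(metric, float("-inf")) < score - tolerance
        for metric, score in baseline.items()
    )


# Hypothetical scores on a fixed evaluation set.
baseline = {"template_compliance": 0.95, "terminology_accuracy": 0.90}
candidate = {"template_compliance": 0.97, "terminology_accuracy": 0.80}

registry = AdapterRegistry("sales-v1")
if not has_regression(candidate, baseline):
    registry.deploy("sales-v2")
print(registry.active)  # prints "sales-v1" because terminology accuracy regressed
```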
A risk that comes with fine-tuning (FT) is "memorization," in which the model retains portions of the training data and unintentionally reproduces them at inference time. Customer data containing proper nouns and specific numerical values can be partially reproduced, particularly during long-form generation.
There are three pillars of countermeasures. First, mask PII in the training data in advance. Customer names, transaction amounts, contract terms, and similar information should be replaced with tokens or anonymous IDs before being fed into training. Second, conduct red-teaming against the trained model by deliberately attempting to elicit sensitive information through prompts. Third, if a leakage risk remains, exclude the relevant data and retrain the model.
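One simple check that can support red-teaming is to scan model outputs for word sequences copied verbatim from the training data. The sketch below flags shared 5-word sequences; the window size is an arbitrary assumption, the company name and figures are invented, and this kind of exact-match test will not catch paraphrased leakage.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All n-word sequences in text, lowercased and whitespace-tokenized."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(output: str, training_texts: list[str], n: int = 5) -> set[tuple[str, ...]]:
    """Return n-grams that appear both in the output and in any training text."""
    output_grams = ngrams(output, n)
    hits: set[tuple[str, ...]] = set()
    for text in training_texts:
        hits |= output_grams & ngrams(text, n)
    return hits


# Fictional training record and model outputs.
training = ["The contract amount for Siam Logistics is 4,500,000 baht payable in March"]
leaky_output = "As agreed, the contract amount for Siam Logistics is 4,500,000 baht."
clean_output = "The agreed amount will be confirmed in the signed contract."

print(len(verbatim_overlap(leaky_output, training)) > 0)  # prints True
print(len(verbatim_overlap(clean_output, training)) > 0)  # prints False
```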
The intellectual property risk perspective cannot be ignored either. If copyrighted content is mixed into the training source data, similar expressions may appear in the generated output. It is advisable to manage the base model's license and the provenance of training data together, and to establish a mechanism for periodically auditing whether commercial use conditions are being met.
Under data protection laws including Thailand's PDPA, notification to and consent from data subjects when using sensitive information for training becomes a key issue. Involving the legal and compliance teams early and documenting the data usage policy is effective in preventing rework at later stages.
Whether to proceed with fine-tuning is more a matter of management judgment than a technical one. The following is a summary of items to verify in internal discussions. If any item cannot be answered with a "Yes," it is prudent to stay with RAG and prompt engineering and produce results there first.
First, the nature of the problem. Is what you are seeking "stability in behavior and style," or "up-to-date factual knowledge"? For the latter, RAG is the primary candidate, and FT is often unnecessary.
Second, data requirements. Can you prepare at least several hundred, and ideally several thousand, high-quality training pairs? Are they in a state where they can be extracted from operational logs accumulated internally?
Third, PoC budget. Can you put together an estimate of several hundred thousand yen for a PoC premised on LoRA? Can you anticipate a budget one order of magnitude higher for scaling out to Full FT?
Fourth, operational structure. Can engineering resources capable of regularly executing retraining, evaluation, and rollback be secured internally or through an external partner?
Fifth, data protection and compliance requirements. Is alignment with the PDPA and internal regulations in place, and has the policy for handling sensitive information been agreed upon with the legal team?
Projects that can answer "Yes" to all five items are strong candidates for investing in FT. If even one answer is "No," prioritize addressing that item first.
The following is a concise summary of representative questions we receive when B2B organizations are considering fine-tuning.
Q1: What is the minimum budget to get started? For a PoC using LoRA, many projects can start at several hundred thousand yen, covering data preparation, training, and evaluation. Proceeding to Full FT requires a budget one order of magnitude larger.
Q2: How many data samples are needed to see results? A rough guideline is 500–1,000 samples for stable style adoption, and 3,000 or more for learning complex behaviors. Ensuring the quality of the pairs has a more direct impact on results than simply increasing the number of samples.
Q3: Is it possible without in-house engineers? LoRA/QLoRA is increasingly available as managed services, making it relatively easy to outsource the training itself if you can prepare the data. However, evaluation design, data quality management, and ongoing operations require in-house expertise.
Q4: Is there a risk that general-purpose capabilities will degrade after FT? With Full FT, the risk of catastrophic forgetting is high, but with LoRA, the base model's weights are not modified, so the impact is limited. It is advisable to incorporate general-purpose benchmark evaluations as regression tests.
Q5: Should I start with RAG or FT? In almost all cases, you should start with RAG. The recommended sequence—resolving factual knowledge gaps with RAG first, then considering FT only once the remaining "behavioral and formatting issues" become apparent—is the safer approach in terms of both investment efficiency and operational sustainability.
Fine-tuning delivers significant results in situations where the limitations of a general-purpose LLM are concentrated in "behavior, style, and handling of specialized terminology." On the other hand, if the challenge is retrieving up-to-date information or refreshing factual knowledge, RAG and prompt engineering should be considered first.
Decision-making for B2B companies becomes simpler when approached in three stages. In the first stage, thoroughly apply RAG and prompt engineering to address the domain of factual knowledge. In the second stage, if the remaining challenges are concentrated in "unstable behavior," "inconsistent formatting," or "handling of industry-specific terminology," proceed to a small-scale PoC using PEFT/LoRA. In the third stage, once the PoC has produced results and the three conditions of data, operational structure, and budget are all in place, consider a full-scale investment in Full FT or a proprietary model.
Based on our on-the-ground experience supporting B2B AI adoption in Thailand, the majority of projects can achieve sufficient results at the second stage. Scaling out to Full FT is limited to a subset of cases where uniqueness is directly tied to competitive advantage and data sovereignty requirements are stringent.
Fine-tuning is best understood not as a binary "do it or don't" choice, but as one tool in a toolkit for selecting the approach that addresses your organization's specific challenges at the smallest possible cost.

Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).