Choosing Between Fine-Tuning and RAG: A Practical Guide Comparing Cost, Accuracy, and Use Cases

Updated: April 6, 2026 | Published: April 6, 2026

Choosing Between Fine-Tuning and RAG to Make AI Smarter with Internal Data: The Decision Significantly Impacts Both Cost and Accuracy

When leveraging Generative AI in business operations, fine-tuning and RAG (Retrieval-Augmented Generation) represent two leading approaches to the question of "how to incorporate internal data." The former directly updates the weights of an LLM (Large Language Model) to embed knowledge, while the latter retrieves external documents in real time to generate responses.

This article is intended for engineers, product managers, and IT system administrators considering AI adoption. It organizes both methods along four axes—cost, accuracy, update frequency, and security—to help you make the right choice for your organization's use case.

By the end, you will be able to make concrete judgments about "why combining the two approaches is effective" and "under which conditions each should be prioritized."

The two approaches differ fundamentally in where the knowledge lives: inside the model's weights or in an external document store. Each has its own strengths and limitations, and choosing the wrong method for a given use case can lead to unexpected costs or degraded accuracy.

Let's start by reviewing the basic mechanisms of each method to build a foundation for making that choice.

How Fine-Tuning Works and Its Impact on the Model

Fine-tuning is a technique that adapts a pre-trained Foundation Model to a specific task by re-training its weight parameters on additional data. Because the model itself "internalizes" the knowledge, there is no need to reference external data at inference time.

Training workflow

Prepare supervised training data (input–output pairs)
Update parameters while minimizing the loss function
Deploy the updated model for inference
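The three workflow steps can be sketched on a toy scale. This is only an illustration under heavy simplification: a one-parameter model y = w * x stands in for an LLM, and the data, learning rate, and function names are hypothetical.

```python
# Toy illustration only: a one-parameter "model" y = w * x stands in for an LLM.

def mse_loss(w, pairs):
    """Mean squared error of the model y = w * x over (input, output) pairs."""
    return sum((w * x - y) ** 2 for x, y in pairs) / len(pairs)

def fine_tune(w, pairs, lr=0.05, steps=200):
    """Update the weight by gradient descent to minimize the loss."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
        w -= lr * grad
    return w

# Step 1: supervised input-output pairs (hypothetical data following y = 3x).
pairs = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

pretrained_w = 1.0                        # stands in for the pre-trained weights
tuned_w = fine_tune(pretrained_w, pairs)  # Step 2: minimize the loss

# Step 3: "deploy" the updated weight for inference.
def predict(x, w=tuned_w):
    return w * x
```

After training, the knowledge (here, the factor 3) lives in the weight itself, so inference needs no external lookup. The same property means new data requires another training run.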

The impact on the model manifests in two key ways. The first is output style fixation. The model becomes capable of consistently reproducing specific writing styles and formats, such as the standard phrasing of legal documents or the notation conventions of medical records. The second is domain vocabulary reinforcement. Through training, the model tends to handle industry-specific terminology and expressions more effectively, including terms that the BPE (Byte-Pair Encoding) tokenizer splits into many subword fragments because they are rare in general text.

However, there are also constraints to be aware of.

High training cost: Full fine-tuning consumes large amounts of GPU (Graphics Processing Unit) resources
Knowledge freshness degrades easily: Information that arises after training is not reflected in the model
Hallucination risk: The model may attempt to fill in facts not present in the training data

PEFT (Parameter-Efficient Fine-Tuning) methods such as LoRA (Low-Rank Adaptation) have become widely adopted as cost-reduction techniques. LoRA freezes the original weights and trains only small low-rank matrices added alongside them, rather than updating all parameters, which significantly reduces training costs. Nevertheless, since the model's effective weights are being modified, re-training and re-deployment remain necessary with each update. The key distinction from RAG, covered in the next section, lies precisely in this characteristic: knowledge becomes fixed inside the model.

How RAG Works and the Retrieval-Augmented Generation Flow

RAG (Retrieval-Augmented Generation) is a technique in which, before an LLM generates a response, relevant information is retrieved from an external knowledge source and incorporated into the prompt as context. Because the model's parameters themselves are not modified, knowledge updates are completed entirely through operations on the database side.

Basic processing flow

Convert the user's query into an embedding
Perform a similarity search against a vector database to retrieve relevant chunks
Insert the retrieved documents into the system prompt or context window
The LLM generates a response based on the augmented context
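The first three steps of the flow can be sketched with the standard library alone. This is a minimal sketch, assuming a bag-of-words "embedding" and an in-memory list in place of a real embedding model and vector database; the sample chunks and function names are hypothetical, and step 4 (the LLM call) is left out.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: word counts (a real system would call an embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Steps 1 and 2: embed the query and return the k most similar chunks."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, retrieved):
    """Step 3: insert the retrieved chunks into the prompt as context."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

chunks = [
    "Travel expenses must be submitted within 30 days of the trip.",
    "The cafeteria is open from 11:00 to 14:00 on weekdays.",
    "Remote work requires manager approval in advance.",
]
query = "When must travel expenses be submitted?"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)  # Step 4 would pass this prompt to the LLM
```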

This mechanism allows information the model did not know at training time to be reflected in its responses, provided the relevant documents have been registered. For example, when targeting internal regulations or product manuals that are revised monthly, simply updating the documents allows the latest information to be reflected without retraining the model.

Key techniques for improving retrieval accuracy

Hybrid search: Combining vector search with BM25 (keyword search) via RRF (Reciprocal Rank Fusion) to achieve both recall and precision
Chunk size tuning: Granularity that is too coarse introduces irrelevant information, while granularity that is too fine causes loss of context
Agentic RAG: Handling complex queries through multi-step reasoning with repeated retrieval passes
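As an illustration of the hybrid search item, Reciprocal Rank Fusion merges two ranked lists by summing 1 / (k + rank) for each document. The sketch below assumes the vector and BM25 rankings have already been computed; the document IDs are hypothetical, and k = 60 is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = ["doc_a", "doc_b", "doc_c"]  # hypothetical vector search result
bm25_ranking = ["doc_b", "doc_d", "doc_a"]    # hypothetical keyword search result
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
# fused == ["doc_b", "doc_a", "doc_d", "doc_c"]
```

Documents that appear near the top of both lists rise to the top of the fused list, without any need to normalize the two scoring scales against each other.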

On the other hand, RAG is constrained by the amount of information that fits within the context window, and knowledge that is not surfaced by the search will not be used in the response. Furthermore, since grounding quality depends heavily on the design of the retrieval engine, the quality of index design and preprocessing directly determines overall output quality.

A Common Misconception: The Assumption That You Must Choose One or the Other

In discussions of fine-tuning versus RAG, the misconception that one must choose one or the other tends to spread easily. In practice, however, architectures that combine both methods are often effective.

This misconception arises from several factors:

The explanation that "fine-tuning = baking knowledge into the model" takes on a life of its own, making RAG appear unnecessary
The expectation that "RAG can answer anything by searching" leads to undervaluing customization of the model itself
Decisions are made based solely on a comparison of implementation costs, overlooking differences in accuracy and operational overhead

What is the reality? Fine-tuning excels at improving the consistency of output style and response format, but struggles to keep pace with the latest documents. RAG is strong at real-time knowledge retrieval, but if the model's own vocabulary and writing style are not aligned with the business domain, there are cases where it cannot effectively leverage the retrieved results.

In other words, the two are not competing approaches but rather complementary ones.

As a concrete example, consider a chatbot for the legal domain. A configuration in which fine-tuning is used to acquire the writing style and output format specific to legal documents, while the latest case law and regulatory amendments are referenced via RAG, makes it easier to achieve both accuracy and up-to-dateness simultaneously.

The next section organizes the evaluation criteria for deciding which approach to prioritize. Clarifying "what matters most" is the starting point for selecting the optimal method.

How to Define the Comparison Criteria

To fairly compare fine-tuning and RAG, it is important to define your evaluation criteria first, rather than asking "which one is better?"

The four main axes of comparison are cost, accuracy, update frequency, and security. Furthermore, by classifying use cases into "knowledge injection" and "style adaptation" types, it becomes clearer which approach is fundamentally more suitable.

The following subsections explain the definition of each axis and the specific criteria for making decisions.

Four Evaluation Axes: Cost, Accuracy, Update Frequency, and Security

When comparing fine-tuning and RAG, asking "which one is smarter" misses the point. The right question is: "Which is more rational given our own constraints?" To organize the decision, evaluation across the following four axes is recommended.

① Cost

Fine-tuning requires GPU resources for initial training, and retraining costs are incurred every time the model is updated
RAG primarily involves the cost of building and operating a vector database, with no need to retrain the model itself
Using PEFT methods such as LoRA or QLoRA for small-scale adaptation tends to significantly reduce training costs

② Accuracy

Fine-tuning "bakes" domain-specific writing styles, vocabulary, and reasoning patterns directly into the model, resulting in high output consistency
While RAG can reference the latest documents, the risk of hallucination remains if retrieval accuracy is low
For tasks that require cited sources, RAG's grounding capability works effectively to ensure accuracy

③ Update Frequency

For documents that change weekly or monthly—such as internal regulations or product manuals—RAG is overwhelmingly easier to manage
Fine-tuning tends to be unsuitable for use cases that require up-to-date information, as retraining cycles are long

④ Security

When sending confidential data to a cloud API is undesirable, a RAG configuration combining a local LLM with an on-premises vector database is effective
Confining a fine-tuned model to an offline environment is also an option, but it increases the operational burden of model updates

These four axes are not independent. For example, the higher the update frequency, the more fine-tuning costs balloon; and the stricter the security requirements, the more limited the options become. The next section maps these axes onto use case categories.

Use Case Classification: Knowledge Injection vs. Style Adaptation

Before choosing an LLM customization approach, it is important to first classify "what problem you are trying to solve" by the nature of the use case. Broadly speaking, use cases can be organized into two types: knowledge injection and style adaptation.

Knowledge injection refers to cases where you want the model to handle information it does not inherently possess.

Knowledge that must be brought in from outside, such as internal regulations, product specifications, or information on legal amendments
The latest information that emerged after the model's training data cutoff
Data specific to a particular company or industry (e.g., proprietary product code systems, internal glossaries)

RAG tends to be well-suited for this type. Documents are stored in a vector database and dynamically retrieved and referenced in response to queries, so additions and updates to information are reflected immediately. The approach of "baking in" knowledge through fine-tuning tends to incur high update costs, as retraining is required every time information becomes outdated.

Style adaptation refers to cases where you want to align the model's output format, writing style, or response patterns to a specific standard.

Generating output in a fixed format, such as medical reports or legal documents
Text generation aligned with a brand's tone of voice
Improving the naturalness of expression in a specific language (e.g., Thai, Japanese)

Fine-tuning is better suited for this type. Because behavioral patterns are learned directly into the model's weights, stable output can be obtained without needing to provide detailed instructions in the prompt every time.

In practice, these two categories often overlap. Requirements such as "accurately using internal terminology while responding in a specific format" call for a combined approach. The next section compares three options from the perspectives of cost, accuracy, and update frequency.

Comparison Table: Fine-Tuning vs. RAG vs. Combined Approach

Based on the evaluation axes defined in the previous section, this section provides a cross-cutting comparison of three options: fine-tuning, RAG, and a combined approach.

There are three main points of comparison.

Cost: The overall burden of training, inference, and operational expenses
Accuracy and hallucination: Tendencies in answer quality and the risk of misinformation
Ease of data updates: Operational costs for keeping information current

The details of each axis are explored in the subsections that follow. First, get a grasp of the overall picture, then map it against your own use case.

Cost Comparison: Training, Inference, and Operational Costs

Fine-tuning and RAG have fundamentally different cost structures. Breaking them down into three phases makes the decision easier.

Training costs (initial investment)

Fine-tuning: GPU time is the primary cost. Full fine-tuning tends to be expensive, but using PEFT methods such as LoRA or QLoRA reduces the number of trainable parameters, and cases of significantly lower costs have been reported
RAG: No training of the model itself is required. However, costs are incurred for generating document embeddings and for the initial construction of the vector database

Inference costs (runtime costs)

Fine-tuned models: Since there is no need to pack large volumes of documents into the context window, token consumption per request tends to be lower
RAG: Issuing a search query and inserting retrieved chunks into the context adds overhead, making it easy for the token count to increase per inference. This requires particular attention when referencing multiple documents

Operational costs (ongoing costs)

Fine-tuning: Retraining is required whenever knowledge becomes outdated. When handling frequently updated data, retraining costs can accumulate
RAG: The main ongoing costs are vector database updates and storage fees. Since knowledge can be updated simply by replacing documents, operational costs are relatively predictable

Summary of general cost tendencies

Phase	Fine-Tuning	RAG
Training costs	High (reducible with PEFT)	Low–Medium
Inference costs	Low	Medium–High
Operational costs	High with each update	Tends to be stable

RAG is advantageous when you want to minimize upfront investment and get started quickly. On the other hand, in production environments with very high inference volumes, the inference cost advantage of fine-tuning becomes significant. Note that the cost levels above reflect general tendencies rather than actual prices; it is recommended to check the official pages of each cloud provider for the latest pricing.
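The trade-off between fixed costs and per-request token costs can be made concrete with a small break-even calculation. This is a rough sketch: every figure is a hypothetical placeholder, and it assumes both options pay the same price per token, which may not hold for a hosted fine-tuned model.

```python
def monthly_cost(fixed, requests, tokens_per_request, price_per_1k_tokens):
    """Fixed monthly cost plus token-based inference cost."""
    return fixed + requests * tokens_per_request / 1000 * price_per_1k_tokens

def break_even_requests(ft_fixed, rag_fixed, ft_tokens, rag_tokens, price_per_1k_tokens):
    """Monthly request volume above which the fine-tuned option becomes cheaper."""
    extra_fixed = ft_fixed - rag_fixed
    saving_per_request = (rag_tokens - ft_tokens) / 1000 * price_per_1k_tokens
    if saving_per_request <= 0:
        return None  # fine-tuning never catches up under these assumptions
    return max(0.0, extra_fixed / saving_per_request)

# All figures below are hypothetical placeholders, not real prices.
FT_FIXED = 2000.0   # amortized retraining cost per month
RAG_FIXED = 500.0   # vector database operation per month
FT_TOKENS = 500     # tokens per request without retrieved context
RAG_TOKENS = 2500   # tokens per request with retrieved chunks
PRICE = 0.002       # price per 1,000 tokens

threshold = break_even_requests(FT_FIXED, RAG_FIXED, FT_TOKENS, RAG_TOKENS, PRICE)
```

With these placeholder numbers the break-even point is about 375,000 requests per month. Below that volume the RAG option is cheaper, and above it the lower token count of the fine-tuned option outweighs its retraining cost.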

Accuracy and Hallucination Rate Trend Comparison

Accuracy and hallucination rate are critical evaluation axes that directly inform the choice of approach. Fine-tuning and RAG each tend to produce errors through different mechanisms.

Accuracy Characteristics of Fine-Tuning

When training data is high-quality and sufficiently large, accuracy in adapting to output formats and specialized terminology for specific tasks tends to improve
Conversely, hallucinations are more likely to occur for recent information or unknown topics not covered in the training data, where the model may generate incorrect answers with high confidence
There is also a risk that biases and errors present in the training data become directly embedded in the model

Accuracy Characteristics of RAG

Because responses are generated based on retrieved documents as evidence, the source of information tends to be more transparent
However, when retrieval accuracy is low (i.e., when low-relevance chunks are retrieved), "grounding failures" are more likely to occur, resulting in responses based on incorrect context
Combining BM25 keyword search with vector search in a hybrid approach has been reported to improve retrieval accuracy in some cases

Summary of Hallucination Rate Tendencies

Perspective	Fine-Tuning	RAG
Accuracy within trained scope	Tends to be high	Depends on retrieval quality
Handling of recent information	Weak (requires retraining)	Strong
Cause of hallucination	Knowledge embedding errors	Retrieval errors / context misalignment

Neither approach can reduce hallucinations to zero. What matters is understanding the root causes of errors and addressing them through a combination of guardrails and Human-in-the-Loop (HITL) review.
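One simple form of guardrail is a grounding check that flags answer sentences with little support in the retrieved context and routes them to human review. The sketch below uses plain word overlap as a stand-in for a real grounding model; the 0.5 threshold, the function names, and the sample texts are assumptions for illustration.

```python
import re

def tokens(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def unsupported_sentences(answer, context_chunks, min_overlap=0.5):
    """Return answer sentences whose words are mostly absent from the retrieved context."""
    context_words = set()
    for chunk in context_chunks:
        context_words |= tokens(chunk)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = tokens(sentence)
        if words and len(words & context_words) / len(words) < min_overlap:
            flagged.append(sentence)
    return flagged

def needs_human_review(answer, context_chunks):
    """Route the answer to a reviewer if any sentence lacks support."""
    return bool(unsupported_sentences(answer, context_chunks))

context = ["Refunds are accepted within 14 days of purchase with a receipt."]
answer = "Refunds are accepted within 14 days of purchase. Shipping is always free worldwide."
flagged = unsupported_sentences(answer, context)
```

Here the second sentence has no support in the context and is flagged. Word overlap is a crude signal, so a check like this is better suited to deciding what a human should look at than to blocking answers automatically.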

Comparison of Data Update Ease and Immediacy

Ease of data updates is one of the evaluation axes where fine-tuning and RAG differ most significantly.

Fine-Tuning: High Update Costs and Limited Immediacy

Fine-tuning requires retraining every time new knowledge needs to be incorporated. The key challenges are as follows:

Retraining requires additional GPU resources and time
Lead time from data preparation and validation to deployment is lengthy
The higher the update frequency, the more operational costs tend to accumulate

For example, attempting to manage weekly-revised internal policies or product price lists through fine-tuning would require running the training pipeline with every update. This operational burden tends to become a practical barrier to maintaining information freshness.

RAG: Immediate Reflection Through Document Replacement Alone

Because RAG generates responses by retrieving documents stored in a vector database, updating information is completed simply by rebuilding the index.

New documents can be added or overwritten to immediately reflect the latest information
No retraining of the model itself is required
Lead time from update to reflected response can be significantly reduced

For requirements such as revisions to internal manuals or responses to regulatory changes—where the need is to "update today and use it tomorrow"—RAG is the appropriate choice.
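The update path can be sketched as an upsert keyed by document ID, so that a revised document replaces its old chunks without touching the model. The class below is a toy in-memory stand-in for a vector database; its name, its methods, and the naive keyword search are hypothetical.

```python
class DocumentIndex:
    """Toy in-memory index keyed by document ID (stand-in for a vector database)."""

    def __init__(self):
        self._chunks = {}  # doc_id -> list of chunk strings

    def upsert(self, doc_id, text, chunk_size=40):
        """Add a document, or replace every chunk of an existing one."""
        self._chunks[doc_id] = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    def delete(self, doc_id):
        """Remove a document and all of its chunks."""
        self._chunks.pop(doc_id, None)

    def search(self, keyword):
        """Naive keyword match; a real index would run similarity search."""
        return [(doc_id, c) for doc_id, chunks in self._chunks.items()
                for c in chunks if keyword.lower() in c.lower()]

index = DocumentIndex()
index.upsert("price-list", "Plan A costs 100 USD per month.")
index.upsert("price-list", "Plan A costs 120 USD per month.")  # revised version replaces the old one
hits = index.search("Plan A")
```

Keying chunks by document ID matters in practice: if a revision were only appended, the outdated chunks would remain searchable and could surface next to the new ones.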

Considerations When Combining Both Approaches

An architecture that uses fine-tuning to solidify output style and understanding of industry-specific terminology, while supplementing frequently changing information with RAG, tends to offer a well-balanced design in terms of update costs and accuracy. The next section takes a deeper look at use cases where fine-tuning alone delivers particularly strong results.

Which Use Cases Are Best Suited for Fine-Tuning?

Fine-tuning demonstrates its true value in situations where you want to change the model's behavior itself. Unlike RAG, which injects knowledge from external sources, fine-tuning directly updates the model's weights, making it well-suited for use cases that require consistency in output style and response format. It has been reported to be particularly effective in industries with high volumes of specialized terminology and in workflows where a specific tone must be maintained. The subsections below explore specific use cases and implementation approaches in depth.

When You Need Consistent Industry-Specific Writing Style and Output Format

Fine-tuning is most powerful in situations where you want to fix the style or format of outputs. While RAG improves accuracy in "what to answer," its ability to standardize "how to answer" is limited.

The following are cases where the advantages of fine-tuning have been reported:

Medical/Pharmaceutical: Outputs requiring adherence to specific structures and terminology conventions, such as medical record summaries and clinical trial reports
Legal: Contract reviews that demand a fixed format of "risk item → basis clause → proposed response"
Financial: Avoidance of definitive expressions in investment reports and automatic inclusion of disclaimer language
Manufacturing: Strict adherence to the three-part structure of "symptom, cause, and countermeasure" in incident reports

These requirements tend to be unstable when addressed through system prompts alone. The longer the prompt, the more context window it consumes, which also increases inference costs.

By embedding "industry writing rules" into the model's weights through fine-tuning, consistent formatting tends to be maintained even with shorter prompts. Reduced variability in outputs also tends to stabilize downstream quality checks and integration with RPA (robotic process automation).

However, there are caveats to be aware of:

If training data quality is low, there is a risk that incorrect stylistic patterns become fixed
Retraining costs are incurred every time output format specifications change
Knowledge currency cannot be guaranteed to the same degree as with RAG

For workflows where "consistency of format" is the top priority, it is rational to place fine-tuning at the top of the list of options.

Using PEFT and LoRA to Adapt the Model While Keeping Costs Down

Full fine-tuning updates all model parameters, making GPU costs and training time significant barriers. This is where PEFT (Parameter-Efficient Fine-Tuning) and its representative method, LoRA (Low-Rank Adaptation), come into focus.

How LoRA Works and Its Advantages

LoRA is a technique that freezes the original model parameters and learns only the differences by adding low-rank matrices. Because the trainable parameters are limited to a small fraction of the total (typically a few percent or less, depending on the rank and on which layers are adapted), the following benefits emerge:

GPU memory required for training can be significantly reduced
Training time is shortened, making it easier to keep cloud costs down
Multiple LoRA adapters can be swapped in and out while retaining the original base model
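The mechanism can be written as y = W x + (alpha / r) * B (A x), where W is the frozen weight and only the small matrices A and B are trained. The sketch below uses plain Python lists and tiny hypothetical matrices purely for illustration; real implementations operate on tensors inside each adapted layer.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha):
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trainable."""
    r = len(A)                       # rank = number of rows in A
    base = matvec(W, x)              # frozen pre-trained path
    delta = matvec(B, matvec(A, x))  # low-rank update path
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

# Frozen 3x3 weight (identity here, purely for illustration).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.5, 0.5, 0.0]]               # r x d_in, with rank r = 1
B_init = [[0.0], [0.0], [0.0]]      # d_out x r, zero-initialized so training starts from the base model
B_trained = [[0.2], [0.0], [-0.2]]  # hypothetical values after training

x = [1.0, 2.0, 3.0]
y_before = lora_forward(W, A, B_init, x, alpha=1.0)     # identical to the base model output
y_after = lora_forward(W, A, B_trained, x, alpha=1.0)   # base output plus the learned difference
```

Because W is never modified, swapping adapters amounts to swapping the pair (A, B) while the base model stays loaded.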

Further Efficiency Gains with QLoRA

QLoRA is a technique that combines LoRA with quantization: the frozen base model is loaded at 4-bit precision, while the LoRA adapters are trained on top of it in higher precision. Cases have been reported where models with tens of billions of parameters can be adapted using a single consumer-grade GPU, making it a viable option for on-premises environments and local LLM deployments as well.
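A back-of-the-envelope calculation shows why 4-bit loading matters. The sketch below counts memory for the base model weights only; it ignores activations, adapter weights, optimizer state, and quantization overhead, and the 30-billion-parameter size is a hypothetical example.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params = 30e9  # hypothetical 30-billion-parameter model
fp16_gb = weight_memory_gb(params, 16)  # 60.0
int4_gb = weight_memory_gb(params, 4)   # 15.0
```

Going from 16-bit to 4-bit weights cuts this component by a factor of four, which is what can bring a model of this size within reach of a single GPU. Actual memory use during training is higher than the weight figure alone.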

Practical Considerations

Data volume guidelines: Effects tend to emerge from several hundred to several thousand high-quality training samples
Rank (r) configuration: A smaller r reduces resource requirements but lowers expressiveness, so adjustment based on task complexity is necessary
Overfitting risk: When data volume is small, evaluation on a validation set must not be skipped
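The effect of the rank setting can be estimated with simple arithmetic. For one weight matrix of shape d_out x d_in, full fine-tuning updates d_out * d_in values, while a rank-r adapter trains r * (d_in + d_out). The sketch below assumes a hypothetical 4096 x 4096 layer and covers a single adapted matrix, not a whole model.

```python
def lora_param_fraction(d_in, d_out, r):
    """Trainable LoRA parameters as a fraction of one full weight matrix."""
    full = d_in * d_out
    lora = r * (d_in + d_out)
    return lora / full

# Hypothetical 4096 x 4096 projection layer at several ranks.
fractions = {r: lora_param_fraction(4096, 4096, r) for r in (4, 8, 16, 64)}
# r = 8 gives 0.00390625 (about 0.39%); r = 64 gives 0.03125 (about 3.1%)
```

The trainable share grows linearly with r, which is why a smaller rank saves resources at the price of expressiveness. Across a whole model the share is usually lower still, since only some layers carry adapters.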

PEFT and LoRA serve as a practical entry point for organizations where full fine-tuning is cost-prohibitive, enabling style adaptation and the internalization of specialized terminology. A prudent approach is to first experiment at a proof-of-concept scale, confirm the balance between accuracy and cost, and then consider production deployment.

Which Use Cases Are Best Suited for RAG?

RAG demonstrates its true value in situations where "information freshness" and "transparency of evidence" are required. While fine-tuning changes the behavior of the model itself, RAG references external documents in real time, making it well-suited for operations where data changes frequently or where the source of answers must be explicitly stated. The following two use cases serve as the primary basis for organizing the criteria for choosing RAG.

Leveraging Frequently Updated Documents Such as Internal Policies and Product Manuals

RAG is particularly effective when dealing with frequently updated documents such as internal policies or product manuals. Fine-tuning incurs GPU costs and time with every retraining cycle, whereas RAG can reflect the latest information instantly simply by replacing the index in the vector database.

Key reasons RAG is a good fit

Easily accommodates documents whose content changes on a monthly or weekly basis, such as policy revisions, price changes, and product specification updates
Referenced chunks can be explicitly presented at the time of answer generation, making it easy for users to verify which document and page a given answer is based on
The base Foundation Model can be reused as-is, so no additional training costs arise even when documents from multiple departments are added

Practical usage patterns

In manufacturing environments, for example, product manual versions tend to be updated frequently. By simply splitting a new version of a PDF into chunks and re-registering them in the vector database, an AI chatbot can be kept ready to guide users through the latest procedures. In HR department policy management as well, adding revised employment regulation content to the index allows employee inquiries to be handled with up-to-date information immediately.

Caveats

That said, retrieval accuracy is influenced by chunk size and the quality of the embedding model. When document structure is complex, hybrid search (BM25 combined with dense vector retrieval) has been reported to improve accuracy in some cases. In the legal and compliance use cases covered in the next section, RAG's ability to explicitly cite sources plays an even more critical role.

Tasks Requiring Cited Sources and References: Legal, Medical, and Compliance

In the legal, medical, and compliance domains, transparency of evidence—demonstrating why a given answer is correct—is indispensable. RAG has a structural advantage in meeting this requirement.

Why RAG excels at citation and source management

The original source documents referenced during answer generation can be presented directly to the user
It is easy to specify which article of which regulation an answer is based on, making it useful for audit responses as well
When documents are updated, re-indexing the revised documents in the vector database is sufficient for answers to follow suit

Taking the legal department as an example, in contract review and internal regulation inquiries, it becomes difficult to adopt AI in practice if the basis for AI-generated text cannot be verified on the spot. With RAG, retrieved chunks can be attached directly as citations, which tends to significantly reduce the effort required for staff to go back and check the original text.
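Attaching retrieved chunks as citations can be sketched by keeping source metadata next to each chunk and numbering the sources in the prompt. The `Chunk` fields, the prompt wording, and the sample regulation text below are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc: str      # source document name
    section: str  # article, clause, or page reference
    text: str     # retrieved passage

def build_cited_prompt(question, chunks):
    """Number each retrieved chunk so the model can cite it as [n]."""
    sources = "\n".join(
        f"[{i}] ({c.doc}, {c.section}) {c.text}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the numbered sources and cite them as [n].\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

def citation_list(chunks):
    """Reference list shown to the user next to the answer."""
    return [f"[{i}] {c.doc}, {c.section}" for i, c in enumerate(chunks, start=1)]

retrieved = [
    Chunk("Employment Regulations", "Article 12", "Paid leave requests must be filed 3 days in advance."),
    Chunk("Employment Regulations", "Article 15", "Unused paid leave carries over for one year."),
]
prompt = build_cited_prompt("How far in advance must paid leave be requested?", retrieved)
references = citation_list(retrieved)
```

Because the numbering comes from the retrieval result rather than from the model, each [n] in the answer can be traced back to a specific document and section during an audit.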

In the medical field, referencing clinical guidelines and package inserts makes it possible to provide evidence-backed information while suppressing the risk of hallucination. However, direct application to clinical decision-making requires separate specialized consideration, and it is recommended that AI be designed to serve solely as a supplementary aid for information retrieval.

For compliance use cases, periodically updating the index with regulatory documents such as the EU AI Act and the PDPA (Personal Data Protection Act) has been reported to reduce the cost of responding to legislative changes in some cases.

Division of roles with fine-tuning

Fine-tuning is effective for standardizing writing style and output format, but it is structurally ill-suited to answering the question "what is the basis for this answer?" For tasks that require transparency in citations and sourcing, designing around RAG as the core approach is the appropriate choice.

How to Design a Combined Approach

Fine-tuning and RAG are not an either/or choice; when combined, they can be designed to complement each other's weaknesses. An architecture in which fine-tuning is used to instill specialized writing styles and reasoning patterns in the model, while RAG dynamically supplies up-to-date information, is a strong option for achieving both accuracy and freshness. The subsections below explain specific architectural design approaches and how to combine them with Agentic RAG.

Architecture for Layering RAG on Top of a Fine-Tuned Model

Architectures that combine a fine-tuned model with RAG have attracted attention for their ability to mutually compensate for each approach's weaknesses. The core concept is a division of responsibilities: "use fine-tuning to solidify model behavior, and rely on RAG to ensure knowledge freshness."

Basic architectural structure

Fine-tuning layer: Trains the model on industry-specific tone, output format, and handling of specialized terminology
RAG layer: Dynamically retrieves the latest internal policies and product information from a vector database and injects it into the context window
System prompt layer: Serves as the bridge between the two, containing instructions on how to use the retrieved results

In this structure, fine-tuning handles how to answer, while RAG handles what to answer, resulting in a clear separation of responsibilities.

Design considerations

When layering RAG on top of a fine-tuned model, cases can arise where retrieved results conflict with the model's trained knowledge. In such situations, explicitly stating in the system prompt that "retrieved results take priority" tends to reduce the risk of hallucination.

Chunk size design is also important. When short chunks are passed to a model that has been fine-tuned for long-form output, context can be severed, and degraded accuracy has been reported in some cases. It is recommended to adjust chunk size to match the model's output style.
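Chunk size and overlap are usually the first two knobs to adjust. The sketch below splits by character count with a configurable overlap so that neighboring chunks share some context; the function name and parameters are hypothetical, and production pipelines often split on sentence or section boundaries instead.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into chunks of at most chunk_size characters, sharing overlap characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks

short_chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
long_chunks = chunk_text("abcdefghij", chunk_size=8, overlap=2)
```

Running the same documents through several (chunk_size, overlap) settings and comparing answer quality on a fixed set of questions is one way to tune these values for a given model.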

At the PoC stage, it is cost-effective to first verify accuracy with a base model + RAG, and only then add fine-tuning via PEFT or LoRA if output quality remains insufficient.

Combining with Dynamic Retrieval in Agentic RAG

Agentic RAG is an architecture in which an AI agent autonomously controls the retrieval step of RAG. Whereas traditional static RAG followed a fixed flow of "single retrieval → answer generation," Agentic RAG has the agent dynamically repeat multiple rounds of retrieval, reasoning, and re-retrieval.

Combining a fine-tuned model with Agentic RAG creates the following division of responsibilities:

Fine-tuned model: Handles industry-specific writing style, output format, and technical terminology
Agent layer: Handles query decomposition, determining the order in which retrieval tools are called, and evaluating results
Vector database: Handles storage of up-to-date documents and similarity search

For example, in legal review workflows, it is possible to build a flow in which queries are decomposed clause by clause from a contract, an internal policy database and a case law database are searched sequentially, and the fine-tuned model then generates a response in the legal department's standard format.

The main benefits of this design are as follows:

Supports multi-step reasoning, which tends to improve answer accuracy for complex questions
Automatically triggers re-retrieval when search results are insufficient, making it easier to suppress hallucinations
When documents are updated, only the vector database needs to be updated—no redesign of the agent layer is required

On the other hand, it is worth noting that the cost of designing and testing agent orchestration increases. At the PoC stage, it is practical to start with static RAG and migrate to Agentic RAG only when the need to handle more complex queries arises.
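The retrieve, evaluate, and re-retrieve loop can be sketched as a small control function. In the sketch below, `search`, `is_sufficient`, and `rewrite` are hypothetical stubs; in a real system the last two would typically be LLM calls, and `search` would query the vector database.

```python
def agentic_retrieve(question, search, is_sufficient, rewrite, max_rounds=3):
    """Repeat retrieval until the evidence is judged sufficient or the round limit is hit."""
    query, evidence = question, []
    for round_no in range(1, max_rounds + 1):
        for hit in search(query):
            if hit not in evidence:
                evidence.append(hit)
        if is_sufficient(question, evidence):
            return evidence, round_no
        query = rewrite(question, evidence)  # plan the next retrieval step
    return evidence, max_rounds

# Hypothetical stand-ins for two knowledge sources.
CORPUS = {
    "termination clause": ["Policy 4.2: termination requires 30 days notice."],
    "termination case law": ["Case 2021-17: notice shorter than 30 days was held invalid."],
}

def search(query):
    return CORPUS.get(query, [])

def is_sufficient(question, evidence):
    return len(evidence) >= 2  # toy rule: require both policy and case law

def rewrite(question, evidence):
    return "termination case law"  # toy follow-up query

evidence, rounds = agentic_retrieve("termination clause", search, is_sufficient, rewrite)
```

The `max_rounds` limit is the part most worth keeping in any real design, since it bounds both latency and cost when the agent keeps judging its evidence insufficient.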

Decision Flowchart for Choosing the Best Option for Your Organization

Building on the comparisons made so far, this section organizes the decision criteria for quickly narrowing down the approach that best fits your organization's situation.

The main source of confusion in making a selection is that three variables (budget, data volume, and update frequency) are all intertwined at the same time. The following subsections walk through a three-step flow for checking each of these in order, along with additional considerations for multilingual environments.

A Three-Step Decision Framework Based on Budget, Data Volume, and Update Frequency

When you are unsure which approach to choose, working through the following three steps can help clarify your thinking.

Step 1: Assess your budget

Estimate initial investment and ongoing operational costs separately, including GPU cloud costs and API usage fees.

Fine-tuning incurs a certain amount of GPU cost during training, but inference costs are comparable to those of a standard LLM
For RAG, the main running costs are vector database maintenance and retrieval API fees
If budget is limited, consider using PEFT methods such as LoRA or QLoRA to reduce training costs

Step 2: Assess the volume of available data

Evaluate both the "quantity" and "quality" of the data you have on hand.

If you can secure several hundred to several thousand or more high-quality supervised training examples, fine-tuning tends to be more effective
If your primary assets are existing documents that are difficult to structure, RAG can often be deployed more quickly
When data volume is still low, a rational sequence is to first run a PoC with RAG, confirm its effectiveness, and then consider fine-tuning

Step 3: Assess update frequency

Freshness requirements for information directly influence the choice of approach.

Workflows where documents are updated weekly or monthly—such as internal policies or product specifications—are well suited to RAG, since re-indexing alone is sufficient to keep up with changes
Conversely, output style and industry-specific expression patterns change infrequently, so once they are established through fine-tuning, it is easier to maintain consistent quality
When update frequency is high and answers must also cite their sources, a combined approach becomes a practical option
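
The three steps above can be condensed into a small decision helper. This is a minimal sketch: the function name, the threshold of 500 labeled examples, and the definition of "frequent" as 12 or more updates per year are assumptions chosen to mirror the guidance in this section, not fixed rules.

```python
# Assumed thresholds; tune them to your own situation.
MIN_LABELED_EXAMPLES = 500   # "several hundred" high-quality examples
FREQUENT_UPDATES_PER_YEAR = 12  # monthly or more often


def recommend_approach(
    labeled_examples: int,
    updates_per_year: int,
    needs_citations: bool,
    budget_limited: bool = False,
) -> str:
    """Return a starting recommendation based on the three-step flow."""
    frequent = updates_per_year >= FREQUENT_UPDATES_PER_YEAR

    # Step 2: with little labeled data, start with a RAG PoC.
    if labeled_examples < MIN_LABELED_EXAMPLES:
        return "rag"

    # Step 3: frequent updates plus a citation requirement -> combine both.
    if frequent and needs_citations:
        return "combined"
    if frequent:
        return "rag"

    # Step 1: a limited budget points to PEFT rather than full fine-tuning.
    if budget_limited:
        return "fine-tuning with PEFT (LoRA/QLoRA)"
    return "fine-tuning"


print(recommend_approach(50, 52, needs_citations=True))
print(recommend_approach(2000, 12, needs_citations=True))
print(recommend_approach(2000, 1, needs_citations=False, budget_limited=True))
```

The output is only a starting point for discussion; borderline cases still call for a PoC.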

If the decision remains difficult after going through all three steps, also refer to the additional considerations for multilingual environments covered in the next section.

Additional Considerations for Multilingual Environments Including Thai and Japanese

In environments that handle both Thai and Japanese simultaneously, it is necessary to consider language-specific technical challenges in addition to the straightforward fine-tuning vs. RAG choice.

Tokenizer issues

Many BPE tokenizers are designed with English as the baseline, and Thai and Japanese tend to consume several times more tokens per character than English. Because this directly affects cost estimates, it is important to measure the actual token counts for each language in advance.
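
One way to measure this is to compute tokens per character on representative text in each language. The sketch below accepts any tokenizer as a callable. The default `utf8_byte_tokens` is only a stand-in that emits one token per UTF-8 byte, which is roughly the worst case for a byte-level BPE tokenizer. For real estimates, pass in the tokenizer of the model you plan to use. The function names and sample sentences are illustrative assumptions.

```python
from typing import Callable, Sequence


def utf8_byte_tokens(text: str) -> Sequence[int]:
    """Stand-in tokenizer: one token per UTF-8 byte (assumed worst case)."""
    return list(text.encode("utf-8"))


def tokens_per_char(
    text: str,
    tokenize: Callable[[str], Sequence] = utf8_byte_tokens,
) -> float:
    """Average number of tokens consumed per character of text."""
    if not text:
        raise ValueError("text must not be empty")
    return len(tokenize(text)) / len(text)


# Illustrative sample sentences; use your own production text instead.
samples = {
    "en": "Please reset my password.",
    "ja": "パスワードを再設定してください。",
    "th": "กรุณารีเซ็ตรหัสผ่านของฉัน",
}

for lang, text in samples.items():
    print(lang, round(tokens_per_char(text), 2))
```

Multiplying the measured ratio by expected monthly character volume per language gives a first-order token budget for cost estimates.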

Considerations for fine-tuning

If training data does not include a balanced number of samples for each language, the quality of one language tends to degrade significantly
Japanese has significant variation in honorific levels and writing style; if style unification is a goal, fine-tuning tends to be effective

Considerations for RAG design

Because Thai has no spaces between words, setting chunk boundaries is difficult, and chunk size design may require dedicated logic
The multilingual quality of embedding models varies considerably from model to model. To ensure semantic search accuracy in Thai, it is advisable to select a model with strong multilingual NLP support and verify accuracy through empirical testing
When adopting hybrid search (BM25 + vector search), always confirm that the tokenizer or morphological analyzer feeding BM25 can segment Thai and Japanese text
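
The Thai chunk-boundary issue noted above can be partly addressed without a word segmenter. The sketch below splits text into overlapping character windows and shifts each boundary so that a chunk never begins with a combining mark such as a Thai tone mark or an above or below vowel. The function name and default sizes are assumptions. It does not find word boundaries, so a dedicated Thai or Japanese segmentation library remains the better option where one is available.

```python
import unicodedata


def _is_mark(ch: str) -> bool:
    """True for combining marks (Unicode general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def chunk_unspaced_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping character windows.

    Boundaries are shifted forward so that no chunk created by a split
    begins with a combining mark detached from its base character.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")

    chunks = []
    start, n = 0, len(text)
    while start < n:
        end = min(start + size, n)
        # Keep trailing combining marks with the base character before them.
        while end < n and _is_mark(text[end]):
            end += 1
        chunks.append(text[start:end])
        if end >= n:
            break
        # Step back by the overlap, then skip forward past any marks.
        start = end - overlap
        while start < end and _is_mark(text[start]):
            start += 1
    return chunks


thai_text = "น้ำท่วมกรุงเทพ" * 10
for chunk in chunk_unspaced_text(thai_text, size=10, overlap=3)[:3]:
    print(chunk)
```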

Practical decision criteria

When high-quality processing of both Thai and Japanese is required, starting with a base model that has strong multilingual capabilities and building RAG on top of it tends to be the more practical choice, as it avoids the cost of balancing language quality during fine-tuning.

Frequently Asked Questions

When considering the adoption of fine-tuning and RAG, the same questions tend to come up repeatedly from practitioners. This section addresses the two most frequently asked topics: "how much data is needed" and "whether a combination with a local LLM is feasible." Read through this section as a final check in your selection process, comparing the points raised against your own organization's situation.

How Much Data Is Required for Fine-Tuning?

Many people give up on fine-tuning by assuming they don't have enough data, but in reality, the required amount varies significantly depending on the method and model scale.

In the case of full fine-tuning

A common rule of thumb is thousands to tens of thousands of high-quality training samples
The less data available, the higher the risk of overfitting, which tends to compromise generalizability
Larger models require more data, so the tradeoff with GPU costs must be carefully considered

When using PEFT / LoRA

There are reported cases where even a few hundred to a few thousand samples have yielded meaningful results
Because LoRA freezes the base model's weights and trains only small low-rank adapter matrices, the number of trainable parameters is far smaller, which tends to suppress overfitting even with limited data
Using QLoRA further reduces memory consumption, making it easier to experiment on a local GPU
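
To make the parameter savings concrete, the sketch below counts trainable parameters for a single weight matrix under the standard LoRA formulation, in which a frozen matrix of shape d_out x d_in is augmented by two low-rank factors of shapes d_out x r and r x d_in. The layer size of 4096 and rank of 8 are illustrative assumptions, not values taken from any particular model.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole weight matrix is trained."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: (d_out x r) and (r x d_in)."""
    return rank * (d_in + d_out)


# Illustrative sizes (assumed): one 4096 x 4096 projection, rank 8.
d, r = 4096, 8
full = full_params(d, d)
lora = lora_trainable_params(d, d, r)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {r}):   {lora:,} trainable parameters")
print(f"ratio:            {lora / full:.4%}")
```

Because the trainable count grows linearly with the rank, raising or lowering `r` is the main lever for trading adapter capacity against memory.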

It is also important not to overlook the fact that data quality takes priority over data quantity. It is not uncommon for 500 carefully labeled samples to outperform 10,000 noisy ones.

On the other hand, when data is still under 100 samples, it is practical to prioritize exploring RAG first. Since knowledge can be referenced immediately simply by storing documents in a vector database, value can be delivered early while keeping data collection costs low.

A phased approach, starting with a RAG PoC on a small dataset and moving to fine-tuning only if accuracy fails to meet requirements, is an effective way to keep risk low.

Can Fine-Tuning and RAG Be Combined with a Local LLM?

To put it plainly, combining fine-tuning and RAG is entirely achievable even with a local LLM. Because it does not rely on cloud APIs, it is a particularly compelling option for organizations that do not want to send confidential data to external services.

Key configuration examples for local deployment

Base model: Open-weight models such as Llama or Mistral, fine-tuned with QLoRA
Vector database: Chroma or Weaviate deployed on-premises
Embedding model: Locally running models such as BGE or E5 for document vectorization
Inference server: Ollama or vLLM to expose an API endpoint

This configuration enables a fully self-contained RAG pipeline within an internal network.
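
The way these components fit together can be sketched with in-process stand-ins. In the example below, `InMemoryIndex` takes the place of the vector database, `toy_embed` the embedding model, and `echo_generate` the inference server. All three are hypothetical placeholders that only show the data flow, so a real deployment would swap in the actual clients for each component.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class InMemoryIndex:
    """Stand-in for an on-premises vector database."""

    def __init__(self, embed: Callable[[str], Vector]):
        self.embed = embed
        self.items: list[tuple[str, Vector]] = []

    def add(self, doc: str) -> None:
        self.items.append((doc, self.embed(doc)))

    def search(self, query: str, k: int = 2) -> list[str]:
        q = self.embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [doc for doc, _ in ranked[:k]]


def answer(query: str, index: InMemoryIndex,
           generate: Callable[[str], str], k: int = 2) -> str:
    """Retrieve context, build a source-numbered prompt, and call the model."""
    sources = index.search(query, k)
    numbered = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(sources))
    prompt = f"Answer using only the sources below.\n{numbered}\nQuestion: {query}"
    return generate(prompt)


# Toy components for illustration only (assumed, not real models).
VOCAB = ("vacation", "expense", "password")


def toy_embed(text: str) -> Vector:
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCAB]


def echo_generate(prompt: str) -> str:
    return prompt  # a real inference server would return generated text


index = InMemoryIndex(toy_embed)
index.add("Employees receive 20 vacation days per year.")
index.add("Expense reports are due by the 5th of each month.")
index.add("Passwords must be rotated every 90 days.")

reply = answer("How many vacation days do I get?", index, echo_generate, k=1)
print(reply)
```

Keeping the embedding function and the generator as injected callables makes it straightforward to replace either one independently during a PoC.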

Key considerations

GPU memory constraints are significant, and models around 7B parameters are often the practical choice
If the language support of the fine-tuned model and the embedding model are not aligned, search accuracy for languages such as Japanese or Thai tends to degrade
Quantization can reduce model size, but the tradeoff with accuracy must be validated
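
The memory constraint can be estimated with simple arithmetic: weight memory is roughly the parameter count multiplied by the bytes per parameter. The sketch below covers weights only. It ignores the KV cache, activations, and the per-block metadata that quantized formats add, so treat the results as lower bounds. The helper name is an assumption.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # a model of around 7B parameters
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: about {weight_memory_gb(n_params, bits):.1f} GB")
```

Comparing these figures with the memory of the available GPU shows quickly whether quantization is required before any accuracy validation begins.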

The cost reality

While there are no API fees as with cloud services, fixed costs for GPU procurement, power, and maintenance do apply. For high-frequency workloads, long-term cost advantages are more likely to materialize; conversely, for low-frequency use cases, cloud-based solutions have been reported to be more cost-effective.
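
The break-even point between the two cost structures can be estimated as the monthly request volume at which local fixed costs equal the cloud fees avoided. The sketch below assumes a flat per-request cloud price and a fixed monthly local cost. The function name and all figures are hypothetical placeholders.

```python
import math


def break_even_requests_per_month(
    local_fixed_monthly: float,
    cloud_cost_per_request: float,
    local_cost_per_request: float = 0.0,
) -> float:
    """Monthly request volume above which the local deployment is cheaper."""
    saving_per_request = cloud_cost_per_request - local_cost_per_request
    if saving_per_request <= 0:
        return math.inf  # local never pays off under these assumptions
    return local_fixed_monthly / saving_per_request


# Placeholder figures (assumed): amortized GPU, power, and maintenance per
# month versus an average cloud API fee per request.
volume = break_even_requests_per_month(
    local_fixed_monthly=1500.0,
    cloud_cost_per_request=0.01,
)
print(f"break-even at about {volume:,.0f} requests per month")
```

Workloads well above the break-even volume favor local deployment, while those well below it favor cloud APIs, which matches the pattern described above.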

The technical barrier for running a combined configuration on a local LLM is relatively high, but it is a strong option in environments with strict data sovereignty or security requirements. It is recommended to first validate at PoC scale and confirm the balance between operational costs and accuracy.

Conclusion: Choosing the Optimal Method Based on Your Goals and Resources

Fine-tuning and RAG are not a matter of one being superior to the other; they are complementary approaches, chosen according to the problem you are trying to solve. Let us revisit the comparisons made throughout this article and clarify the key decision criteria.

Cases where fine-tuning is appropriate:

You want to align output style and format with industry standards
You are working with a closed knowledge domain that requires no external retrieval at inference time
You have an environment where training costs can be reduced through PEFT methods such as LoRA

Cases where RAG is appropriate:

Documents such as internal policies or product manuals are updated frequently
In fields such as legal or medical, answers must cite their sources and provide clear grounds
You want to minimize upfront investment and start from the PoC stage

Cases where combining both is effective:

Both domain-specific expressiveness and real-time information retrieval are required
Business workflows require dynamic, multi-step reasoning with Agentic RAG

The starting point for decision-making is two axes: frequency of data updates and budget scale. If updates occur monthly or more frequently, RAG becomes the practical choice; if consistency in style and output format is the top priority, fine-tuning tends to have the edge.

Neither approach can reduce the risk of hallucination to zero. Designing guardrails and HITL (Human-in-the-Loop) mechanisms in parallel is what ultimately determines the quality of production deployments. The most realistic path to building results while managing risk is to first validate hypotheses through a small-scale PoC, then make scaling decisions based on empirical measurements.

Author & Supervisor

Yusuke Ishihara

Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).