Choosing Between Fine-Tuning and RAG: A Practical Guide Comparing Cost, Accuracy, and Use Cases

Choosing Between Fine-Tuning and RAG to Make AI Smarter with Internal Data: The Decision Significantly Impacts Both Cost and Accuracy
When leveraging Generative AI in business operations, fine-tuning and RAG (Retrieval-Augmented Generation) represent two leading approaches to the question of "how to incorporate internal data." The former directly updates the weights of an LLM (Large Language Model) to embed knowledge, while the latter retrieves external documents in real time to generate responses.
This article is intended for engineers, product managers, and IT system administrators considering AI adoption. It organizes both methods along four axes—cost, accuracy, update frequency, and security—to help you make the right choice for your organization's use case.
By the end, you will be able to make concrete judgments about "why combining the two approaches is effective" and "under which conditions each should be prioritized."
When applying LLMs in a business context, the question of "how to incorporate internal knowledge" is unavoidable. The two leading methods that answer this question are fine-tuning and RAG (Retrieval-Augmented Generation).
The two approaches differ fundamentally. Fine-tuning updates the model's weights directly, while RAG retrieves external documents and incorporates them into the response. Each has its own strengths and limitations, and there are reported cases where choosing the wrong method for a given use case has led to unexpected costs or degraded accuracy.
Let's start by reviewing the basic mechanisms of each method to build a foundation for making that choice.
How Fine-Tuning Works and Its Impact on the Model
Fine-tuning is a technique that adapts a pre-trained Foundation Model to a specific task by re-training its weight parameters on additional data. Because the model itself "internalizes" the knowledge, there is no need to reference external data at inference time.
Training workflow
- Prepare supervised training data (input–output pairs)
- Update parameters while minimizing the loss function
- Deploy the updated model for inference
The impact on the model manifests in two key ways. The first is output style fixation. The model becomes capable of consistently reproducing specific writing styles and formats, such as the standard phrasing of legal documents or the notation conventions of medical records. The second is domain vocabulary reinforcement. Through training, the model tends to handle industry-specific terminology and expressions not present in the BPE tokenizer (Byte-Pair Encoding Tokenizer) vocabulary more effectively.
However, there are also constraints to be aware of.
- High training cost: Full fine-tuning consumes large amounts of GPU (Graphics Processing Unit) resources
- Knowledge freshness degrades easily: Information that arises after training is not reflected in the model
- Hallucination risk: The model may attempt to fill in facts not present in the training data
PEFT (Parameter-Efficient Fine-Tuning) and LoRA have become widely adopted as cost-reduction techniques, updating only a subset of low-rank matrices rather than all parameters, which significantly reduces training costs. Nevertheless, since the model's weights are being modified, re-training and re-deployment remain necessary with each update. The key distinction from RAG, covered in the next section, lies precisely in this characteristic: knowledge becomes fixed inside the model.
How RAG Works and the Retrieval-Augmented Generation Flow
RAG (Retrieval-Augmented Generation) is a technique in which, before an LLM generates a response, relevant information is retrieved from an external knowledge source and incorporated into the prompt as context. Because the model's parameters themselves are not modified, knowledge updates are completed entirely through operations on the database side.
Basic processing flow
- Convert the user's query into an embedding
- Perform a similarity search against a vector database to retrieve relevant chunks
- Insert the retrieved documents into the system prompt or context window
- The LLM generates a response based on the augmented context
This mechanism allows information the model did not know at training time to be reflected in its responses, provided the relevant documents have been registered. For example, when targeting internal regulations or product manuals that are revised monthly, simply updating the documents allows the latest information to be reflected without retraining the model.
Key techniques for improving retrieval accuracy
- Hybrid search: Combining vector search with BM25 (keyword search) via RRF to achieve both recall and precision
- Chunk size tuning: Granularity that is too coarse introduces irrelevant information, while granularity that is too fine causes loss of context
- Agentic RAG: Handling complex queries through multi-step reasoning with repeated retrieval passes
On the other hand, RAG is constrained by the amount of information that fits within the context window, and knowledge that is not surfaced by the search will not be used in the response. Furthermore, since grounding quality depends heavily on the design of the retrieval engine, the quality of index design and preprocessing directly determines overall output quality.
A Common Misconception: The Assumption That You Must Choose One or the Other
In discussions of fine-tuning versus RAG, the misconception that one must choose one or the other tends to spread easily. In practice, however, architectures that combine both methods are often effective.
This misconception arises from several factors:
- The explanation that "fine-tuning = baking knowledge into the model" takes on a life of its own, making RAG appear unnecessary
- The expectation that "RAG can answer anything by searching" leads to undervaluing customization of the model itself
- Decisions are made based solely on a comparison of implementation costs, overlooking differences in accuracy and operational overhead
What is the reality? Fine-tuning excels at improving the consistency of output style and response format, but struggles to keep pace with the latest documents. RAG is strong at real-time knowledge retrieval, but if the model's own vocabulary and writing style are not aligned with the business domain, there are cases where it cannot effectively leverage the retrieved results.
In other words, the two are not competing approaches but rather complementary ones.
As a concrete example, consider a chatbot for the legal domain. A configuration in which fine-tuning is used to acquire the writing style and output format specific to legal documents, while the latest case law and regulatory amendments are referenced via RAG, makes it easier to achieve both accuracy and up-to-dateness simultaneously.
The next section organizes the evaluation criteria for deciding which approach to prioritize. Clarifying "what matters most" is the starting point for selecting the optimal method.
How to Define the Comparison Criteria
To fairly compare fine-tuning and RAG, it is important to define your evaluation criteria first, rather than asking "which one is better?"
The four main axes of comparison are cost, accuracy, update frequency, and security. Furthermore, by classifying use cases into "knowledge injection" and "style adaptation" types, it becomes clearer which approach is fundamentally more suitable.
The following H3 sections explain the definition of each axis and the specific criteria for making decisions.
Four Evaluation Axes: Cost, Accuracy, Update Frequency, and Security
When comparing fine-tuning and RAG, asking "which one is smarter" misses the point. The right question is: "Which is more rational given our own constraints?" To organize the decision, evaluation across the following four axes is recommended.
① Cost
- Fine-tuning requires GPU resources for initial training, and retraining costs are incurred every time the model is updated
- RAG primarily involves the cost of building and operating a vector database, with no need to retrain the model itself
- Using PEFT methods such as LoRA or QLoRA for small-scale adaptation tends to significantly reduce training costs
② Accuracy
- Fine-tuning "bakes" domain-specific writing styles, vocabulary, and reasoning patterns directly into the model, resulting in high output consistency
- While RAG can reference the latest documents, the risk of hallucination remains if retrieval accuracy is low
- For tasks that require cited sources, RAG's grounding capability works effectively to ensure accuracy
③ Update Frequency
- For documents that change weekly or monthly—such as internal regulations or product manuals—RAG is overwhelmingly easier to manage
- Fine-tuning tends to be unsuitable for use cases that require up-to-date information, as retraining cycles are long
④ Security
- When sending confidential data to a cloud API is undesirable, a RAG configuration combining a local LLM with an on-premises vector database is effective
- Confining a fine-tuned model to an offline environment is also an option, but it increases the operational burden of model updates
These four axes are not independent. For example, the higher the update frequency, the more fine-tuning costs balloon; and the stricter the security requirements, the more limited the options become. The next section maps these axes onto use case categories.
Use Case Classification: Knowledge Injection vs. Style Adaptation
Before choosing an LLM customization approach, it is important to first classify "what problem you are trying to solve" by the nature of the use case. Broadly speaking, use cases can be organized into two types: knowledge injection and style adaptation.
Knowledge injection refers to cases where you want the model to handle information it does not inherently possess.
- Knowledge that must be brought in from outside, such as internal regulations, product specifications, or information on legal amendments
- The latest information that emerged after the model's training data cutoff
- Data specific to a particular company or industry (e.g., proprietary product code systems, internal glossaries)
RAG tends to be well-suited for this type. Documents are stored in a vector database and dynamically retrieved and referenced in response to queries, so additions and updates to information are reflected immediately. The approach of "baking in" knowledge through fine-tuning tends to incur high update costs, as retraining is required every time information becomes outdated.
Style adaptation refers to cases where you want to align the model's output format, writing style, or response patterns to a specific standard.
- Generating output in a fixed format, such as medical reports or legal documents
- Text generation aligned with a brand's tone of voice
- Improving the naturalness of expression in a specific language (e.g., Thai, Japanese)
Fine-tuning is better suited for this type. Because behavioral patterns are learned directly into the model's weights, stable output can be obtained without needing to provide detailed instructions in the prompt every time.
In practice, these two categories often overlap. Requirements such as "accurately using internal terminology while responding in a specific format" call for a combined approach. The next section compares three options from the perspectives of cost, accuracy, and update frequency.
Comparison Table: Fine-Tuning vs. RAG vs. Combined Approach
Based on the evaluation axes defined in the previous section, this section provides a cross-cutting comparison of three options: fine-tuning, RAG, and a combined approach.
There are three main points of comparison.
- Cost: The overall burden of training, inference, and operational expenses
- Accuracy and hallucination: Tendencies in answer quality and the risk of misinformation
- Ease of data updates: Operational costs for keeping information current
The details of each axis are explored in the H3 sections that follow. First, get a grasp of the overall picture, then map it against your own use case.
Cost Comparison: Training, Inference, and Operational Costs
Fine-tuning and RAG have fundamentally different cost structures. Breaking them down into three phases makes the decision easier.
Training costs (initial investment)
- Fine-tuning: GPU time is the primary cost. Full fine-tuning tends to be expensive, but using PEFT methods such as LoRA or QLoRA reduces the number of trainable parameters, and cases of significantly lower costs have been reported
- RAG: No training of the model itself is required. However, costs are incurred for generating document embeddings and for the initial construction of the vector database
Inference costs (runtime costs)
- Fine-tuned models: Since there is no need to pack large volumes of documents into the context window, token consumption per request tends to be lower
- RAG: Issuing a search query and inserting retrieved chunks into the context adds overhead, making it easy for the token count to increase per inference. This requires particular attention when referencing multiple documents
Operational costs (ongoing costs)
- Fine-tuning: Retraining is required whenever knowledge becomes outdated. When handling frequently updated data, retraining costs can accumulate
- RAG: The main ongoing costs are vector database updates and storage fees. Since knowledge can be updated simply by replacing documents, operational costs are relatively predictable
Summary of general cost tendencies
| Phase | Fine-Tuning | RAG |
|---|---|---|
| Training costs | High (reducible with PEFT) | Low–Medium |
| Inference costs | Low | Medium–High |
| Operational costs | High with each update | Tends to be stable |
RAG is advantageous when you want to minimize upfront investment and get started quickly. On the other hand, in production environments with very high inference volumes, the inference cost advantage of fine-tuning becomes significant. Note that the prices above reflect general tendencies at the time of writing; it is recommended to check the official pages of each cloud provider for the latest pricing.
Accuracy and Hallucination Rate Trend Comparison
Accuracy and hallucination rate are critical evaluation axes that directly inform the choice of approach. Fine-tuning and RAG each tend to produce errors through different mechanisms.
Accuracy Characteristics of Fine-Tuning
- When training data is high-quality and sufficiently large, accuracy in adapting to output formats and specialized terminology for specific tasks tends to improve
- Conversely, hallucinations are more likely to occur for recent information or unknown topics not covered in the training data, where the model may generate incorrect answers with high confidence
- There is also a risk that biases and errors present in the training data become directly embedded in the model
Accuracy Characteristics of RAG
- Because responses are generated based on retrieved documents as evidence, the source of information tends to be more transparent
- However, when retrieval accuracy is low (i.e., when low-relevance chunks are retrieved), "grounding failures" are more likely to occur, resulting in responses based on incorrect context
- Cases have been reported where combining BM25 with vector databases in a hybrid search approach can improve retrieval accuracy
Summary of Hallucination Rate Tendencies
| Perspective | Fine-Tuning | RAG |
|---|---|---|
| Accuracy within trained scope | Tends to be high | Depends on retrieval quality |
| Handling of recent information | Weak (requires retraining) | Strong |
| Cause of hallucination | Knowledge embedding errors | Retrieval errors / context misalignment |
Neither approach can reduce hallucinations to zero. What matters is understanding the root causes of errors and addressing them through a combination of guardrails and Human-in-the-Loop (HITL) review.
Comparison of Data Update Ease and Immediacy
Ease of data updates is one of the evaluation axes where fine-tuning and RAG differ most significantly.
Fine-Tuning: High Update Costs and Limited Immediacy
Fine-tuning requires retraining every time new knowledge needs to be incorporated. The key challenges are as follows:
- Retraining requires additional GPU resources and time
- Lead time from data preparation and validation to deployment is lengthy
- The higher the update frequency, the more operational costs tend to accumulate
For example, attempting to manage weekly-revised internal policies or product price lists through fine-tuning would require running the training pipeline with every update. This operational burden tends to become a practical barrier to maintaining information freshness.
RAG: Immediate Reflection Through Document Replacement Alone
Because RAG generates responses by retrieving documents stored in a vector database, updating information is completed simply by rebuilding the index.
- New documents can be added or overwritten to immediately reflect the latest information
- No retraining of the model itself is required
- Lead time from update to reflected response can be significantly reduced
For requirements such as revisions to internal manuals or responses to regulatory changes—where the need is to "update today and use it tomorrow"—RAG is the appropriate choice.
Considerations When Combining Both Approaches
An architecture that uses fine-tuning to solidify output style and understanding of industry-specific terminology, while supplementing frequently changing information with RAG, tends to offer a well-balanced design in terms of update costs and accuracy. The next section takes a deeper look at use cases where fine-tuning alone delivers particularly strong results.
Which Use Cases Are Best Suited for Fine-Tuning?
Fine-tuning demonstrates its true value in situations where you want to change the model's behavior itself. Unlike RAG, which injects knowledge from external sources, fine-tuning directly updates the model's weights, making it well-suited for use cases that require consistency in output style and response format. It has been particularly reported to be effective in industries with high volumes of specialized terminology and in workflows where a specific tone must be maintained. The H3 sections below explore specific use cases and implementation approaches in depth.
When You Need Consistent Industry-Specific Writing Style and Output Format
Fine-tuning is most powerful in situations where you want to fix the style or format of outputs. While RAG improves accuracy in "what to answer," its ability to standardize "how to answer" is limited.
The following are cases where the advantages of fine-tuning have been reported:
- Medical/Pharmaceutical: Outputs requiring adherence to specific structures and terminology conventions, such as medical record summaries and clinical trial reports
- Legal: Contract reviews that demand a fixed format of "risk item → basis clause → proposed response"
- Financial: Avoidance of definitive expressions in investment reports and automatic inclusion of disclaimer language
- Manufacturing: Strict adherence to the three-part structure of "symptom, cause, and countermeasure" in incident reports
These requirements tend to be unstable when addressed through system prompts alone. The longer the prompt, the more context window it consumes, which also increases inference costs.
By embedding "industry writing rules" into the model's weights through fine-tuning, consistent formatting tends to be maintained even with shorter prompts. Reduced variability in outputs also tends to stabilize downstream quality checks and integration with RPA.
However, there are caveats to be aware of:
- If training data quality is low, there is a risk that incorrect stylistic patterns become fixed
- Retraining costs are incurred every time output format specifications change
- Knowledge currency cannot be guaranteed to the same degree as with RAG
For workflows where "consistency of format" is the top priority, it is rational to place fine-tuning at the top of the list of options.
Using PEFT and LoRA to Adapt the Model While Keeping Costs Down
Full fine-tuning updates all model parameters, making GPU costs and training time significant barriers. This is where PEFT (Parameter-Efficient Fine-Tuning) and its representative method, LoRA (Low-Rank Adaptation), come into focus.
How LoRA Works and Its Advantages
LoRA is a technique that freezes the original model parameters and learns only the differences by adding low-rank matrices. Because the update target is limited to roughly 1–5% of the total parameters, the following benefits emerge:
- GPU memory required for training can be significantly reduced
- Training time is shortened, making it easier to keep cloud costs down
- Multiple LoRA adapters can be swapped in and out while retaining the original base model
Further Efficiency Gains with QLoRA
QLoRA is a technique that combines LoRA with quantization, enabling model loading and training at 4-bit precision. Cases have been reported where models with tens of billions of parameters can be adapted using a single consumer-grade GPU, making it a viable option for on-premises environments and local LLM deployments as well.
Practical Considerations
- Data volume guidelines: Effects tend to emerge from several hundred to several thousand high-quality training samples
- Rank (r) configuration: A smaller r reduces resource requirements but lowers expressiveness, so adjustment based on task complexity is necessary
- Overfitting risk: When data volume is small, evaluation on a validation set must not be skipped
PEFT and LoRA serve as a practical entry point for organizations where full fine-tuning is cost-prohibitive, enabling style adaptation and the internalization of specialized terminology. A prudent approach is to first experiment at a proof-of-concept scale, confirm the balance between accuracy and cost, and then consider production deployment.
Which Use Cases Are Best Suited for RAG?
RAG demonstrates its true value in situations where "information freshness" and "transparency of evidence" are required. While fine-tuning changes the behavior of the model itself, RAG references external documents in real time, making it well-suited for operations where data changes frequently or where the source of answers must be explicitly stated. The following two use cases serve as the primary basis for organizing the criteria for choosing RAG.
Leveraging Frequently Updated Documents Such as Internal Policies and Product Manuals
RAG is particularly effective when dealing with frequently updated documents such as internal policies or product manuals. Fine-tuning incurs GPU costs and time with every retraining cycle, whereas RAG can reflect the latest information instantly simply by replacing the index in the vector database.
Key reasons RAG is a good fit
- Easily accommodates documents whose content changes on a monthly or weekly basis, such as policy revisions, price changes, and product specification updates
- Referenced chunks can be explicitly presented at the time of answer generation, making it easy for users to verify which document and page a given answer is based on
- The base Foundation Model can be reused as-is, so no additional training costs arise even when documents from multiple departments are added
Practical usage patterns
In manufacturing environments, for example, product manual versions tend to be updated frequently. By simply splitting a new version of a PDF into chunks and re-registering them in the vector database, an AI chatbot can be kept ready to guide users through the latest procedures. In HR department policy management as well, adding revised employment regulation content to the index allows employee inquiries to be handled with up-to-date information immediately.
Caveats
That said, retrieval accuracy is influenced by chunk size and the quality of the embedding model. When document structure is complex, combining hybrid search (BM25 + Dense Model) has been reported to improve accuracy in some cases. In the legal and compliance use cases covered in the next section, this ability to explicitly cite sources plays an even more critical role.
Tasks Requiring Cited Sources and References: Legal, Medical, and Compliance
In the legal, medical, and compliance domains, transparency of evidence—demonstrating why a given answer is correct—is indispensable. RAG has a structural advantage in meeting this requirement.
Why RAG excels at citation and source management
- The original source documents referenced during answer generation can be presented directly to the user
- It is easy to specify which article of which regulation an answer is based on, making it useful for audit responses as well
- When documents are updated, simply replacing the vector database is sufficient for answers to follow suit
Taking the legal department as an example, in contract review and internal regulation inquiries, it becomes difficult to adopt AI in practice if the basis for AI-generated text cannot be verified on the spot. With RAG, retrieved chunks can be attached directly as citations, which tends to significantly reduce the effort required for staff to go back and check the original text.
In the medical field, referencing clinical guidelines and package inserts makes it possible to provide evidence-backed information while suppressing the risk of hallucination. However, direct application to clinical decision-making requires separate specialized consideration, and it is recommended that AI be designed to serve solely as a supplementary aid for information retrieval.
For compliance use cases, periodically updating the index with regulatory documents such as the EU AI Act and PDPA has been reported to reduce the cost of responding to legislative changes in some cases.
Division of roles with fine-tuning
Fine-tuning is effective for standardizing writing style and output format, but it is structurally ill-suited to answering the question "what is the basis for this answer?" For tasks that require transparency in citations and sourcing, designing around RAG as the core approach is the appropriate choice.
How to Design a Combined Approach
Fine-tuning and RAG are not an either/or choice—they can be designed to complement each other's weaknesses when combined. An architecture in which fine-tuning is used to instill specialized writing styles and reasoning patterns in the model, while RAG dynamically supplies up-to-date information, is a strong option for achieving both accuracy and freshness. The H3 sections below explain specific architectural design approaches and how to combine them with Agentic RAG.
Architecture for Layering RAG on Top of a Fine-Tuned Model
Architectures that combine a fine-tuned model with RAG have attracted attention for their ability to mutually compensate for each approach's weaknesses. The core concept is a division of responsibilities: "use fine-tuning to solidify model behavior, and rely on RAG to ensure knowledge freshness."
Basic architectural structure
- Fine-tuning layer: Trains the model on industry-specific tone, output format, and handling of specialized terminology
- RAG layer: Dynamically retrieves the latest internal policies and product information from a vector database and injects it into the context window
- System prompt layer: Serves as the bridge between the two, containing instructions on how to use the retrieved results
In this structure, fine-tuning handles how to answer, while RAG handles what to answer, resulting in a clear separation of responsibilities.
Design considerations
When layering RAG on top of a fine-tuned model, cases can arise where retrieved results conflict with the model's trained knowledge. In such situations, explicitly stating in the system prompt that "retrieved results take priority" tends to reduce the risk of hallucination.
Chunk size design is also important. When short chunks are passed to a model that has been fine-tuned for long-form output, context can be severed, and degraded accuracy has been reported in some cases. It is recommended to adjust chunk size to match the model's output style.
At the PoC stage, it is cost-effective to first verify accuracy with a base model + RAG, and only then add fine-tuning via PEFT or LoRA if output quality remains insufficient.
Combining with Dynamic Retrieval in Agentic RAG
Agentic RAG is an architecture in which an AI agent autonomously controls the retrieval step of RAG. Whereas traditional static RAG followed a fixed flow of "single retrieval → answer generation," Agentic RAG has the agent dynamically repeat multiple rounds of retrieval, reasoning, and re-retrieval.
Combining a fine-tuned model with Agentic RAG creates the following division of responsibilities:
- Fine-tuned model: Handles industry-specific writing style, output format, and technical terminology
- Agent layer: Handles query decomposition, determining the order in which retrieval tools are called, and evaluating results
- Vector database: Handles storage of up-to-date documents and similarity search
For example, in legal review workflows, it is possible to build a flow in which queries are decomposed clause by clause from a contract, an internal policy database and a case law database are searched sequentially, and the fine-tuned model then generates a response in the legal department's standard format.
The main benefits of this design are as follows:
- Supports multi-step reasoning, which tends to improve answer accuracy for complex questions
- Automatically triggers re-retrieval when search results are insufficient, making it easier to suppress hallucinations
- When documents are updated, only the vector database needs to be updated—no redesign of the agent layer is required
On the other hand, it is worth noting that the cost of designing and testing agent orchestration increases. At the PoC stage, it is practical to start with static RAG and migrate to Agentic RAG only when the need to handle more complex queries arises.
Decision Flowchart for Choosing the Best Option for Your Organization
Building on the comparisons made so far, this section organizes the decision criteria for quickly narrowing down the approach that best fits your organization's situation.
The main source of confusion in making a selection is that three variables—budget, data volume, and update frequency—are all intertwined at the same time. The following H3 sections walk through a three-step flow for checking each of these in order, along with additional considerations for multilingual environments.
A Three-Step Decision Framework Based on Budget, Data Volume, and Update Frequency
When you are unsure which approach to choose, working through the following three steps can help clarify your thinking.
Step 1: Assess your budget
Estimate initial investment and ongoing operational costs separately, including GPU cloud costs and API usage fees.
- Fine-tuning incurs a certain amount of GPU cost during training, but inference costs are comparable to those of a standard LLM
- For RAG, the main running costs are vector database maintenance and retrieval API fees
- If budget is limited, consider using PEFT methods such as LoRA or QLoRA to reduce training costs
Step 2: Assess the volume of available data
Evaluate both the "quantity" and "quality" of the data you have on hand.
- If you can secure several hundred to several thousand or more high-quality supervised training examples, fine-tuning tends to be more effective
- If your primary assets are existing documents that are difficult to structure, RAG can often be deployed more quickly
- When data volume is still low, a rational sequence is to first run a PoC with RAG, confirm its effectiveness, and then consider fine-tuning
Step 3: Assess update frequency
Freshness requirements for information directly influence the choice of approach.
- Workflows where documents are updated weekly or monthly—such as internal policies or product specifications—are well suited to RAG, since re-indexing alone is sufficient to keep up with changes
- Conversely, output style and industry-specific expression patterns change infrequently, so once they are established through fine-tuning, it is easier to maintain consistent quality
- When update frequency is "high" and "citation of sources is required," a combined approach becomes a practical option
If the decision remains difficult after going through all three steps, also refer to the additional considerations for multilingual environments covered in the next section.
Additional Considerations for Multilingual Environments Including Thai and Japanese
In environments that handle both Thai and Japanese simultaneously, it is necessary to consider language-specific technical challenges in addition to the straightforward fine-tuning vs. RAG choice.
Tokenizer issues
Many BPE tokenizers are designed with English as the baseline, and Thai and Japanese tend to consume several times more tokens per character than English. Because this directly affects cost estimates, it is important to measure the actual token counts for each language in advance.
Considerations for fine-tuning
- If training data does not include a balanced number of samples for each language, the quality of one language tends to degrade significantly
- Because Thai has no spaces between words, setting chunk boundaries is difficult, and RAG chunk size design may require dedicated logic
- Japanese has significant variation in honorific levels and writing style; if style unification is a goal, fine-tuning tends to be effective
Considerations for RAG design
- The multilingual quality of embedding models varies considerably from model to model. To ensure semantic search accuracy in Thai, it is advisable to select a model with strong multilingual NLP support and verify accuracy through empirical testing
- When adopting hybrid search (BM25 + vector search), always confirm that the morphological analyzer used by BM25 supports Thai and Japanese
Practical decision criteria
When high-quality processing of both Thai and Japanese is required, starting with a base model that has strong multilingual capabilities and building RAG on top of it tends to be the more practical choice, as it avoids the cost of balancing language quality during fine-tuning.
Frequently Asked Questions
When considering the adoption of fine-tuning and RAG, the same questions tend to come up repeatedly from practitioners. This section addresses the two most frequently asked topics: "how much data is needed" and "whether a combination with a local LLM is feasible." Read through this section as a final check in your selection process, comparing the points raised against your own organization's situation.
How Much Data Is Required for Fine-Tuning?
Many people give up on fine-tuning by assuming they don't have enough data, but in reality, the required amount varies significantly depending on the method and model scale.
In the case of full fine-tuning
- The general benchmark is typically thousands to tens of thousands of high-quality training samples
- The less data available, the higher the risk of overfitting, which tends to compromise generalizability
- Larger models require more data, so the tradeoff with GPU costs must be carefully considered
When using PEFT / LoRA
- There are reported cases where even a few hundred to a few thousand samples have yielded meaningful results
- Because LoRA updates only a portion of the model's weights, it tends to suppress overfitting even with limited data
- Using QLoRA further reduces memory consumption, making it easier to experiment on a local GPU
It is also important not to overlook the fact that data quality takes priority over data quantity. It is not uncommon for 500 carefully labeled samples to outperform 10,000 noisy ones.
On the other hand, when data is still under 100 samples, it is practical to prioritize exploring RAG first. Since knowledge can be referenced immediately simply by storing documents in a vector database, value can be delivered early while keeping data collection costs low.
A phased approach—starting with a PoC using a small dataset and transitioning to fine-tuning only if accuracy fails to meet requirements—is the most effective way to minimize risk.
Can Fine-Tuning and RAG Be Combined with a Local LLM?
To put it plainly, combining fine-tuning and RAG is entirely achievable even with a local LLM. Because it does not rely on cloud APIs, it is a particularly compelling option for organizations that do not want to send confidential data to external services.
Key configuration examples for local deployment
- Base model: Open-weight models such as Llama or Mistral, fine-tuned with QLoRA
- Vector database: Chroma or Weaviate deployed on-premises
- Embedding model: Locally running models such as BGE or E5 for document vectorization
- Inference server: Ollama or vLLM to expose an API endpoint
This configuration enables a fully self-contained RAG pipeline within an internal network.
Key considerations
- GPU memory constraints are significant, and models around 7B parameters are often the practical choice
- If the language support of the fine-tuned model and the embedding model are not aligned, search accuracy for languages such as Japanese or Thai tends to degrade
- Quantization can reduce model size, but the tradeoff with accuracy must be validated
The cost reality
While there are no API fees as with cloud services, fixed costs for GPU procurement, power, and maintenance do apply. For high-frequency workloads, long-term cost advantages are more likely to materialize; conversely, for low-frequency use cases, cloud-based solutions have been reported to be more cost-effective.
The technical barrier for running a combined configuration on a local LLM is relatively high, but it is a strong option in environments with strict data sovereignty or security requirements. It is recommended to first validate at PoC scale and confirm the balance between operational costs and accuracy.
Conclusion: Choosing the Optimal Method Based on Your Goals and Resources
Fine-tuning and RAG are not an either/or proposition where one is superior to the other—they are complementary approaches to be used based on "what problem you are trying to solve." Let us revisit the comparisons made throughout this article and clarify the key decision criteria.
Cases where fine-tuning is appropriate:
- You want to align output style and format with industry standards
- You are working with a closed knowledge domain that requires no external retrieval at inference time
- You have an environment where training costs can be reduced through PEFT or LoRA
Cases where RAG is appropriate:
- Documents such as internal policies or product manuals are updated frequently
- In fields such as legal or medical, answers must cite their sources and provide clear grounds
- You want to minimize upfront investment and start from the PoC stage
Cases where combining both is effective:
- Both domain-specific expressiveness and real-time information retrieval are required
- Business workflows that require dynamic, multi-step reasoning with Agentic RAG
The starting point for decision-making is two axes: frequency of data updates and budget scale. If updates occur monthly or more frequently, RAG becomes the practical choice; if consistency in style and output format is the top priority, fine-tuning tends to have the edge.
Neither approach can reduce the risk of hallucination to zero. Designing guardrails and HITL (Human-in-the-Loop) mechanisms in parallel is what ultimately determines the quality of production deployments. The most realistic path to building results while managing risk is to first validate hypotheses through a small-scale PoC, then make scaling decisions based on empirical measurements.
Author & Supervisor
Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).


