
PEFT (Parameter-Efficient Fine-Tuning) achieves performance comparable to full fine-tuning, which retrains the entire AI model, while reducing the number of trainable parameters by 99% or more.
This article is aimed at CTOs, VPoEs, and IT system owners considering the business application of AI/LLMs, and explains how PEFT works, its key methods, and the key points for investment decisions. By the end of the article, you will be equipped to select the optimal PEFT method for your organization and make an informed decision on adopting AI model customization.
PEFT (Parameter-Efficient Fine-Tuning) is a collective term for techniques that "freeze" the majority of parameters in a pre-trained AI model and train only a small number of additional parameters.
| Item | Full Fine-Tuning | PEFT |
|---|---|---|
| Training Target | All model parameters | A small number of added parameters (0.1–2% of the total) |
| Required GPU Memory | Tens to hundreds of GB | A few GB to tens of GB |
| Training Time | Days to weeks | Tens of minutes to hours |
| Model Storage Size | Tens of GB (all parameters) | A few MB to hundreds of MB (adapter only) |
| Risk of Catastrophic Forgetting | High | Low |
For example, in the LoRA example from the Hugging Face PEFT documentation, the trainable parameters come to just 0.19% of the total (approximately 2.36 million parameters), and the saved checkpoint is only around 19 MB, compared with tens of gigabytes for a full-model checkpoint (reference: Hugging Face PEFT Blog).
PEFT is similar to "teaching a new task to an expert who already has high capabilities." The expert's foundational abilities (pre-trained knowledge) remain intact, while only the incremental knowledge required for the new task is additionally learned. This allows for efficient customization while preventing "catastrophic forgetting," where foundational capabilities are lost.
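As a minimal code sketch of this "freeze and add" approach, here is what LoRA setup looks like with the Hugging Face `peft` library (the model name and hyperparameters mirror the library's quick-start examples; downloading the model requires network access):

```python
# Sketch: wrap a pre-trained model with LoRA via Hugging Face PEFT.
# The base model's weights stay frozen; only the small LoRA matrices are trained.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-large")  # example model

config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                # rank of the low-rank update matrices
    lora_alpha=32,      # scaling factor for the LoRA update
    lora_dropout=0.1,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()
# prints something like: trainable params: 2359296 || all params: ... || trainable%: 0.19
```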
Around 2023, the scaling of LLMs accelerated even further, making full fine-tuning an option that is "wanted but not feasible." Here are four reasons behind the rapid spread of PEFT.
Recent large language models (LLMs) have reached scales of 70B to 405B parameters. Full fine-tuning of these models requires an environment equipped with multiple A100 80GB GPUs, incurring cloud GPU costs on the order of millions of yen per month. With PEFT, practical customization is possible even on consumer-grade GPUs (such as the RTX 4090, with 24GB of VRAM).
The surge in GPU demand driven by the AI boom has caused cloud GPU prices to trend upward. Since PEFT significantly reduces the required computational resources, it directly translates to optimized GPU costs.
With full fine-tuning, there is a risk that the model will "forget" its pre-training knowledge in the process of adapting to a new task. Since PEFT freezes the original parameters, it allows you to add new capabilities while preserving existing ones.
Adapters (additional parameters) trained with PEFT are saved as files of just a few MB. By simply swapping task-specific adapters for a single base model, you can handle multiple tasks such as translation, summarization, and classification. This eliminates the need to maintain multiple full models, significantly reducing storage and deployment costs.
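The adapter-swapping workflow can be sketched with `peft`'s multi-adapter API (paths and adapter names below are illustrative placeholders):

```python
# Sketch: one frozen base model, multiple task adapters swapped at runtime.
# Assumes Hugging Face `transformers` and `peft`; paths/names are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model")  # shared base model
model = PeftModel.from_pretrained(
    base, "adapters/translation", adapter_name="translation"
)
model.load_adapter("adapters/summarization", adapter_name="summarization")

model.set_adapter("translation")    # route requests through the translation adapter
# ... run translation inference ...
model.set_adapter("summarization")  # switch tasks without reloading the base model
```

Because each adapter is only a few MB, keeping many of them on disk costs almost nothing compared with storing one full model per task.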
"Which PEFT method should I choose?" is the first wall you'll run into. Here, we summarize the major methods in a single comparison table, then present a selection flowchart.
| Method | Mechanism | Memory Efficiency | Performance | Ease of Implementation | Main Use Cases |
|---|---|---|---|---|---|
| LoRA | Adds low-rank matrices to weight matrices | ◎ | ◎ | ◎ | LLM, image generation, speech |
| QLoRA | LoRA + 4-bit quantization | ◎◎ | ◎ | ○ | Memory-constrained environments |
| Adapter | Inserts adapter modules into Transformer layers | ○ | ◎ | ○ | General NLP tasks |
| Prompt Tuning | Adds soft prompts to input | ◎ | ○ | ◎ | Text classification & generation |
| Prefix Tuning | Adds prefix vectors to each layer | ◎ | ○ | ○ | Text generation |
**Q1: What is the size of the base model?**
- 7B or less → LoRA (the standard choice)
- 7B–70B → QLoRA (memory reduction is important)
- 70B or more → QLoRA + DeepSpeed

**Q2: Can you modify the internal structure of the model?**
- Yes → LoRA / Adapter
- No (API only) → Prompt Tuning

**Q3: Do you want to switch between multiple tasks?**
- Yes → LoRA (easy to swap adapters)
- No → Any method will work
LoRA (Low-Rank Adaptation) is a method published by Microsoft Research in 2021 (ref: Hu et al., 2021), and is currently the most widely used PEFT technique.
The weight matrix W of a Transformer model is enormous, but task-specific changes are concentrated in its "low-rank" components. LoRA leverages this property by adding two small matrices A and B instead of directly updating the original weight matrix W.
Original computation: `y = W × x`

After applying LoRA: `y = W × x + (A × B) × x`
Since matrices A and B are each much smaller than the original matrix (depending on rank r), the number of trainable parameters is significantly reduced.
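To see the scale of the savings, here is the back-of-the-envelope arithmetic for a single weight matrix (the 4096×4096 size is illustrative, typical of a 7B-class Transformer's attention projections):

```python
# Trainable parameters for one weight matrix: full update vs. LoRA update.
def lora_trainable(d_out: int, d_in: int, r: int) -> int:
    """LoRA replaces the d_out x d_in update with A (d_out x r) and B (r x d_in)."""
    return d_out * r + r * d_in

d = 4096                  # illustrative hidden size
full = d * d              # updating W directly: 16,777,216 parameters
for r in (4, 8, 16, 32):
    lora = lora_trainable(d, d, r)
    print(f"r={r:>2}: {lora:>9,} params ({100 * lora / full:.2f}% of full)")
```

Even at rank 32, the LoRA update for this matrix is under 2% of the full update's parameter count.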
| Rank Value | Number of Parameters | Use Case |
|---|---|---|
| r = 4–8 | Minimal | Simple tasks (text classification, etc.) |
| r = 16–32 | Standard | General customization |
| r = 64–128 | Large | Complex tasks (high-quality image generation, etc.) |
As the rank increases, expressiveness improves, but the risk of overfitting also rises. In most cases, a range of r = 8–32 provides sufficient performance.
QLoRA is a method that combines LoRA with 4-bit quantization. By applying LoRA while the base model's weights are compressed from 16- or 32-bit down to 4-bit, it can reduce VRAM usage by a further 50–75% compared with standard LoRA.
| Item | LoRA | QLoRA |
|---|---|---|
| Base model precision | 16-bit / 32-bit | 4-bit |
| Additional parameter precision | 16-bit | 16-bit |
| Required VRAM for a 6.7B parameter model | ~16 GB | ~6 GB |
| Training speed | Fast | Slightly slower (quantization overhead) |
| Performance | Baseline | Nearly equivalent to LoRA |
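A QLoRA setup can be sketched as follows (assumes the `transformers`, `bitsandbytes`, and `peft` libraries; the model name is a placeholder):

```python
# Sketch: QLoRA — load the base model in 4-bit, then attach LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as used in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # computation still runs in 16-bit
)
base = AutoModelForCausalLM.from_pretrained(
    "base-model-7b", quantization_config=bnb_config
)
base = prepare_model_for_kbit_training(base)  # enables gradient checkpointing, etc.

model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```

Note that only the frozen base weights are 4-bit; the LoRA matrices themselves stay in 16-bit, which is why output quality remains close to standard LoRA.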
PEFT is easy to get started with, but that very ease comes with its own pitfalls. Here are four common patterns, including failures we actually encountered.
Problem: Excessively increasing the rank in pursuit of expressiveness leads to overfitting to the training data and degraded generalization performance.
Workaround: Start with r = 8–16, then adjust incrementally while monitoring performance on validation data. Avoid increasing the number of epochs too much, and compare performance at intermediate checkpoints.
Problem: When performing PEFT with a small amount of training data, data quality directly impacts the results. Noisy or biased data will degrade performance.
Workaround: Prioritize data quality over data quantity. 100 high-quality data points often outperform 1,000 low-quality ones.
Problem: Applying PEFT to a base model that is unsuitable for the task will not yield sufficient performance. PEFT is a technique for "fine-tuning" a model's existing capabilities, not for adding capabilities that do not exist.
Workaround: Verify in advance that the base model has the foundational capabilities required for the task. For Japanese-language tasks, select a Japanese-compatible model; for coding tasks, select a code-specialized model.
Problem: Depending on the GPU architecture, training may become unstable with certain numerical precisions (e.g., fp16).
Workaround: Select a precision setting appropriate for the GPU architecture being used. For example, RTX 40-series (Ada Lovelace) GPUs natively support bf16, which may provide more stable training than fp16.
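One simple way to encode this rule in a training script (the helper function and its policy are our own illustration, not a library API):

```python
def choose_precision(supports_bf16: bool) -> str:
    """Prefer bf16 where the GPU supports it natively (e.g. RTX 40 series,
    Ampere and newer); fall back to fp16 otherwise."""
    return "bf16" if supports_bf16 else "fp16"

# With PyTorch installed, the capability check would be:
#   import torch
#   precision = choose_precision(torch.cuda.is_bf16_supported())
print(choose_precision(True))   # -> bf16
print(choose_precision(False))  # -> fp16
```

If training still diverges at the chosen precision, dropping back to fp32 trades speed and memory for numerical stability.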
PEFT is particularly effective in industries that have their own proprietary data and terminology. Here, we explore specific scenarios for three representative industries. Points that are common to other industries are summarized at the end in "Cross-Industry Points."
In manufacturing environments, product images and equipment data often contain company-specific patterns that general-purpose models frequently fail to handle adequately.
| Use Case | PEFT Application Method | Expected Benefits |
|---|---|---|
| Automated visual inspection | Train defect patterns of in-house products on an image classification model using LoRA | Improved inspection accuracy, reduced workload for inspectors |
| Predictive detection of equipment anomalies | Adapt a time-series data model to sensor data from in-house equipment | Reduction in unplanned downtime |
| Automatic summarization of technical documents | Train an LLM on internal technical terminology to auto-generate meeting minutes and reports | Reduced man-hours for document creation |
In the manufacturing industry, products and equipment differ from factory to factory, making it efficient to share a base model while creating factory-specific LoRA adapters for each site.
The medical field contains many specialized terms, making it an area where general-purpose LLMs often struggle to achieve sufficient accuracy. PEFT enables low-cost, medicine-specific customization.
| Use Case | PEFT Application Method | Expected Effect |
|---|---|---|
| Summarizing medical records and referral letters | Train LLMs on medical terminology and abbreviations via PEFT | Improved summarization accuracy, reduced physician workload |
| Assisted classification of medical images | Adapt image classification models to facility-specific imaging conditions | Improved screening accuracy |
| Support for multilingual medical interpretation | Incorporate medical terminology dictionaries into translation models via PEFT | Improved communication in multilingual environments across Southeast Asia |
Note: Medical AI may be subject to regulations in each country (Pharmaceutical and Medical Device Act, FDA, etc.). When deploying PEFT-created models in clinical settings, be sure to verify the regulatory requirements of the relevant authorities.
In the financial industry, there is a constraint that confidential data cannot be shared externally, making PEFT a highly compatible approach as it operates entirely within an on-premises environment.
| Use Case | PEFT Application Method | Expected Benefits |
|---|---|---|
| Fraud transaction detection | Adapt classification models to in-house transaction patterns | Reduction in false positive rates, improvement in detection accuracy |
| Automated reading of screening documents | Train LLMs on contract and application form formats using PEFT | Reduction in screening lead time |
| Automated regulatory report generation | Adapt LLMs to authority reporting formats and terminology | Reduction in report creation workload |
In the financial industry, the advantage of PEFT — enabling model training on-premises without sending data to the cloud — is particularly valuable. With QLoRA, in-house model customization is possible even on a GPU with 12GB VRAM.
Beyond the three industries mentioned above, PEFT is being utilized across a wide range of sectors, including distribution, construction, and tourism. Below is a summary of success patterns common across industries.
Distribution & Retail — By switching adapters for each product category, it is possible to optimize the accuracy of demand forecasting and CS chatbots on a per-product basis. An operational model that prepares separate adapters for food, home appliances, and apparel on a single base model offers excellent cost efficiency.
Construction — Since conditions vary from site to site, an operational approach of swapping adapters by construction type is effective. As adapters are lightweight at just a few MB, they can also run on edge devices at on-site offices.
Tourism & Hospitality — By dynamically switching between language-specific adapters (Japanese, Thai, English, etc.), multilingual chatbots and review analysis can be realized at low cost.
Four points are common across all of these sectors: (1) share a single base model and maintain lightweight task- or site-specific adapters; (2) because adapters weigh only a few MB, they are easy to distribute and can even run on edge devices; (3) training can be kept on-premises, protecting proprietary data; and (4) consumer-grade GPUs are often sufficient, keeping customization costs low.
At Unimon, we utilize LoRA for customizing image generation AI. The following are practical examples of applying LoRA to Stable Diffusion-based models.
| Item | Details |
|---|---|
| Training Tool | kohya-ss/sd-scripts (SDXL compatible) |
| GPU | RTX 40 series (VRAM 12GB) — consumer hardware |
| Training Data | 87 images + text captions |
| LoRA Parameters | network_dim=32, network_alpha=16 |
| Optimizer | AdamW 8bit (VRAM saving) |
| Numerical Precision | bf16 (optimized for RTX 40 series) |
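For reference, a training run with these settings can be sketched as a kohya-ss/sd-scripts invocation like the following (paths are placeholders, and flag availability varies by sd-scripts version; for SDXL models the `sdxl_train_network.py` variant is used):

```shell
# Sketch: LoRA training with kohya-ss/sd-scripts, mirroring the settings above.
# Paths are placeholders; consult the sd-scripts docs for your version's flags.
accelerate launch train_network.py \
  --pretrained_model_name_or_path="/models/base_model.safetensors" \
  --train_data_dir="/data/train" \
  --output_dir="/output/lora" \
  --network_module=networks.lora \
  --network_dim=32 \
  --network_alpha=16 \
  --optimizer_type="AdamW8bit" \
  --mixed_precision="bf16"
```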
| Metric | Full FT (estimated) | LoRA applied (measured) |
|---|---|---|
| Required VRAM | 24 GB or more | 12 GB (50% or less) |
| Training time | Several hours or more | Approx. 40 minutes |
| Model size | 6.5 GB (full model) | 325 MB (adapter only, approx. 1/20) |
| Output quality | Baseline | Equivalent or better (stable at weight 0.7) |
By adopting PEFT, model customization has become possible using in-house consumer GPUs without subscribing to expensive GPU cloud environments. This demonstrates that even small and medium-sized enterprises and startups with limited GPU resources can bring AI model customization in-house.
Here is a summary of frequently asked questions about considering the introduction of PEFT.
PEFT and RAG serve different purposes. PEFT is a technique that changes the model's "behavior," improving output style and accuracy on specific tasks. On the other hand, RAG is a technique that supplements the model's "knowledge," retrieving up-to-date information from external databases and providing it to the model.
| Criteria | PEFT is appropriate | RAG is appropriate |
|---|---|---|
| Want to change the model's output style | ✅ | — |
| Want to reflect the latest information | — | ✅ |
| Want to enhance expertise in a specific domain | ✅ | ✅ (can be used together) |
| Cost | GPU required only during training | Search cost incurred at every inference |
In many cases, combining PEFT and RAG yields the best results.
With QLoRA, it is possible to train 7B parameter models on consumer-grade GPUs with 12GB VRAM (e.g., RTX 4070). With LoRA alone, 16–24GB VRAM (e.g., RTX 4090) is recommended. For models with 70B or more parameters, server-grade GPUs such as the A100 80GB may be required.
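The rough arithmetic behind these guideline figures is straightforward. The calculation below covers model weights only; optimizer state, activations, and CUDA overhead add several more GB, which is why the recommended VRAM numbers are higher:

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory needed for model weights alone at a given numeric precision."""
    return n_params * bits / 8 / 1024**3

n = 7e9  # a 7B-parameter model
for bits, label in [(32, "fp32"), (16, "fp16/bf16"), (4, "4-bit (QLoRA)")]:
    print(f"{label:>13}: {weight_memory_gb(n, bits):5.1f} GB for weights")
```

At 4-bit, the 7B model's weights fit in roughly 3.3 GB, leaving headroom on a 12GB card for LoRA matrices, gradients, and activations; at 16-bit the weights alone already consume about 13 GB.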
Yes, it is possible. By applying PEFT to a Japanese-compatible base model (e.g., Llama 3 Japanese version, ELYZA, etc.), you can perform customization specialized for Japanese language tasks. The Hugging Face PEFT library is also compatible with Japanese models.
Always check the base model's license. Although the LoRA adapter itself is a standalone file, it is used in combination with the base model during inference, so the base model's license terms apply. If you plan to use it commercially, it is safest to choose a model with an Apache 2.0 or MIT license.
PEFT is a technique that significantly lowers the cost barrier for AI model customization.
It can reduce trainable parameters by 99% or more, dramatically cutting GPU costs and training time. If you're unsure which method to choose, start with LoRA; QLoRA is effective in memory-constrained environments. As demonstrated in the Unimon case study introduced in this article, practical customization is entirely achievable even on consumer-grade GPUs (12GB VRAM).
PEFT and RAG are not competing technologies — combining them allows you to maximize the performance of your custom AI.
As a next step, start by organizing your own use cases and identifying which tasks require model customization. The standard low-risk approach is to follow this flow: base model selection → LoRA + small dataset PoC → production deployment.
If you have any questions about AI model customization, please contact Unimon. For more details on AI/DX solutions, visit enison.ai as well.
Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).