
PEFT (Parameter-Efficient Fine-Tuning) achieves performance comparable to full fine-tuning, which retrains the entire AI model, while reducing the number of trainable parameters by 99% or more.
This article is aimed at CTOs, VPoEs, and IT system owners considering the business application of AI/LLMs, and explains how PEFT works, its key methods, and the key points for investment decisions. By the end of the article, you will be equipped to select the optimal PEFT method for your organization and make an informed decision on adopting AI model customization.
PEFT (Parameter-Efficient Fine-Tuning) is a collective term for techniques that "freeze" the majority of parameters in a pre-trained AI model and train only a small number of additional parameters.
| Item | Full Fine-Tuning | PEFT |
|---|---|---|
| Training Target | All model parameters | A small number of added parameters (0.1–2% of the total) |
| Required GPU Memory | Tens to hundreds of GB | A few GB to tens of GB |
| Training Time | Days to weeks | Tens of minutes to hours |
| Model Storage Size | Tens of GB (all parameters) | A few MB to hundreds of MB (adapter only) |
| Risk of Catastrophic Forgetting | High | Low |
For example, in the LoRA example from the Hugging Face PEFT documentation, the trainable parameters come to just 0.19% of the total (approximately 2.36 million parameters), and the saved checkpoint is only around 19 MB, compared with tens of gigabytes for a full-model checkpoint (reference: Hugging Face PEFT Blog).
PEFT is similar to "teaching a new task to an expert who already has high capabilities." The expert's foundational abilities (pre-trained knowledge) remain intact, while only the incremental knowledge required for the new task is additionally learned. This allows for efficient customization while preventing "catastrophic forgetting," where foundational capabilities are lost.
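As a minimal code sketch of this "freeze and add" approach, here is what LoRA setup looks like with the Hugging Face `peft` library (the model name and hyperparameters mirror the library's quick-start examples; downloading the model requires network access):

```python
# Sketch: wrap a pre-trained model with LoRA via Hugging Face PEFT.
# The base model's weights stay frozen; only the small LoRA matrices are trained.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-large")  # example model

config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                # rank of the low-rank update matrices
    lora_alpha=32,      # scaling factor for the LoRA update
    lora_dropout=0.1,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()
# prints something like: trainable params: 2359296 || all params: ... || trainable%: 0.19
```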
Around 2023, the scaling of LLMs accelerated even further, making full fine-tuning an option that is "wanted but not feasible." Here are four reasons behind the rapid spread of PEFT.
Recent large language models (LLMs) have reached scales of 70B to 405B parameters. Full fine-tuning of these models requires an environment equipped with multiple A100 80GB GPUs, incurring cloud GPU costs on the order of millions of yen per month. With PEFT, practical customization is possible even on consumer-grade GPUs (such as the RTX 4090, with 24GB of VRAM).
The surge in GPU demand driven by the AI boom has caused cloud GPU prices to trend upward. Since PEFT significantly reduces the required computational resources, it directly translates to optimized GPU costs.
With full fine-tuning, there is a risk that the model will "forget" its pre-training knowledge in the process of adapting to a new task. Since PEFT freezes the original parameters, it allows you to add new capabilities while preserving existing ones.
Adapters (additional parameters) trained with PEFT are saved as files of just a few MB. By simply swapping task-specific adapters for a single base model, you can handle multiple tasks such as translation, summarization, and classification. This eliminates the need to maintain multiple full models, significantly reducing storage and deployment costs.
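The adapter-swapping workflow can be sketched with `peft`'s multi-adapter API (paths and adapter names below are illustrative placeholders):

```python
# Sketch: one frozen base model, multiple task adapters swapped at runtime.
# Assumes Hugging Face `transformers` and `peft`; paths/names are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model")  # shared base model
model = PeftModel.from_pretrained(
    base, "adapters/translation", adapter_name="translation"
)
model.load_adapter("adapters/summarization", adapter_name="summarization")

model.set_adapter("translation")    # route requests through the translation adapter
# ... run translation inference ...
model.set_adapter("summarization")  # switch tasks without reloading the base model
```

Because each adapter is only a few MB, keeping many of them on disk costs almost nothing compared with storing one full model per task.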
"Which PEFT method should I choose?" is the first wall you'll run into. Here, we summarize the major methods in a single comparison table, then present a selection flowchart.
| Method | Mechanism | Memory Efficiency | Performance | Ease of Implementation | Main Use Cases |
|---|---|---|---|---|---|
| LoRA | Adds low-rank matrices to weight matrices | ◎ | ◎ | ◎ | LLM, image generation, speech |
| QLoRA | LoRA + 4-bit quantization | ◎◎ | ◎ | ○ | Memory-constrained environments |
| Adapter | Inserts adapter modules into Transformer layers | ○ | ◎ | ○ | General NLP tasks |
| Prompt Tuning | Adds soft prompts to input | ◎ | ○ | ◎ | Text classification & generation |
| Prefix Tuning | Adds prefix vectors to each layer | ◎ | ○ | ○ | Text generation |
**Q1: What is the size of the base model?**
- 7B or less → LoRA (the standard choice)
- 7B–70B → QLoRA (memory reduction is important)
- 70B or more → QLoRA + DeepSpeed

**Q2: Can you modify the internal structure of the model?**
- Yes → LoRA / Adapter
- No (API only) → Prompt Tuning

**Q3: Do you want to switch between multiple tasks?**
- Yes → LoRA (easy to swap adapters)
- No → Any method will work
LoRA (Low-Rank Adaptation) is a method published by Microsoft Research in 2021 (ref: Hu et al., 2021), and is currently the most widely used PEFT technique.
The weight matrix W of a Transformer model is enormous, but task-specific changes are concentrated in its "low-rank" components. LoRA leverages this property by adding two small matrices A and B instead of directly updating the original weight matrix W.
Original computation: `y = W × x`

After applying LoRA: `y = W × x + (A × B) × x`
Since matrices A and B are each much smaller than the original matrix (depending on rank r), the number of trainable parameters is significantly reduced.
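To see the scale of the savings, here is the back-of-the-envelope arithmetic for a single weight matrix (the 4096×4096 size is illustrative, typical of a 7B-class Transformer's attention projections):

```python
# Trainable parameters for one weight matrix: full update vs. LoRA update.
def lora_trainable(d_out: int, d_in: int, r: int) -> int:
    """LoRA replaces the d_out x d_in update with A (d_out x r) and B (r x d_in)."""
    return d_out * r + r * d_in

d = 4096                  # illustrative hidden size
full = d * d              # updating W directly: 16,777,216 parameters
for r in (4, 8, 16, 32):
    lora = lora_trainable(d, d, r)
    print(f"r={r:>2}: {lora:>9,} params ({100 * lora / full:.2f}% of full)")
```

Even at rank 32, the LoRA update for this matrix is under 2% of the full update's parameter count.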
| Rank Value | Number of Parameters | Use Case |
|---|---|---|
| r = 4–8 | Minimal | Simple tasks (text classification, etc.) |
| r = 16–32 | Standard | General customization |
| r = 64–128 | Large | Complex tasks (high-quality image generation, etc.) |
As the rank increases, expressiveness improves, but the risk of overfitting also rises. In most cases, a range of r = 8–32 provides sufficient performance.
QLoRA is a method that combines LoRA with 4-bit quantization. By applying LoRA while the base model's weights are compressed from 16- or 32-bit down to 4-bit, it can reduce VRAM usage by a further 50–75% compared with standard LoRA.
| Item | LoRA | QLoRA |
|---|---|---|
| Base model precision | 16-bit / 32-bit | 4-bit |
| Additional parameter precision | 16-bit | 16-bit |
| Required VRAM for a 6.7B parameter model | ~16 GB | ~6 GB |
| Training speed | Fast | Slightly slower (quantization overhead) |
| Performance | Baseline | Nearly equivalent to LoRA |
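A QLoRA setup can be sketched as follows (assumes the `transformers`, `bitsandbytes`, and `peft` libraries; the model name is a placeholder):

```python
# Sketch: QLoRA — load the base model in 4-bit, then attach LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as used in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # computation still runs in 16-bit
)
base = AutoModelForCausalLM.from_pretrained(
    "base-model-7b", quantization_config=bnb_config
)
base = prepare_model_for_kbit_training(base)  # enables gradient checkpointing, etc.

model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```

Note that only the frozen base weights are 4-bit; the LoRA matrices themselves stay in 16-bit, which is why output quality remains close to standard LoRA.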
PEFT is easy to get started with, but that very ease comes with its own pitfalls. Here are four common patterns, including failures we actually encountered.
Problem: Excessively increasing the rank in pursuit of expressiveness leads to overfitting to the training data and degraded generalization performance.
Workaround: Start with r = 8–16, then adjust incrementally while monitoring performance on validation data. Avoid increasing the number of epochs too much, and compare performance at intermediate checkpoints.
Problem: When performing PEFT with a small amount of training data, data quality directly impacts the results. Noisy or biased data will degrade performance.
Workaround: Prioritize data quality over data quantity. 100 high-quality data points often outperform 1,000 low-quality ones.
Problem: Applying PEFT to a base model that is unsuitable for the task will not yield sufficient performance. PEFT is a technique for "fine-tuning" a model's existing capabilities, not for adding capabilities that do not exist.
Workaround: Verify in advance that the base model has the foundational capabilities required for the task. For Japanese-language tasks, select a Japanese-compatible model; for coding tasks, select a code-specialized model.
Problem: Depending on the GPU architecture, training may become unstable with certain numerical precisions (e.g., fp16).
Workaround: Select a precision setting appropriate for the GPU architecture being used. For example, RTX 40-series (Ada Lovelace) GPUs natively support bf16, which may provide more stable training than fp16.
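One simple way to encode this rule in a training script (the helper function and its policy are our own illustration, not a library API):

```python
def choose_precision(supports_bf16: bool) -> str:
    """Prefer bf16 where the GPU supports it natively (e.g. RTX 40 series,
    Ampere and newer); fall back to fp16 otherwise."""
    return "bf16" if supports_bf16 else "fp16"

# With PyTorch installed, the capability check would be:
#   import torch
#   precision = choose_precision(torch.cuda.is_bf16_supported())
print(choose_precision(True))   # -> bf16
print(choose_precision(False))  # -> fp16
```

If training still diverges at the chosen precision, dropping back to fp32 trades speed and memory for numerical stability.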
PEFT is particularly effective in industries that have their own proprietary data and terminology. Here, we explore specific scenarios for three representative industries. Points that are common to other industries are summarized at the end in "Cross-Industry Points."
In manufacturing environments, product images and equipment data often contain company-specific patterns that general-purpose models frequently fail to handle adequately.
| Use Case | PEFT Application Method | Expected Benefits |
|---|---|---|
| Automated visual inspection | Train defect patterns of in-house products on an image classification model using LoRA | Improved inspection accuracy, reduced workload for inspectors |
| Predictive detection of equipment anomalies | Adapt a time-series data model to sensor data from in-house equipment | Reduction in unplanned downtime |
| Automatic summarization of technical documents | Train an LLM on internal technical terminology to auto-generate meeting minutes and reports | Reduced man-hours for document creation |
In the manufacturing industry, products and equipment differ from factory to factory, making it efficient to share a base model while creating factory-specific LoRA adapters for each site.
The medical field contains many specialized terms, making it an area where general-purpose LLMs often struggle to achieve sufficient accuracy. PEFT enables low-cost, medicine-specific customization.
| Use Case | PEFT Application Method | Expected Effect |
|---|---|---|
| Summarizing medical records and referral letters | Train LLMs on medical terminology and abbreviations via PEFT | Improved summarization accuracy, reduced physician workload |
| Assisted classification of medical images | Adapt image classification models to facility-specific imaging conditions | Improved screening accuracy |
| Support for multilingual medical interpretation | Incorporate medical terminology dictionaries into translation models via PEFT | Improved communication in multilingual environments across Southeast Asia |
Note: Medical AI may be subject to regulations in each country (Pharmaceutical and Medical Device Act, FDA, etc.). When deploying PEFT-created models in clinical settings, be sure to verify the regulatory requirements of the relevant authorities.
In the financial industry, there is a constraint that confidential data cannot be shared externally, making PEFT a highly compatible approach as it operates entirely within an on-premises environment.
| Use Case | PEFT Application Method | Expected Benefits |
|---|---|---|
| Fraud transaction detection | Adapt classification models to in-house transaction patterns | Reduction in false positive rates, improvement in detection accuracy |
| Automated reading of screening documents | Train LLMs on contract and application form formats using PEFT | Reduction in screening lead time |
| Automated regulatory report generation | Adapt LLMs to authority reporting formats and terminology | Reduction in report creation workload |
In the financial industry, the advantage of PEFT — enabling model training on-premises without sending data to the cloud — is particularly valuable. With QLoRA, in-house model customization is possible even on a GPU with 12GB VRAM.
Beyond the three industries mentioned above, PEFT is being utilized across a wide range of sectors, including distribution, construction, and tourism. Below is a summary of success patterns common across industries.
Distribution & Retail — By switching adapters for each product category, it is possible to optimize the accuracy of demand forecasting and CS chatbots on a per-product basis. An operational model that prepares separate adapters for food, home appliances, and apparel on a single base model offers excellent cost efficiency.
Construction — Since conditions vary from site to site, an operational approach of swapping adapters by construction type is effective. As adapters are lightweight at just a few MB, they can also run on edge devices at on-site offices.
Tourism & Hospitality — By dynamically switching between language-specific adapters (Japanese, Thai, English, etc.), multilingual chatbots and review analysis can be realized at low cost.
Four points are common across all of these sectors: (1) share a single base model and maintain lightweight task- or site-specific adapters; (2) because adapters weigh only a few MB, they are easy to distribute and can even run on edge devices; (3) training can be kept on-premises, protecting proprietary data; and (4) consumer-grade GPUs are often sufficient, keeping customization costs low.
At Unimon, we utilize LoRA for customizing image generation AI. The following are practical examples of applying LoRA to Stable Diffusion-based models.
| Item | Details |
|---|---|
| Training Tool | kohya-ss/sd-scripts (SDXL compatible) |
| GPU | RTX 40 series (VRAM 12GB) — consumer hardware |
| Training Data | 87 images + text captions |
| LoRA Parameters | network_dim=32, network_alpha=16 |
| Optimizer | AdamW 8bit (VRAM saving) |
| Numerical Precision | bf16 (optimized for RTX 40 series) |
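For reference, a training run with these settings can be sketched as a kohya-ss/sd-scripts invocation like the following (paths are placeholders, and flag availability varies by sd-scripts version; for SDXL models the `sdxl_train_network.py` variant is used):

```shell
# Sketch: LoRA training with kohya-ss/sd-scripts, mirroring the settings above.
# Paths are placeholders; consult the sd-scripts docs for your version's flags.
accelerate launch train_network.py \
  --pretrained_model_name_or_path="/models/base_model.safetensors" \
  --train_data_dir="/data/train" \
  --output_dir="/output/lora" \
  --network_module=networks.lora \
  --network_dim=32 \
  --network_alpha=16 \
  --optimizer_type="AdamW8bit" \
  --mixed_precision="bf16"
```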
| Metric | Full FT (estimated) | LoRA applied (measured) |
|---|---|---|
| Required VRAM | 24 GB or more | 12 GB (50% or less) |
| Training time | Several hours or more | Approx. 40 minutes |
| Model size | 6.5 GB (full model) | 325 MB (adapter only, approx. 1/20) |
| Output quality | Baseline | Equivalent or better (stable at weight 0.7) |
By adopting PEFT, model customization has become possible using in-house consumer GPUs without subscribing to expensive GPU cloud environments. This demonstrates that even small and medium-sized enterprises and startups with limited GPU resources can bring AI model customization in-house.
Here is a summary of frequently asked questions about considering the introduction of PEFT.
PEFT and RAG serve different purposes. PEFT is a technique that changes the model's "behavior," improving output style and accuracy on specific tasks. On the other hand, RAG is a technique that supplements the model's "knowledge," retrieving up-to-date information from external databases and providing it to the model.
| Criteria | PEFT is appropriate | RAG is appropriate |
|---|---|---|
| Want to change the model's output style | ✅ | — |
| Want to reflect the latest information | — | ✅ |
| Want to enhance expertise in a specific domain | ✅ | ✅ (can be used together) |
| Cost | GPU required only during training | Search cost incurred at every inference |
In many cases, combining PEFT and RAG yields the best results.
With QLoRA, it is possible to train 7B parameter models on consumer-grade GPUs with 12GB VRAM (e.g., RTX 4070). With LoRA alone, 16–24GB VRAM (e.g., RTX 4090) is recommended. For models with 70B or more parameters, server-grade GPUs such as the A100 80GB may be required.
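The rough arithmetic behind these guideline figures is straightforward. The calculation below covers model weights only; optimizer state, activations, and CUDA overhead add several more GB, which is why the recommended VRAM numbers are higher:

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory needed for model weights alone at a given numeric precision."""
    return n_params * bits / 8 / 1024**3

n = 7e9  # a 7B-parameter model
for bits, label in [(32, "fp32"), (16, "fp16/bf16"), (4, "4-bit (QLoRA)")]:
    print(f"{label:>13}: {weight_memory_gb(n, bits):5.1f} GB for weights")
```

At 4-bit, the 7B model's weights fit in roughly 3.3 GB, leaving headroom on a 12GB card for LoRA matrices, gradients, and activations; at 16-bit the weights alone already consume about 13 GB.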
Yes, it is possible. By applying PEFT to a Japanese-compatible base model (e.g., Llama 3 Japanese version, ELYZA, etc.), you can perform customization specialized for Japanese language tasks. The Hugging Face PEFT library is also compatible with Japanese models.
Always check the base model's license. Although the LoRA adapter itself is a standalone file, it is used in combination with the base model during inference, so the base model's license terms apply. If you plan to use it commercially, it is safest to choose a model with an Apache 2.0 or MIT license.
PEFT is a technique that significantly lowers the cost barrier for AI model customization.
It can reduce trainable parameters by 99% or more, dramatically cutting GPU costs and training time. If you're unsure which method to choose, start with LoRA; QLoRA is effective in memory-constrained environments. As demonstrated in the Unimon case study introduced in this article, practical customization is entirely achievable even on consumer-grade GPUs (12GB VRAM).
PEFT and RAG are not competing technologies — combining them allows you to maximize the performance of your custom AI.
As a next step, start by organizing your own use cases and identifying which tasks require model customization. The standard low-risk approach is to follow this flow: base model selection → LoRA + small dataset PoC → production deployment.
If you have any questions about AI model customization, please contact Unimon. For more details on AI/DX solutions, visit enison.ai as well.
Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).