PEFT (Parameter-Efficient Fine-Tuning) is a collective term for fine-tuning methods that adapt a large language model to a specific task with minimal computational resources and data, by updating only a subset of the model's parameters rather than all of them.
Fully fine-tuning an LLM with tens of billions of parameters, updating every weight, can take more than half a day even with eight A100s running in parallel. In environments with limited budgets and hardware, this approach is often not even a viable option. PEFT breaks through this barrier by freezing the majority of the model and training only a small number of added parameters. The major techniques can be summarized as follows:

- **LoRA** (Low-Rank Adaptation) — Inserts low-rank deltas into weight matrices. The added parameters account for roughly 0.1–1% of the entire model. Currently the most widely adopted approach.
- **QLoRA** — Combines LoRA with 4-bit quantization, cutting GPU memory use by more than half. Enables training of 7B models even on consumer GPUs.
- **Prefix Tuning / Prompt Tuning** — Adds trainable vectors on the input side, leaving the model's weights entirely untouched.
- **Adapter** — Inserts bottleneck layers between Transformer layers. This was the dominant approach before LoRA emerged, but it has since faded from prominence.

The author's team fine-tuned a 7B-parameter LLM with LoRA on a single A100 in approximately 3 hours, improving task-specific accuracy by 15–20% over the base model. Full fine-tuning of the same task would have required eight A100s for 12 hours, making the cost difference stark.

That said, PEFT is not a silver bullet. PEFT alone struggles to instill capabilities the model does not already possess, such as generation in unsupported languages; in such cases it must be combined with continued pre-training.

A common question is "which should I use, PEFT or RAG?", but the two serve fundamentally different roles: RAG handles retrieval of external knowledge, while PEFT handles adjustment of the model's behavior and style.
A practical starting point for choosing between them is: use RAG when you need accurate citation of internal knowledge, and use PEFT when you want to standardize the tone or format of responses. Combining both is not uncommon either.
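The core idea behind LoRA can be shown in a few lines: the pretrained weight stays frozen, and only two small low-rank factors are trained. The sketch below is illustrative only; the matrix shapes and hyperparameters (r, alpha) are assumptions for a single attention weight, not tied to any specific model.

```python
import numpy as np

# Minimal LoRA sketch: effective weight is W + (alpha / r) * B @ A,
# where W is frozen and only the small factors A and B are trainable.
d, k = 4096, 4096          # assumed dimensions of one weight matrix
r, alpha = 8, 16           # LoRA rank and scaling factor (typical values)

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))            # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01     # trainable down-projection
B = np.zeros((d, r))                       # trainable up-projection, zero-init

def lora_forward(x):
    # Base path plus low-rank delta; with B zero-initialized, the layer
    # starts out exactly equal to the base model.
    return W @ x + (alpha / r) * (B @ (A @ x))

full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.4%}")
# → trainable fraction: 0.3906%
```

The trainable fraction of under 1% matches the 0.1–1% range cited above, and it is this tiny optimizer state that lets a 7B model fit on a single GPU.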


Fine-tuning refers to the process of providing additional training data to a pre-trained machine learning model in order to adapt it to a specific task or domain.

Prompt engineering is the practice of designing the structure, phrasing, and context of input text (prompts) in order to elicit desired outputs from LLMs (Large Language Models).

An open-weight model is a language model whose trained weights (parameters) are publicly released and can be freely downloaded for use in inference and fine-tuning.

What is PEFT (Parameter-Efficient Fine-Tuning)? A Technology That Reduces AI Model Customization Costs by 90%

MoE (Mixture of Experts) is an architecture that contains multiple "expert" subnetworks within a model, activating only a subset of them for each input, thereby increasing the total number of parameters while keeping inference costs low.
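The routing idea behind MoE can be sketched with a toy gate that activates only the top-k experts per input. Everything here is a simplification under assumed toy dimensions; real MoE layers sit inside Transformer blocks and are trained jointly with load-balancing objectives.

```python
import numpy as np

# Toy MoE routing sketch: many experts exist, but only top_k run per input,
# so parameter count grows with n_experts while compute grows with top_k.
rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2

experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # "expert" = tiny linear layer
gate = rng.standard_normal((n_experts, d))                         # router weights

def moe_forward(x):
    logits = gate @ x
    chosen = np.argsort(logits)[-top_k:]       # indices of the top-k experts
    weights = np.exp(logits[chosen] - logits[chosen].max())
    weights /= weights.sum()                   # softmax over the chosen experts only
    return sum(w * (experts[i] @ x) for w, i in zip(weights, chosen))

y = moe_forward(rng.standard_normal(d))
print(y.shape)  # (16,)
```

Here 8 experts hold parameters but only 2 execute per token, which is the sense in which MoE decouples total parameter count from inference cost.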