LoRA (Low-Rank Adaptation) is a technique that inserts low-rank delta matrices alongside the weight matrices of large language models and trains only those deltas, so fine-tuning updates only roughly 0.1–1% of the model's total parameters.
LoRA attaches a pair of low-rank matrices to each target weight matrix W in the Transformer and adds their product as a delta to W. The original weights W remain frozen and only the added A and B are trained, which keeps the number of trainable parameters to approximately 0.1–1% of the entire model. The rank r is typically set somewhere in the range of 4–64; a smaller r reduces the parameter count but trades off expressive capacity.

On the implementation side, Hugging Face's PEFT library and Unsloth are widely used, and either can be integrated into an existing training pipeline with just a few additional lines of code. A trained LoRA adapter can be saved as a file separate from the base model (on the order of tens of MB), and by swapping adapters per task, a single base model can be reused across multiple applications. When memory permits, the adapter can also be merged into the base model so that inference speed matches the original model.

In the author's environment, r=16 tends to strike a good balance between accuracy and efficiency across many tasks and is often adopted as the initial setting. However, LoRA alone struggles to acquire capabilities the base model never possessed, such as support for an unsupported language; in such cases, combining it with continual pre-training becomes necessary.
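The mechanism above can be sketched in a few lines of numpy. This is a toy illustration, not any library's API: the layer sizes, the scaling factor alpha, and the zero initialization of B (the standard trick that makes the delta start at zero) are assumptions chosen for the sketch, with r=16 as in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 512, 512, 16   # toy layer sizes; r=16 as suggested in the text
alpha = 32                      # common LoRA scaling hyperparameter (assumed)

W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight, never updated
A = rng.normal(size=(r, d_in)) * 0.01    # trainable low-rank factor
B = np.zeros((d_out, r))                 # trainable; zero init so the delta starts at 0

def lora_forward(x):
    # y = x W^T + (alpha/r) * x (B A)^T  -- only A and B would receive gradients
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(1, d_in))
y = lora_forward(x)

trainable = A.size + B.size
total = W.size + trainable
print(f"trainable fraction: {trainable / total:.4f}")
```

Because B starts at zero, the adapted model initially behaves exactly like the frozen base model; training then moves only A and B. At realistic model sizes the trainable fraction is far smaller than in this toy example, landing in the 0.1–1% range the text describes.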

QLoRA (Quantized LoRA) is a method that combines LoRA with 4-bit quantization: the frozen base weights are stored in 4-bit precision while the LoRA adapters are trained in higher precision, enabling fine-tuning of large language models even on consumer-grade GPUs.
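The core idea can be sketched with a simple per-row absmax quantizer in numpy. Real QLoRA uses the NF4 format from bitsandbytes rather than the uniform 4-bit grid assumed here; this sketch only shows the shape of the computation: quantized frozen weights are dequantized on the fly, and the LoRA delta stays in full precision.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 256, 8

W = rng.normal(size=(d, d)).astype(np.float32)  # frozen base weight

# Uniform 4-bit absmax quantization per row: integer levels in [-7, 7]
# plus one float scale per row (NF4 in real QLoRA is non-uniform).
scales = np.abs(W).max(axis=1, keepdims=True) / 7.0
W_q = np.clip(np.round(W / scales), -7, 7).astype(np.int8)

A = rng.normal(size=(r, d)).astype(np.float32) * 0.01  # trainable, full precision
B = np.zeros((d, r), dtype=np.float32)                 # trainable, zero init

def qlora_forward(x):
    W_deq = W_q * scales                  # dequantize the frozen weights on the fly
    return x @ W_deq.T + (x @ A.T) @ B.T  # LoRA delta added in full precision

x = rng.normal(size=(1, d)).astype(np.float32)
y = qlora_forward(x)
print("max quantization error:", np.abs(W_q * scales - W).max())
```

The memory saving comes from storing W_q in 4 bits per weight (here only conceptually, via int8) while gradients flow exclusively through the small full-precision A and B.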

A local LLM refers to operating a large language model directly on one's own server or PC, without going through a cloud API.

LLM (Large Language Model) is a general term for neural network models pre-trained on massive amounts of text data, containing billions to trillions of parameters, capable of understanding and generating natural language with high accuracy.

PEFT (Parameter-Efficient Fine-Tuning) is an umbrella term for techniques, including LoRA, that customize a large model by updating only a small fraction of its parameters, reportedly reducing fine-tuning costs by as much as 90%.
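A quick back-of-the-envelope calculation shows where the savings come from. The model shape below (a hypothetical 7B-parameter model with 32 layers, hidden size 4096, LoRA rank 16 applied to the four attention projections) is an assumption for illustration, not a specific published model:

```python
# Hypothetical 7B model: how many parameters does LoRA actually train?
total_params = 7_000_000_000
n_layers, d_model, r = 32, 4096, 16

# One A (r x d_model) and one B (d_model x r) per projection,
# applied to the q, k, v, and o attention projections in every layer.
lora_params_per_layer = 4 * 2 * r * d_model
lora_total = n_layers * lora_params_per_layer

print(f"{lora_total:,} trainable params "
      f"({lora_total / total_params:.2%} of the model)")
```

Roughly 17 million trainable parameters against 7 billion frozen ones, i.e. about 0.24%, which is consistent with the 0.1–1% range cited for LoRA above.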

RLHF (Reinforcement Learning from Human Feedback) is a reinforcement learning method that uses a reward model trained on human preference judgments as the reward signal, while RLVR (Reinforcement Learning with Verifiable Rewards) uses automatically checkable correct answers, such as math solutions or passing test cases, as the reward; both are used to align LLM outputs with human expectations.
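The contrast between the two reward signals can be sketched as two functions. Both are hypothetical stand-ins for illustration, not any RL library's API: in RLHF the scalar comes from a learned reward model, while in RLVR it comes from a deterministic check against a known answer.

```python
def rlvr_reward(model_answer: str, ground_truth: str) -> float:
    # RLVR: reward is computed by an automatic, verifiable check.
    # Exact match after whitespace normalization is the simplest case;
    # real systems may run unit tests or a math checker instead.
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def rlhf_reward(preference_score: float) -> float:
    # RLHF: reward is the scalar output of a reward model trained on
    # human preference comparisons (here simply passed through).
    return preference_score

print(rlvr_reward("42", " 42 "))  # matches after stripping whitespace
print(rlvr_reward("41", "42"))    # wrong answer, zero reward
```

The practical difference is that RLVR rewards are cheap and objective but only apply to tasks with checkable answers, whereas RLHF rewards cover open-ended tasks at the cost of collecting human preference data and training a reward model.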