LoRA

LoRA

LoRA (Low-Rank Adaptation) is a technique that inserts low-rank delta matrices into the weight matrices of large language models and trains only those deltas, enabling fine-tuning by adding approximately 0.1–1% of the total model parameters.

LoRA adopts a structure in which the product of low-rank matrices A×B is added to the weight matrix W in each layer of the Transformer. Since the original weights W remain frozen and only the added A and B are trained, the number of trainable parameters is kept to approximately 0.1–1% of the entire model. The rank r is typically set somewhere in the range of 4–64; a smaller r reduces the parameter count but involves a trade-off with expressive capacity.

On the implementation side, Hugging Face's PEFT library and Unsloth are widely used, and can be integrated into existing training pipelines with just a few additional lines of code. Trained LoRA adapters can be saved as separate files from the base model (on the order of tens of MB), and by swapping adapters for each task, a single base model can be reused across multiple applications. When GPU memory permits, it is also possible to merge the adapter into the base model to maintain inference speed.

In the author's environment, r=16 tends to strike a good balance between accuracy and efficiency across many tasks, and is often adopted as the initial setting.

However, it is difficult for LoRA alone to acquire capabilities the model did not originally possess—such as adding support for an unsupported language—and in such cases, combining it with continual pre-training becomes necessary.