QLoRA

QLoRA (Quantized LoRA) is a method that combines LoRA with 4-bit quantization, enabling fine-tuning of large language models even on consumer-grade GPUs.

Introduced in 2023, QLoRA was a direct response to a complaint practitioners kept raising: "we don't have enough GPUs."

The core idea is simple: quantize the base model weights to 4-bit to dramatically reduce GPU memory consumption, keep those quantized weights frozen, and train only the LoRA adapters in 16-bit. In other words, it follows a two-stage design philosophy of "lightweight loading, precise training."
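To make the "frozen 4-bit base, trainable adapters" split concrete, here is a minimal pure-Python sketch. It is an illustration under loud assumptions, not the QLoRA implementation: it uses plain per-row absmax rounding to integer codes in [-7, 7] instead of the NF4 data type, ordinary Python floats instead of 16-bit adapters, and the names `QLoRALinear`, `quantize_row`, and `sgd_step` are hypothetical.

```python
def quantize_row(row):
    """Absmax-quantize one weight row to integer codes in [-7, 7] plus one scale.
    Simplified stand-in for 4-bit storage (assumption: not the NF4 format)."""
    scale = max(abs(w) for w in row) / 7 or 1.0  # fall back to 1.0 for an all-zero row
    return [round(w / scale) for w in row], scale

def dequantize_row(codes, scale):
    return [c * scale for c in codes]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

class QLoRALinear:  # hypothetical name, for illustration only
    def __init__(self, W, A, alpha):
        self.rows = [quantize_row(r) for r in W]   # frozen base: (codes, scale) per row
        self.A = [list(r) for r in A]              # trainable adapter, shape r x d_in
        self.B = [[0.0] * len(A) for _ in W]       # trainable adapter, d_out x r, zero init
        self.scaling = alpha / len(A)              # LoRA scaling alpha / r

    def forward(self, x):
        # Dequantize the frozen base on the fly, then add the low-rank update.
        base = matvec([dequantize_row(c, s) for c, s in self.rows], x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scaling * d for b, d in zip(base, delta)]

    def sgd_step(self, x, target, lr):
        """One SGD step on 0.5 * ||forward(x) - target||^2.
        Only A and B are updated; the quantized base is never touched."""
        err = [y - t for y, t in zip(self.forward(x), target)]
        u = matvec(self.A, x)
        # Gradients: dL/dB = s * err u^T and dL/dA = s * (B^T err) x^T
        bt_err = [sum(self.B[i][k] * err[i] for i in range(len(err)))
                  for k in range(len(u))]
        for i, e in enumerate(err):
            for k, uk in enumerate(u):
                self.B[i][k] -= lr * self.scaling * e * uk
        for k, g in enumerate(bt_err):
            for j, xj in enumerate(x):
                self.A[k][j] -= lr * self.scaling * g * xj
        return 0.5 * sum(e * e for e in err)  # loss measured before the update

# Toy 2x4 layer with rank-2 adapters (all values are made up for the example).
W = [[0.7, -1.4, 0.35, 0.0], [2.1, 0.3, -0.9, 1.2]]
A = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.2, -0.1, 0.1]]
layer = QLoRALinear(W, A, alpha=2)
x, target = [1.0, 2.0, -1.0, 0.5], [0.0, 1.0]

frozen_before = [list(codes) for codes, _ in layer.rows]
loss_before = layer.sgd_step(x, target, lr=0.5)
loss_after = 0.5 * sum((y - t) ** 2 for y, t in zip(layer.forward(x), target))
print(loss_before, loss_after)
```

The design point is that the optimizer never sees the base weights: the integer codes stay exactly as loaded, and all learning happens in the two small matrices `A` and `B`.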

In concrete numbers, loading a 65B parameter model at full precision requires multiple A100 80GB GPUs (the weights alone take about 130 GB even in 16-bit), whereas QLoRA fits it onto a single card. For 7B models, training can even run on an RTX 3090 or RTX 4090 (both 24GB). The cost of renting cloud GPU instances can often be cut to less than one-tenth of that of full fine-tuning.
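The figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts weight storage only, assuming 1 GB = 10^9 bytes; it deliberately ignores quantization constants, LoRA adapters, activations, and optimizer state, so real usage is higher. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

estimates = {}
for name, num_params in [("7B", 7e9), ("65B", 65e9)]:
    for bits in (32, 16, 4):
        estimates[(name, bits)] = weight_memory_gb(num_params, bits)
        print(f"{name} @ {bits}-bit: {estimates[(name, bits)]:.1f} GB")

# 7B @ 32-bit: 28.0 GB
# 7B @ 16-bit: 14.0 GB
# 7B @ 4-bit: 3.5 GB
# 65B @ 32-bit: 260.0 GB
# 65B @ 16-bit: 130.0 GB
# 65B @ 4-bit: 32.5 GB
```

At 4-bit, the 65B weights drop below the capacity of one 80GB card, and the 7B weights leave most of a 24GB card free for adapters, activations, and optimizer state.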

There are caveats, however. The accuracy loss from 4-bit quantization is not zero. In the author's own experiments, the difference from full-precision LoRA was negligible for simple classification and summarization tasks, but scores dropped by around 1–3% on tasks that require mathematical reasoning or sustained logical argument over long-form text. In practice, the sensible approach seems to be: "start with QLoRA, and switch to full-precision LoRA if the accuracy is insufficient."