QLoRA (Quantized LoRA) is a method that combines LoRA with 4-bit quantization, enabling fine-tuning of large language models even on consumer-grade GPUs.
QLoRA, introduced in 2023, was a direct answer to a complaint heard constantly from practitioners: "we don't have enough GPUs."
The core idea is simple: quantize the base model weights to 4-bit to dramatically reduce GPU memory consumption, then train only the LoRA adapters in 16-bit. In other words, it adopts a two-stage design philosophy of "lightweight loading, precise training."
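The two-stage design can be sketched numerically. The following is a toy illustration, not the real implementation: it uses simple symmetric absmax quantization as a stand-in for QLoRA's NormalFloat4, and plain NumPy matrices in place of model layers. The base weight is stored as 4-bit integers and frozen; only the low-rank adapters A and B stay in float and would receive gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_absmax_4bit(w):
    """Symmetric absmax quantization to 4-bit integers (-7..7).
    A toy stand-in for the NF4 data type used by actual QLoRA."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

d, r, alpha = 64, 8, 16                          # illustrative sizes
W = rng.normal(size=(d, d)).astype(np.float32)   # frozen base weight
Wq, s = quantize_absmax_4bit(W)                  # stored in 4-bit

# Trainable LoRA adapters, kept in higher precision.
# B starts at zero so training begins from the base model's behavior.
A = rng.normal(scale=0.01, size=(r, d)).astype(np.float32)
B = np.zeros((d, r), dtype=np.float32)

def forward(x):
    # "Lightweight loading, precise training": dequantize the frozen
    # base on the fly, then add the low-rank delta.
    return dequantize(Wq, s) @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d).astype(np.float32)
y = forward(x)
```

In a real setup this corresponds to loading the base model in 4-bit (e.g. via bitsandbytes) and attaching 16-bit LoRA adapters with a library such as PEFT; only A and B are ever updated.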
In concrete numbers, loading a 65B parameter model at full precision requires multiple A100 80GB GPUs, but QLoRA fits fine-tuning onto a single 48GB card. A 7B model can even be trained on an RTX 3090 (24GB) or RTX 4090. Cloud GPU rental costs can often drop to less than one tenth of full fine-tuning.
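The memory figures above follow from simple arithmetic. The sketch below counts only the base weights; real training additionally needs activations, optimizer state, and quantization constants, so treat these as lower bounds.

```python
# Back-of-the-envelope memory for the base weights alone.
# Excludes activations, optimizer state, and quantization metadata.
def weight_gib(n_params, bits):
    """Weight storage in GiB for n_params parameters at `bits` each."""
    return n_params * bits / 8 / 2**30

for n, name in [(65e9, "65B"), (7e9, "7B")]:
    fp16 = weight_gib(n, 16)
    nf4 = weight_gib(n, 4)
    print(f"{name}: fp16 ~ {fp16:.0f} GiB, 4-bit ~ {nf4:.0f} GiB")
```

At 16-bit, 65B parameters alone need roughly 121 GiB (hence multiple 80GB cards), while 4-bit brings that to about 30 GiB; a 7B model's 4-bit weights are around 3 GiB, leaving headroom on a 24GB consumer GPU.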
However, there are caveats. The accuracy loss from 4-bit quantization is not zero. In the author's own experiments, the gap versus full-precision LoRA was negligible for simple classification and summarization tasks, but a score drop of around 1–3% appeared on tasks requiring mathematical reasoning or sustained logical argument in long-form text. A rational default in practice: start with QLoRA, and switch to full-precision LoRA only if accuracy falls short.


LoRA (Low-Rank Adaptation) is a technique that adds trainable low-rank delta matrices to the weight matrices of a large language model and trains only those deltas, so that fine-tuning touches roughly 0.1–1% of the model's total parameters.
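The 0.1–1% figure is easy to verify for a single weight matrix. The shapes below are illustrative (a 4096×4096 projection with rank 8), not tied to any specific model.

```python
# Trainable-parameter count for one LoRA-adapted weight matrix.
# A low-rank delta B @ A replaces updates to the frozen d_out x d_in base.
d_out, d_in, r = 4096, 4096, 8        # illustrative shapes

full = d_out * d_in                   # frozen base parameters
lora = r * (d_out + d_in)             # B is d_out x r, A is r x d_in
print(f"trainable fraction: {lora / full:.4%}")
```

Here the adapters add 65,536 trainable parameters against ~16.8M frozen ones, about 0.39%, squarely inside the 0.1–1% range quoted above.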

A memory compression technology for LLMs developed by Google. Through quantization it cuts memory consumption to as little as one-sixth and speeds up inference by up to 8x.

An optimization technique that compresses model size by reducing parameter precision from 16-bit to 4-bit or similar, enabling inference with limited computational resources.
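A quantization round trip loses precision, and how the scaling is applied matters. The sketch below is a toy illustration under assumed parameters (Gaussian weights, a block size of 64): it compares one scale for the whole tensor against per-block scales, the blockwise approach that practical 4-bit schemes use to keep error down.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=4096).astype(np.float32)   # toy weight vector

def quant_dequant(x, bits=4):
    """Round-trip through a symmetric integer grid; illustration only."""
    levels = 2 ** (bits - 1) - 1               # 7 positive levels at 4-bit
    scale = np.abs(x).max() / levels
    return np.round(x / scale) * scale

# One scale for the whole tensor: a few outliers inflate the step size.
err_global = np.abs(w - quant_dequant(w)).mean()

# Per-block scales (block size 64 is an assumption) shrink the step
# inside most blocks, so the mean reconstruction error drops.
blocks = w.reshape(-1, 64)
err_block = np.abs(blocks - np.apply_along_axis(quant_dequant, 1, blocks)).mean()
print(f"global-scale error: {err_global:.4f}, blockwise error: {err_block:.4f}")
```

This is the trade-off the definition describes: fewer bits per parameter shrink the model, while details such as block size control how much precision is sacrificed.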

PEFT (Parameter-Efficient Fine-Tuning) is an umbrella term for techniques that customize AI models by updating only a small subset of parameters, reportedly cutting customization costs by up to 90%.

A local LLM refers to a deployment model in which a large language model runs directly on one's own server or PC, rather than through a cloud API.