QLoRA (Quantized LoRA) is a method that combines LoRA with 4-bit quantization, enabling fine-tuning of large language models even on consumer-grade GPUs.
QLoRA, announced in 2023, was a direct answer to practitioners' complaint that "we don't have enough GPUs." The core idea is simple: quantize the base model's weights to 4-bit to dramatically cut GPU memory consumption, then train only the LoRA adapters in 16-bit. In other words, it adopts a two-stage design philosophy of "lightweight loading, precise training."

In concrete numbers, loading a 65B-parameter model at full precision requires multiple A100 80GB GPUs, whereas QLoRA fits it onto a single card. A 7B model can even be trained on an RTX 3090 (24GB) or RTX 4090, and the cost of renting cloud GPU instances can often be cut to less than one tenth of full fine-tuning.

There are caveats, however. Accuracy degradation from 4-bit quantization is not zero. In the author's own experiments, the difference from full-precision LoRA was negligible for simple classification and summarization tasks, but scores dropped by around 1–3% on tasks requiring mathematical reasoning or sustained logical argument in long-form text. In practice, the rational approach is: start with QLoRA, and switch to full-precision LoRA if accuracy falls short.
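The memory arithmetic above can be sketched with a back-of-the-envelope estimate. This is an illustrative calculation of weight storage only (it ignores activations, KV cache, and optimizer state), not a measured benchmark:

```python
# Rough VRAM estimate for weights: QLoRA (4-bit base) vs. full 16-bit loading.

def model_vram_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB (weights only)."""
    return n_params * bits_per_param / 8 / 1e9

params_65b = 65e9

fp16_gb = model_vram_gb(params_65b, 16)  # 16-bit weights
nf4_gb = model_vram_gb(params_65b, 4)    # 4-bit quantized weights

# LoRA adapters stay in 16-bit but add well under 1% of the parameters,
# so their footprint is small next to the base weights (0.5% assumed here).
lora_gb = model_vram_gb(params_65b * 0.005, 16)

print(f"65B @ 16-bit:  {fp16_gb:.0f} GB")   # ~130 GB -> multiple A100 80GB cards
print(f"65B @ 4-bit:   {nf4_gb:.1f} GB")    # ~32.5 GB -> fits on one card
print(f"LoRA adapters: {lora_gb:.2f} GB")
```

The same arithmetic for a 7B model gives roughly 3.5 GB of 4-bit weights, which is why training fits within a 24GB consumer card once activations and optimizer state are added.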


LoRA (Low-Rank Adaptation) is a technique that inserts low-rank delta matrices into the weight matrices of large language models and trains only those deltas, enabling fine-tuning by adding approximately 0.1–1% of the total model parameters.
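The low-rank delta can be sketched in a few lines of NumPy. The dimensions and the alpha/r scaling follow the LoRA paper's convention; the concrete numbers here are illustrative:

```python
import numpy as np

# Instead of fine-tuning the full weight matrix W (d x k), LoRA trains two
# small matrices B (d x r) and A (r x k) with rank r << min(d, k).
# The effective weight used in the forward pass is W + (alpha / r) * B @ A.

d, k, r, alpha = 4096, 4096, 8, 16

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable, initialized small
B = np.zeros((d, r))                     # trainable, zero-init: delta starts at 0

W_eff = W + (alpha / r) * B @ A          # identical to W before any training

full_params = d * k
lora_params = d * r + r * k
print(f"trainable fraction: {lora_params / full_params:.4%}")  # 0.3906% for r=8
```

Because B is zero-initialized, the model's behavior is unchanged at the start of training, and only the small A and B matrices receive gradients.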

A local LLM refers to an operational model in which a large language model is run directly on one's own server or PC, without going through a cloud API.

RLHF (Reinforcement Learning from Human Feedback) is a reinforcement learning method that uses human preference judgments as the reward signal, while RLVR (Reinforcement Learning with Verifiable Rewards) instead uses programmatically checkable correct answers; both are used to align LLM outputs with desired behavior.
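The difference in reward source can be illustrated with two toy functions. These are stand-ins, not a real training loop: the answer format ("Answer: <value>") and the stubbed reward model are assumptions for illustration only:

```python
# RLVR: reward comes from a programmatic check against a known correct answer.
def rlvr_reward(model_output: str, gold_answer: str) -> float:
    """Verifiable reward: 1.0 if the extracted answer matches, else 0.0."""
    # Assume the model ends its response with "Answer: <value>".
    predicted = model_output.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if predicted == gold_answer else 0.0

# RLHF: reward comes from a learned reward model trained on human
# preference data; here it is only a stub to show the interface.
def rlhf_reward(model_output: str) -> float:
    """Placeholder for a learned reward model's scalar score."""
    return 0.0  # a real implementation would run a neural reward model

print(rlvr_reward("The sum is 12. Answer: 12", "12"))  # 1.0
print(rlvr_reward("The sum is 13. Answer: 13", "12"))  # 0.0
```

RLVR is attractive when correctness can be checked automatically (math, code with unit tests), since it needs no preference-labeling effort and its reward cannot be gamed the way a learned reward model sometimes can.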


Local LLM / SLM Deployment Comparison — AI Utilization Without Cloud API Dependency

LLM (Large Language Model) is a general term for neural network models pre-trained on massive amounts of text data, containing billions to trillions of parameters, capable of understanding and generating natural language with high accuracy.