RAG (Retrieval-Augmented Generation) is a technique that improves the accuracy and currency of responses by retrieving relevant information from external knowledge sources and appending the results to the input of an LLM.
LLMs only possess knowledge up to their training cutoff date. Moreover, even with the knowledge they do have, they can be confidently wrong (hallucination). RAG has established itself as a practical solution to these two weaknesses.

The mechanism is intuitive. Upon receiving a user's question, relevant documents are first retrieved from internal documents or a knowledge base. The retrieved results are then passed to the LLM along with the question. The LLM generates a response grounded in the provided documents rather than relying on its own knowledge alone. Since sources can be explicitly cited, verifying responses becomes straightforward.

Breaking RAG down into its components, it consists of document preprocessing (chunking), vector embedding, similarity search (semantic search), and prompt construction for the LLM. Each step involves choices, and something as simple as how chunks are split can significantly impact response quality.

The distinction between RAG and fine-tuning is frequently debated, but they serve different roles. RAG is a method for "having the model reference external knowledge," while fine-tuning is a method for "adjusting the model's behavior and tone." If the goal is to have the model accurately answer questions based on internal manuals, RAG is the reasonable starting point; if the goal is to standardize the format and style of responses, fine-tuning is the better fit. Many projects employ both in combination.
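The pipeline described above (chunking, embedding, similarity search, prompt construction) can be sketched end to end in plain Python. This is a minimal illustration, not any library's actual API: the bag-of-words "embedding" stands in for a real embedding model, and all function names and parameters are hypothetical.

```python
import math
from collections import Counter

def chunk(text, size=40, overlap=10):
    # Split a document into overlapping word-window chunks
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    # Stand-in for a real embedding model: sparse bag-of-words vector
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse vectors
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # Rank chunks by similarity to the query and keep the top k
    qv = embed(query)
    return sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)[:k]

def build_prompt(query, passages):
    # Ground the LLM in the retrieved passages
    context = "\n---\n".join(passages)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")
```

In a real system, `embed` would call an embedding model, the search would run against a vector database, and `build_prompt`'s output would be sent to the LLM; the flow, however, is exactly this.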


LoRA (Low-Rank Adaptation) is a technique that inserts low-rank delta matrices into the weight matrices of large language models and trains only those deltas, enabling fine-tuning by adding approximately 0.1–1% of the total model parameters.
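The idea can be shown in a few lines of numpy: the pretrained weight W stays frozen, and only the two small factors of the low-rank delta are trained. This is a minimal sketch with illustrative sizes, not a training framework's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 8  # hypothetical hidden size and LoRA rank

W = rng.standard_normal((d, d))          # frozen pretrained weight, never updated
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-initialized

x = rng.standard_normal(d)
# Forward pass: base output plus the low-rank delta, i.e. (W + B @ A) @ x
y = W @ x + B @ (A @ x)

# Only A and B are trained: 2*r*d extra parameters per d-by-d matrix
trainable_fraction = (A.size + B.size) / W.size
```

Because B starts at zero, the delta is zero at initialization and training begins exactly from the pretrained model. Per matrix the trainable fraction here is about 2%; applied only to selected matrices (typically attention projections), the model-wide fraction lands in the 0.1–1% range cited above.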

LLM (Large Language Model) is a general term for neural network models pre-trained on massive amounts of text data, containing billions to trillions of parameters, capable of understanding and generating natural language with high accuracy.


A local LLM refers to a deployment in which a large language model runs directly on one's own server or PC, without going through a cloud API.

Human-in-the-Loop (HITL) is a design approach that incorporates human participation, such as review and approval, at key points in AI-driven business process automation, so that automated processes remain reliable and accountable.
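The core of a HITL design is a gate that routes each AI output either straight through or into a human review queue. A minimal sketch, with all names and the confidence threshold purely illustrative:

```python
# Hypothetical HITL gate: high-confidence AI outputs are approved
# automatically; the rest wait for a human decision.
review_queue = []

def route(item, confidence, threshold=0.8):
    if confidence >= threshold:
        return "auto_approved"
    review_queue.append(item)   # a human later approves or rejects this item
    return "pending_human_review"
```

The threshold controls the trade-off between automation rate and human workload; real systems also log every automatic decision so humans can audit them after the fact.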

QLoRA (Quantized LoRA) is a method that combines LoRA with 4-bit quantization, enabling fine-tuning of large language models even on consumer-grade GPUs.
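The combination can be sketched in numpy: the frozen base weight is stored as 4-bit codes plus per-block scales (absmax quantization), dequantized on the fly for the forward pass, while the LoRA factors stay in full precision and are the only trained parameters. This is an illustrative sketch, not the actual bitsandbytes implementation (which uses the NF4 data type and packed storage).

```python
import numpy as np

def quantize_4bit(w, block=64):
    """Absmax 4-bit quantization: int codes in [-7, 7] plus one scale per block."""
    flat = w.ravel()
    pad = (-flat.size) % block
    if pad:
        flat = np.concatenate([flat, np.zeros(pad)])
    blocks = flat.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0
    codes = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return codes, scales, w.shape, pad

def dequantize_4bit(codes, scales, shape, pad):
    flat = (codes * scales).ravel()
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)

rng = np.random.default_rng(0)
d, r = 64, 4                            # illustrative sizes
W = rng.standard_normal((d, d))         # frozen base weight, stored in 4-bit
codes, scales, shape, pad = quantize_4bit(W)
A = rng.standard_normal((r, d)) * 0.01  # trainable LoRA factor (full precision)
B = np.zeros((d, r))                    # trainable LoRA factor, zero-initialized

x = rng.standard_normal(d)
# QLoRA-style forward: dequantized base output plus full-precision LoRA delta
y = dequantize_4bit(codes, scales, shape, pad) @ x + B @ (A @ x)
```

Storing the base weights at 4 bits instead of 16 cuts their memory footprint roughly fourfold, which is what makes fine-tuning feasible on a single consumer GPU.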