LLM (Large Language Model) is a general term for neural network models pre-trained on massive amounts of text data, containing billions to trillions of parameters, capable of understanding and generating natural language with high accuracy.
Since the public release of ChatGPT in November 2022, the term LLM has spread beyond engineers to the general public. The essence conveyed by the name "large language model," however, is simple: it is a model trained by feeding it large amounts of text and repeatedly having it predict the next word. What makes LLMs fascinating is that this straightforward learning objective gives rise to a diverse range of emergent capabilities (translation, summarization, code generation, reasoning, and more), while theoretical understanding has yet to fully catch up.

To give a concrete sense of scale: GPT-3 has 175 billion parameters (2020), Llama 3 has 70 billion (2024), and GPT-4, though undisclosed, is estimated to exceed 1 trillion. While models generally become more capable as parameter count increases, the fact that Llama 3 70B outperforms GPT-3 175B on many benchmarks shows that training-data quality and architectural improvements are equally, if not more, important.

There are three main routes for using LLMs in practice.

1. **Via API.** Call models from OpenAI or Anthropic directly. This is the most straightforward approach, but data is sent to an external party, and costs must be managed under pay-as-you-go pricing.
2. **Combining with RAG.** Retrieve internal documents and pass them to the LLM, leveraging internal knowledge while suppressing hallucinations (outputs that contradict the facts). Since the model itself is not modified, the barrier to adoption is low.
3. **Fine-tuning.** Adjust the model's behavior using proprietary data. This is effective when a consistent response tone or accurate use of industry-specific terminology is required, but it entails preparing training data and paying GPU costs.
Which route to choose depends on "what problem you are trying to solve," and cases where all three are used in combination are becoming increasingly common.
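Of the three routes, RAG is the easiest to illustrate in code. The sketch below is a deliberately minimal toy: it scores documents by keyword overlap and splices the best matches into the prompt before it would be sent to an LLM. The document store, scoring function, and prompt wording are all illustrative assumptions; production systems use vector embeddings and a proper search index instead.

```python
# Minimal RAG sketch (toy): retrieve relevant internal documents by
# keyword overlap, then assemble a grounded prompt for the LLM.

def score(query: str, doc: str) -> int:
    """Count query words that also appear in the document (toy relevance)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents with the highest overlap score."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Put the retrieved context first, then the user's question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )

# Hypothetical internal documents, purely for illustration.
docs = [
    "Expense reports must be filed within 30 days of purchase.",
    "The cafeteria is open from 11:30 to 14:00 on weekdays.",
    "Remote work requires manager approval in the HR portal.",
]
prompt = build_prompt("When must expense reports be filed?", docs)
print(prompt)
```

Because the model only sees the retrieved context, its answer is anchored to internal knowledge it was never trained on, which is what keeps hallucinations in check.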


A local LLM refers to a deployment model in which a large language model runs directly on one's own server or PC, without going through a cloud API.

SLM (Small Language Model) is a general term for language models with a parameter count limited to approximately a few billion to ten billion, characterized by the ability to perform inference and fine-tuning with fewer computational resources compared to LLMs.

LoRA (Low-Rank Adaptation) is a technique that inserts low-rank delta matrices into the weight matrices of large language models and trains only those deltas, enabling fine-tuning by adding approximately 0.1–1% of the total model parameters.
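The 0.1–1% figure follows directly from the shapes involved: for a d×d weight matrix W, LoRA trains a delta B·A with B of shape d×r and A of shape r×d, so the trainable fraction is 2r/d. The arithmetic below uses an illustrative hidden dimension and rank (not taken from any specific model):

```python
# Back-of-the-envelope LoRA parameter count. The layer size and rank
# are illustrative assumptions, not values from a particular model.
d = 4096          # hypothetical hidden dimension of one weight matrix
r = 8             # LoRA rank

full_params = d * d          # frozen base weight W: d x d
lora_params = 2 * d * r      # trained delta B @ A, with B: d x r, A: r x d

ratio = lora_params / full_params   # equals 2r / d
print(f"full: {full_params:,}  lora: {lora_params:,}  ratio: {ratio:.2%}")
# 2 * 8 / 4096 ≈ 0.39%, inside the 0.1–1% range cited above
```

Raising the rank r trades a larger trainable fraction for more expressive deltas; with r = 8 and d = 4096, only about 0.4% of the layer's parameters are updated.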

Local LLM / SLM Deployment Comparison — AI Utilization Without Cloud API Dependency

Prompt engineering is the practice of designing the structure, phrasing, and context of input text (prompts) in order to elicit desired outputs from LLMs (Large Language Models).
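In practice, the "structure, phrasing, and context" of a prompt are often managed as a template. The sketch below assembles a role instruction, an output constraint, and a few-shot example into one input string; the field names and wording are illustrative conventions, not a fixed standard.

```python
# Sketch of a structured prompt template: role, constraint, and
# few-shot examples are assembled into a single LLM input string.

def build_prompt(task: str, examples: list[tuple[str, str]]) -> str:
    """Compose role + constraint + few-shot examples + the actual task."""
    shots = "\n".join(f"Input: {i}\nOutput: {o}" for i, o in examples)
    return (
        "You are a concise technical assistant.\n"  # role
        "Answer in one sentence.\n"                 # output constraint
        f"{shots}\n"                                # few-shot examples
        f"Input: {task}\nOutput:"                   # the task itself
    )

prompt = build_prompt(
    "Summarize: LoRA trains only low-rank delta matrices.",
    [("Summarize: RAG retrieves documents before generation.",
      "RAG grounds the model's answer in retrieved documents.")],
)
print(prompt)
```

Ending the template with `Output:` nudges the model to continue the established input/output pattern rather than restate the instructions.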