SLM (Small Language Model) is a general term for language models with roughly a few billion to around ten billion parameters, characterized by the ability to run inference and fine-tuning with far fewer computational resources than LLMs.
## "Small" Does Not Mean Weak

In the world of LLMs, "bigger means smarter" has long been conventional wisdom. Compared to GPT-4's estimated 1.8 trillion parameters, SLMs sit at around 1B–10B, a difference of two orders of magnitude. Since 2025, however, this conventional wisdom has been rapidly crumbling. Microsoft's Phi-4 (14B) has achieved scores rivaling GPT-4o on several reasoning benchmarks, and Google's Gemma 3 family (1B–27B) delivers very high performance per parameter for its size. Through improvements in model architecture and the curation of high-quality training data, "small but sufficient for specific tasks" has become a reality.

## Where Are They Being Used?

SLMs have three primary battlegrounds.

**Edge devices**: Environments with limited GPU resources, such as smartphones, IoT gateways, and embedded systems. Apple's on-device inference running on iPhones is a prime example of SLMs in action.

**Cost optimization**: Using GPT-4-class models for routine tasks like classification, summarization, and data extraction is overkill. With SLMs, inference costs can often be cut to less than one-tenth.

**Latency requirements**: Scenarios demanding responses within tens of milliseconds, such as real-time chat, voice response, and game AI. With fewer parameters, inference is faster by orders of magnitude.

## How to Use SLMs and LLMs Differently

LLMs still hold the advantage when general-purpose capability is needed: complex reasoning, multilingual support, and long-form generation. When a task can be narrowed down, however, a fine-tuned SLM can win on accuracy, speed, and cost all at once. In practice, a standard workflow is emerging: prototype with an LLM API first, then once the task is well defined, distill it into an SLM to reduce costs. Distillation refers to the technique of training a smaller model using the outputs of a larger model as teacher data.
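The distillation objective can be sketched in a few lines. The following is a minimal, framework-free illustration (function names are my own, not from any library): the student is trained to minimize the KL divergence between its output distribution and the teacher's temperature-softened distribution.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature scaling; higher T gives softer targets."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.

    This is the core objective of knowledge distillation: the student
    learns to match the teacher's full output distribution, not just
    its top label."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(t, s) if p > 0)

# A student that already matches the teacher incurs zero loss;
# a mismatched student incurs a positive loss to minimize.
teacher = [2.0, 0.5, -1.0]
aligned = distillation_loss(teacher, [2.0, 0.5, -1.0])
mismatch = distillation_loss(teacher, [-1.0, 0.5, 2.0])
```

In a real pipeline this loss is typically combined with the standard cross-entropy on ground-truth labels, and the logits come from the teacher LLM's and student SLM's forward passes.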


A2A (Agent-to-Agent Protocol), published by Google in April 2025, is a communication protocol that enables different AI agents to perform capability discovery, task delegation, and state synchronization.
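As an illustration of capability discovery, an A2A agent publishes an "agent card" describing what it can do, which other agents fetch before delegating tasks. The sketch below models a hypothetical card as plain Python data; the field names are approximate and the exact schema should be checked against the official A2A specification.

```python
# A hypothetical A2A agent card: metadata an agent publishes so that
# other agents can discover its capabilities. Field names and the URL
# are illustrative, not taken from the normative schema.
agent_card = {
    "name": "translation-agent",
    "description": "Translates documents between English and Japanese",
    "url": "https://agents.example.com/translate",  # assumed endpoint
    "capabilities": {"streaming": True},
    "skills": [
        {
            "id": "translate-doc",
            "name": "Document translation",
            "description": "Translate a document while preserving formatting",
        }
    ],
}

def supports_skill(card, skill_id):
    """Capability discovery: check whether an agent advertises a given skill."""
    return any(s["id"] == skill_id for s in card.get("skills", []))
```

A delegating agent would call something like `supports_skill(agent_card, "translate-doc")` before sending a task, falling back to another agent when the check fails.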

Acceptance testing is a testing method that verifies, from the perspective of the product owner and stakeholders, whether developed features meet business requirements and user stories.
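Concretely, an acceptance test exercises a user story end to end rather than an internal unit. A minimal sketch, where the shop logic and discount code are invented purely for illustration:

```python
# User story: "As a shopper, I can apply a discount code so that
# my order total is reduced." The function below is a stand-in
# for the real system under test.

def checkout_total(prices, discount_code=None):
    """Invented example logic: 'SAVE10' takes 10% off the subtotal."""
    total = sum(prices)
    if discount_code == "SAVE10":
        total *= 0.9
    return round(total, 2)

def test_shopper_can_apply_discount_code():
    # Acceptance criterion, phrased from the user's perspective:
    # applying SAVE10 to a 100.00 cart yields a 90.00 total.
    assert checkout_total([60.0, 40.0], discount_code="SAVE10") == 90.0

def test_total_without_code_is_unchanged():
    assert checkout_total([60.0, 40.0]) == 100.0
```

The point is that each test maps to an acceptance criterion a stakeholder can read, not to an implementation detail such as which data structure stores the cart.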

Agent Skills are reusable instruction sets defined to enable AI agents to perform specific tasks or areas of expertise, functioning as modular units that extend the capabilities of an agent.
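One common packaging convention (used, for example, by Anthropic's Agent Skills) is a directory containing a `SKILL.md` file whose frontmatter tells the agent when to load the skill. The example below is illustrative only, not an exact schema:

```markdown
---
name: code-reviewer
description: Reviews pull requests for style and common bugs.
  Use when the user asks for a code review.
---

# Code Review Skill

1. Read the diff and identify the changed functions.
2. Check naming, error handling, and test coverage.
3. Report findings grouped by severity.
```

Because the skill is just files, it can be versioned, shared, and composed with other skills without modifying the agent itself.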


Agentic AI is a general term for AI systems that interpret goals and autonomously repeat the cycle of planning, executing, and verifying actions without requiring step-by-step human instruction.
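The plan, execute, verify cycle that defines agentic AI can be sketched as a plain control loop. Everything here is a toy invented for illustration: the "planner" simply proposes increments toward a numeric goal, standing in for an LLM's action selection.

```python
def run_agent(goal, max_steps=10):
    """Minimal agentic loop: plan an action, execute it, verify progress,
    and repeat until the goal is met or the step budget runs out."""
    state = 0

    def plan(state):
        # Planning: choose the next action given the current state.
        return "increment"

    def execute(state, action):
        # Execution: apply the chosen action to the environment.
        return state + 1 if action == "increment" else state

    def verify(state):
        # Verification: has the goal been reached?
        return state >= goal

    for step in range(max_steps):
        action = plan(state)
        state = execute(state, action)
        if verify(state):
            return {"done": True, "steps": step + 1, "state": state}
    return {"done": False, "steps": max_steps, "state": state}
```

In a real agent, `plan` would be a model call, `execute` a tool invocation, and `verify` a check against the goal; the step budget guards against the loop running away, which is why production agent frameworks expose a similar limit.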