A Sparse Model is a general term for neural network architectures that activate only a subset of the model's parameters during inference, rather than all of them. A representative example is MoE (Mixture of Experts), which adopts a scaling strategy distinct from that of Dense Models: increasing the total parameter count while keeping the cost of inference low.
## The Meaning of "Sparsity"

In the context of neural networks, "sparse" refers to a state in which only a small fraction of a network's connections or parameters are actually used. While a Dense Model uses all of its parameters for every input, a Sparse Model activates a different small subset of parameters depending on the input. An intuitive analogy is a large library: a Dense Model is a librarian who re-reads the entire collection for every question, while a Sparse Model is a librarian who consults only the shelves relevant to that question.

## Relationship with MoE

The dominant form of Sparse Model today is the MoE architecture. In MoE, a router assigns each input token to a small number of experts (typically 2–4), and the experts that are not selected skip computation entirely. Sparse Models are not limited to MoE, however: "unstructured sparsity," which sets the majority of weights to zero, and techniques that dynamically disable specific attention heads also fall within the category. MoE is simply the most practically mature form among them.

## Criteria for Choosing Between Sparse and Dense Models

The advantage of Sparse Models is clear: they can hold more "knowledge" at the same inference cost. Mixtral 8x7B has 46.7B total parameters but only 12.9B active parameters, so its inference cost is comparable to that of a 13B-class Dense Model while its performance approaches that of a 70B-class model.

There are also challenges. Designing effective load balancing among experts is difficult, and when inputs concentrate on a few experts, the benefits of sparsity diminish. Furthermore, all experts must still be loaded into GPU memory, so memory efficiency is less straightforward than the active-parameter count alone suggests.
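The routing mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's implementation: a router scores every expert per token, only the top-k experts are kept, their gate weights are renormalized with a softmax, and the unselected experts do no work at all. The function names (`top_k_routing`, `moe_forward`) are invented for this sketch.

```python
import numpy as np

def top_k_routing(logits: np.ndarray, k: int = 2):
    """Select the top-k experts per token and renormalize their gate weights.

    logits: (num_tokens, num_experts) router scores.
    Returns (indices, weights), each of shape (num_tokens, k).
    """
    # Indices of the k highest-scoring experts for each token
    idx = np.argsort(logits, axis=-1)[:, -k:]
    top = np.take_along_axis(logits, idx, axis=-1)
    # Softmax over only the selected experts (standard top-k gating)
    e = np.exp(top - top.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return idx, weights

def moe_forward(x: np.ndarray, experts: list, router_w: np.ndarray, k: int = 2):
    """Sparse MoE layer: each token is processed by only k of the experts."""
    logits = x @ router_w                      # (num_tokens, num_experts)
    idx, w = top_k_routing(logits, k)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            # Experts not in idx[t] skip computation for this token entirely
            out[t] += w[t, j] * experts[idx[t, j]](x[t])
    return out
```

With 8 experts and k=2, each token pays for 2 expert evaluations regardless of how many experts exist, which is exactly how total capacity grows while per-token compute stays flat.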


A Dense Model is a neural network architecture in which all of the model's parameters are used for computation during inference. In contrast to MoE (Mixture of Experts), which activates only a subset of experts, a Dense Model always involves all weights in computation regardless of the input.

MoE (Mixture of Experts) is an architecture that contains multiple "expert" subnetworks within a model, activating only a subset of them for each input, thereby increasing the total number of parameters while keeping inference costs low.
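The cost advantage can be made concrete with a little arithmetic. A sketch, assuming per-token cost is the shared (non-expert) parameters plus top_k of num_experts expert blocks; the shared-parameter value of 1.63B used below is back-solved from Mixtral 8x7B's published 46.7B total / 12.9B active figures, not an official number.

```python
def active_params(total: float, shared: float, num_experts: int, top_k: int) -> float:
    """Parameters touched per token (in billions): shared layers plus
    top_k of num_experts expert blocks."""
    expert_pool = total - shared               # parameters held by all experts combined
    return shared + (top_k / num_experts) * expert_pool

# Mixtral 8x7B: 46.7B total, top-2 of 8 experts, shared ~1.63B (estimated)
print(round(active_params(46.7, 1.63, num_experts=8, top_k=2), 1))  # -> 12.9
```

Note the dense limit: if every parameter is shared (`shared == total`), the formula collapses to the total, which is exactly the Dense Model case.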

SLM (Small Language Model) is a general term for language models with parameter counts ranging from roughly a few billion up to about ten billion, characterized by the ability to perform inference and fine-tuning with far fewer computational resources than LLMs require.
