MoE (Mixture of Experts)

MoE (Mixture of Experts) is an architecture that contains multiple "expert" subnetworks within a model, activating only a subset of them for each input, thereby increasing the total number of parameters while keeping inference costs low.
Fast Because It Doesn't Use Everything
Why can massive models like Llama 4, and reportedly GPT-4 (whose architecture has not been officially disclosed), perform inference at relatively practical speeds? One answer is the MoE architecture.
In a standard Transformer (a dense model), every input token passes through all parameters: for a 100B-parameter model, all 100B weights take part in computing every token. With MoE, a model can have, say, 2 trillion total parameters while using only around 170B of them for any given token; the remaining experts are skipped as "not needed this time."
The component that decides which experts to use is the "gating network" (router). In each MoE layer, it scores the experts against the features of the input token and selects a small number of the highest-scoring ones (top-k routing), typically between one and a few per token. A useful mental model: logic-oriented experts are selected for math problems, while language-oriented experts are chosen for translation tasks. Treat this as an intuition only, since the specialization is learned during training and is often less clear-cut than the analogy suggests.
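The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how any particular model implements it: the names `route_token` and `moe_layer`, the toy gate matrix, and the four scalar "experts" are all hypothetical, and real routers work on batched tensors and add load-balancing terms during training.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(token, gate_weights, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    # One logit per expert: dot product of the token with that expert's gate row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalize over the selected experts only, so their weights sum to 1.
    weights = softmax([logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(token, gate_weights, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(token)
    for idx, w in route_token(token, gate_weights, k):
        y = experts[idx](token)  # experts that were not selected are never called
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy setup (hypothetical): four "experts" that simply scale the input.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
token = [0.5, 2.0]

selected = route_token(token, gate_weights, k=2)
output = moe_layer(token, gate_weights, experts, k=2)
print(selected)
print(output)
```

With `k=2`, only two of the four expert functions run for this token, which is the source of the compute savings: the cost of the layer scales with the number of selected experts, not with the number that exist.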
Models Currently in Use
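Meta's Llama 4 adopts this architecture in Scout (17B active / 109B total) and Maverick (17B active / 400B total). Google's Gemini series is also reported to be MoE-based. Mistral's Mixtral 8x7B has eight experts in each MoE layer and routes every token to two of them. Despite the name, it does not hold 8 × 7B = 56B parameters: the attention layers are shared across experts, so the total is about 47B, with about 13B active per token.
What these models share is that the parameters active for each token are dramatically fewer than the total parameter count. This allows them to maintain the model's knowledge capacity while keeping per-token compute, and therefore inference speed and cost, within practical bounds. Note that all experts still have to be held in memory, so the savings apply to compute rather than to memory footprint.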
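To make the gap between active and total parameters concrete, the figures quoted in this article can be turned into ratios. The snippet below is only arithmetic on those numbers; the helper name `active_fraction` is hypothetical, and the "2T example" row is the illustrative case from earlier in the article, not a specific released model.

```python
def active_fraction(active_params, total_params):
    """Share of the model's parameters that are used for a single token."""
    return active_params / total_params

# (active, total) parameter counts as quoted in the article.
models = {
    "Llama 4 Scout": (17e9, 109e9),
    "Llama 4 Maverick": (17e9, 400e9),
    "Hypothetical 2T example": (170e9, 2e12),
}

fractions = {name: active_fraction(a, t) for name, (a, t) in models.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters active per token")
```

The ratio is a rough proxy for how much per-token compute an MoE model saves relative to a dense model of the same total size; it ignores routing overhead and communication between devices.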
When to Use Dense Models vs. MoE
Dense models, which use all parameters, are simple and easy to work with at small to medium scales. Fine-tuning is also straightforward. MoE is an architecture that truly shines at large scale; for models below several tens of billions of parameters, the added overhead (routing, balancing load across experts, and the memory needed to hold every expert) may not be worth it.
Additionally, fine-tuning MoE models requires care, because updates can have unintended effects across all experts and on how the router distributes tokens among them. Combining MoE with PEFT methods such as LoRA demands its own set of know-how.