MoE (Mixture of Experts)

MoE (Mixture of Experts) is an architecture that places multiple "expert" subnetworks inside a model and activates only a subset of them for each input. This lets the total parameter count grow while the compute spent per token stays low.

Fast Because It Doesn't Use Everything

Why can massive models like Llama 4, and reportedly GPT-4 (OpenAI has not confirmed its architecture), run inference at relatively practical speeds? One answer is the MoE architecture.

In a standard Transformer model (a dense model), every input token passes through all parameters. For a 100B-parameter model, all 100B weights take part in every token's computation. With MoE, even if the model has 2 trillion parameters in total, only a fraction of them, say around 170B, are used for any given token. The remaining experts are skipped as "not needed this time."

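The component that decides which experts to use is the "gating network" (router). It scores every expert against the features of the input token and selects the few highest-scoring ones, typically one or two and sometimes more depending on the model. A useful mental model: logic-oriented experts are selected for math problems, while language-oriented experts are chosen for translation tasks. Treat this as a simplification, though. Routing is learned, happens per token and per layer, and experts rarely specialize along such clean topical lines.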
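The routing step can be sketched in a few lines of plain Python. This is a minimal illustration rather than any particular model's implementation: the names `route`, `moe_layer`, and `make_expert` are hypothetical, the experts are toy functions that merely scale their input, and renormalizing the scores over the selected experts with a softmax is one common choice assumed here.

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(token, router_weights, top_k=2):
    """Return [(expert_index, weight), ...] for the top_k highest-scoring experts."""
    # One logit per expert: dot product of the token with that expert's router row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Assumption: renormalize over the chosen experts only, so weights sum to 1.
    weights = softmax([logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_layer(token, router_weights, experts, top_k=2):
    """Weighted sum of the selected experts' outputs; the other experts never run."""
    out = [0.0] * len(token)
    for idx, weight in route(token, router_weights, top_k):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

def make_expert(scale):
    # Toy stand-in for a feed-forward expert: it just scales the input.
    return lambda token: [scale * x for x in token]

random.seed(0)
dim, num_experts = 4, 8
router_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
experts = [make_expert(i + 1) for i in range(num_experts)]
token = [0.5, -1.0, 0.25, 2.0]

selected = route(token, router_weights, top_k=2)
output = moe_layer(token, router_weights, experts, top_k=2)
print("selected experts and weights:", selected)
print("layer output:", output)
```

With eight experts and `top_k=2`, only two of the eight toy experts are called for this token. That is where the compute saving comes from.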

Models Currently in Use

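Meta's Llama 4 adopts this architecture in Scout (17B active / 109B total) and Maverick (17B active / 400B total). Google has said that Gemini 1.5 builds on an MoE architecture as well. Mistral's Mixtral 8x7B has eight experts in each layer and routes every token to two of them. Because the attention layers are shared rather than duplicated per expert, the total comes to about 47B parameters rather than 8 x 7B = 56B, with roughly 13B active per token.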

What these models share is that the active parameters during inference are dramatically fewer than the total parameter count. This allows them to maintain the model's knowledge capacity while keeping inference speed and cost within practical bounds.
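To make the gap concrete, the short calculation below computes the share of parameters that are active per token, using only the Llama 4 figures quoted above. The helper name `active_fraction` is just for illustration.

```python
def active_fraction(active_b, total_b):
    """Share of the model's parameters used for a single token."""
    return active_b / total_b

# (active, total) parameter counts in billions, as quoted in the text.
models = {
    "Llama 4 Scout": (17, 109),
    "Llama 4 Maverick": (17, 400),
}

fractions = {name: active_fraction(a, t) for name, (a, t) in models.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters active per token")
```

Scout uses roughly a sixth of its weights for each token, and Maverick well under a twentieth, even though both run the same 17B active parameters.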

When to Use Dense Models vs. MoE

Dense models, which use all parameters, are simple and easy to work with at small to medium scales. Fine-tuning is also straightforward. MoE truly shines at large scale. For models below several tens of billions of parameters, the extra complexity may not be worth it: routing adds moving parts, and every expert still has to be held in memory even though only a few run for each token.

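Additionally, fine-tuning MoE models requires care, because updates can disturb the router and the balance between experts in unintended ways. Combining MoE with PEFT methods such as LoRA also takes know-how of its own, for example deciding whether adapters go on the attention layers, the expert layers, or both.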