An inference acceleration technique in which a small draft model speculatively proposes multiple tokens in advance, and a large model verifies them in parallel.
Speculative Decoding is a technique that accelerates inference by roughly 2–3× by having a small "draft model" propose multiple tokens in advance, while a large "verification model" validates them in parallel, accepting or rejecting each one.
Standard LLM inference generates tokens one at a time sequentially, meaning the larger the model, the greater the computational cost per step and the slower the response. Speculative Decoding alleviates this sequential bottleneck.
The higher the draft model's acceptance rate, the more tokens are confirmed in each verification pass, so fewer sequential forward passes of the large model are needed per generated token, and the greater the speedup.
An important point is that Speculative Decoding does not alter the output distribution of the verification model: rejected draft tokens are resampled from a corrected residual distribution, so the procedure is mathematically guaranteed to produce the same output distribution as running the verification model alone, meaning speed is improved without any sacrifice in quality.
This technique is particularly effective in scenarios where low latency is desired while maintaining the high accuracy of a large model — such as real-time chatbot responses and code completion. Since it also leads to reduced GPU costs, it is a technique worth considering for production systems where inference cost is a concern.
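The accept/reject step described above can be sketched as follows. This is a minimal illustration with toy probability tables rather than real models, and the function name `speculative_round` is hypothetical; the key point is the `min(1, p/q)` acceptance rule and the residual resampling that preserve the verification model's distribution.

```python
import random

def speculative_round(draft_probs, target_probs, draft_tokens, rng=random.random):
    """One verification round of speculative decoding (toy sketch).

    draft_probs[i][t]  : draft model's probability of token t at position i
    target_probs[i][t] : verification model's probability at position i
                         (computed for all positions in a single parallel pass)
    draft_tokens[i]    : token the draft model proposed at position i
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng() < min(1.0, p / q):  # accept draft token with prob min(1, p/q)
            accepted.append(tok)
        else:
            # Rejected: resample from the residual distribution
            # r(t) ∝ max(0, p(t) − q(t)), which restores the target distribution.
            residual = {t: max(0.0, target_probs[i][t] - draft_probs[i][t])
                        for t in target_probs[i]}
            z = sum(residual.values())
            u, acc = rng() * z, 0.0
            for t, w in residual.items():
                acc += w
                if u <= acc:
                    accepted.append(t)
                    break
            return accepted  # stop speculating at the first rejection
    return accepted
```

In the full algorithm, when every draft token is accepted, one extra "bonus" token is also sampled from the verification model's distribution at the next position, so each large-model pass yields at least one token.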


A Sparse Model is a general term for neural network architectures that activate only a subset of the model's parameters during inference, rather than all of them. A representative example is MoE (Mixture of Experts), which adopts a scaling strategy distinct from that of Dense Models — increasing the total parameter count while keeping inference costs low.
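The routing idea behind MoE can be sketched in a few lines. This is a toy example with scalar inputs and hypothetical names (`moe_forward`, `gate_w`): a gate scores all experts, but only the top-k are actually executed, so most parameters stay inactive on any given input.

```python
import math

def moe_forward(x, gate_w, experts, top_k=2):
    """Toy sparse Mixture-of-Experts layer for a scalar input x.

    The gate scores every expert, but only the top_k highest-scoring
    experts are run; their outputs are mixed with softmax weights.
    """
    scores = [w * x for w in gate_w]  # gate logits, one per expert
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    exps = [math.exp(scores[i]) for i in top]
    weights = [e / sum(exps) for e in exps]  # softmax over the selected experts only
    return sum(w * experts[i](x) for w, i in zip(weights, top))
```

With, say, 8 experts and top_k=2, only a quarter of the expert parameters are touched per token, which is how total capacity can grow while inference cost stays roughly flat.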

An optimization technique that compresses model size by reducing parameter precision from 16-bit to 4-bit or similar, enabling inference with limited computational resources.
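As a concrete illustration, here is a minimal sketch of symmetric round-to-nearest quantization to 4 bits (function names are hypothetical; real implementations typically quantize per channel or per group rather than over a whole tensor):

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest quantization to 4-bit integer codes.

    Each weight is mapped to one of 16 levels (-8..7). Only the integer
    codes plus one float scale need to be stored, shrinking 16-bit
    weights to roughly a quarter of their original size.
    """
    scale = max(abs(w) for w in weights) / 7.0  # map the largest |w| to code 7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int4(codes, scale):
    """Reconstruct approximate float weights from the 4-bit codes."""
    return [c * scale for c in codes]
```

The reconstruction error per weight is bounded by half a quantization step (scale / 2), which is the precision traded away for the smaller memory footprint.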

A prompting technique that improves accuracy on complex tasks by having the LLM explicitly generate intermediate reasoning steps.
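A minimal example of what such a prompt can look like (the questions and the constant name `COT_PROMPT` are made up for illustration): the few-shot example spells out its intermediate arithmetic, nudging the model to reason step by step before giving a final answer.

```python
# Hypothetical few-shot Chain-of-Thought prompt. The worked example
# includes intermediate reasoning steps, not just the final answer.
COT_PROMPT = """\
Q: A shop sells pens at 3 dollars each. How much do 4 pens cost?
A: Each pen costs 3 dollars. 4 pens cost 4 x 3 = 12 dollars. The answer is 12.

Q: A train travels 60 km per hour. How far does it go in 3 hours?
A:"""
```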
