Speculative Decoding

An inference acceleration technique in which a small draft model speculatively proposes several tokens in advance, and a large model verifies them in parallel.
What is Speculative Decoding?
Speculative Decoding speeds up LLM inference, often by roughly 2–3×: a small "draft model" proposes several tokens in advance, and a large "verification model" checks them in parallel, accepting or rejecting each one.
Overview of the Mechanism
Standard LLM inference generates tokens one at a time, and each token requires a full forward pass. The larger the model, the greater the computational cost per step and the slower the response. Speculative Decoding alleviates this sequential bottleneck.
- The draft model (small and fast) cheaply generates several tokens ahead
- The verification model (large and high-accuracy) validates the whole proposed token sequence in a single pass
- Tokens that pass verification are accepted as-is; at the first rejected token, the verification model supplies its own token instead, and the remaining drafted tokens are discarded
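The three steps above can be sketched with toy models. This is a minimal illustration under stated assumptions, not a production implementation: `target_next` and `draft_next` are hypothetical stand-ins for real models, decoding is greedy, and the "single pass" of the verification model is simulated with a loop where a real system would use one batched forward pass.

```python
from typing import Callable, List, Sequence, Tuple

Model = Callable[[Sequence[int]], int]  # context -> greedy next token

def target_next(ctx: Sequence[int]) -> int:
    """Hypothetical large model: an arbitrary deterministic rule."""
    return (3 * ctx[-1] + len(ctx)) % 11

def draft_next(ctx: Sequence[int]) -> int:
    """Hypothetical small model: agrees with the target except at every
    fourth context length, where it deliberately guesses wrong."""
    tok = target_next(ctx)
    return (tok + 1) % 11 if len(ctx) % 4 == 0 else tok

def greedy_decode(prompt: List[int], n_new: int, target: Model) -> List[int]:
    """Baseline: one verification-model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens

def speculative_decode(prompt: List[int], n_new: int, k: int,
                       draft: Model, target: Model) -> Tuple[List[int], int]:
    """Returns (tokens, number of verification passes)."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < n_new:
        # Step 1: the draft model proposes k tokens.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # Step 2: the verification model scores all k + 1 positions.
        # (Simulated here; assumed to be one batched pass in practice.)
        passes += 1
        verified = [target(tokens + proposal[:i]) for i in range(k + 1)]
        # Step 3: keep the longest agreeing prefix, then append the
        # verification model's own token (a correction, or a bonus token
        # if every drafted token was accepted).
        n_ok = 0
        while n_ok < k and proposal[n_ok] == verified[n_ok]:
            n_ok += 1
        tokens.extend(proposal[:n_ok])
        tokens.append(verified[n_ok])
    return tokens[:len(prompt) + n_new], passes

spec_tokens, spec_passes = speculative_decode([1, 2], 20, 4, draft_next, target_next)
base_tokens = greedy_decode([1, 2], 20, target_next)
```

In this toy setup, `spec_tokens` matches `base_tokens`, while `spec_passes` is well below the 20 calls the baseline needs.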
The higher the probability that the draft model's proposals are correct, the fewer times the verification model needs to be called, and the greater the speedup.
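To make this relationship concrete, here is a small sketch under a simplifying assumption that is not stated in the text above: each drafted token is accepted independently with the same probability `alpha`, and `k` tokens are drafted per pass. Under that assumption, the expected number of tokens produced per verification pass is 1 + alpha + ... + alpha^k. The function name and parameters are illustrative.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass, assuming each of the
    k drafted tokens is accepted independently with probability alpha.

    Each pass always yields one token from the verification model, plus
    the run of accepted draft tokens before the first rejection.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every drafted token accepted, plus one bonus
    # Closed form of the geometric sum 1 + alpha + ... + alpha**k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

low = expected_tokens_per_pass(0.5, 4)
high = expected_tokens_per_pass(0.8, 4)
```

A higher acceptance rate gives more tokens per pass (`high > low`), so fewer verification passes are needed for the same output length. Real acceptance rates vary from token to token, so this is only a rough guide.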
Impact on Output Quality
An important point is that Speculative Decoding does not alter the output distribution of the verification model. With the standard acceptance rule, the generated tokens follow exactly the same distribution as sampling from the verification model alone (and under greedy decoding the output is identical token for token), so speed improves without sacrificing quality.
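The sketch below illustrates why the distribution is preserved for a single token position, using the commonly described acceptance rule: a token x sampled from the draft distribution q is accepted with probability min(1, p(x)/q(x)), where p is the verification model's distribution, and on rejection a replacement is sampled from the normalized positive part of p - q. The function names and the example distributions are illustrative assumptions; exact fractions are used so the equality can be checked without rounding.

```python
import random
from fractions import Fraction as F
from typing import List

def output_distribution(p: List[F], q: List[F]) -> List[F]:
    """Exact distribution of the emitted token when x ~ q is accepted with
    probability min(1, p[x]/q[x]) and otherwise resampled from the residual."""
    # Probability of drafting x AND accepting it: q[x] * min(1, p[x]/q[x]).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1 - sum(accepted)
    residual = [max(F(0), pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual)
    if z == 0:  # p == q: drafted tokens are never rejected
        return accepted
    return [a + reject_prob * r / z for a, r in zip(accepted, residual)]

def accept_or_resample(x: int, p: List[F], q: List[F], rng: random.Random) -> int:
    """Sampling version of the same rule for one drafted token x (q[x] > 0)."""
    if rng.random() < float(min(F(1), p[x] / q[x])):
        return x
    weights = [float(max(F(0), pi - qi)) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=weights)[0]

# Example distributions over a three-token vocabulary (made up for illustration).
p = [F(1, 2), F(3, 10), F(1, 5)]   # verification model
q = [F(1, 5), F(1, 2), F(3, 10)]   # draft model
result = output_distribution(p, q)
```

Here `result` equals `p` exactly, even though `q` differs noticeably from `p`: a poor draft model lowers the acceptance rate, and therefore the speedup, but not the quality of the output.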
Suitable Use Cases
This technique is particularly effective in scenarios where low latency is desired while maintaining the high accuracy of a large model, such as real-time chatbot responses and code completion. Because it can also reduce GPU cost per generated token, it is worth considering for production systems where inference cost is a concern.