Speculative Decoding

An inference acceleration technique in which a small draft model proposes several tokens ahead of time and a large model verifies them in parallel.

What is Speculative Decoding?

Speculative Decoding speeds up inference by having a small "draft model" propose several tokens in advance, while a large "verification model" checks them in parallel and accepts or rejects each one. Speedups of roughly 2–3× are commonly cited, though the actual gain depends on how often the draft model's proposals are accepted.

Overview of the Mechanism

Standard LLM inference generates tokens one at a time, and each token requires a full forward pass through the model. The larger the model, the greater the computational cost per step and the slower the response. Speculative Decoding alleviates this sequential bottleneck.

  1. The draft model (small and fast) generates a short run of candidate tokens ahead of the verification model
  2. The verification model (large and high-accuracy) checks the whole proposed token sequence in a single forward pass
  3. Tokens that pass verification are accepted as-is; at the first rejected token, the remaining draft tokens are discarded and the verification model supplies the token for that position, after which drafting resumes
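
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not a production implementation: both models are hypothetical toy functions (`toy_draft`, `toy_target`) that map a token prefix to a single greedy next token, decoding is greedy rather than sampled, and the parallel verification pass is simulated with an ordinary loop. The helper names `speculative_step`, `generate`, and `greedy` are invented for this example.

```python
from typing import Callable, List

# A "model" here is any function from a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it produces."""
    # Step 1: the draft model proposes k tokens, one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # Step 2: the target model checks every proposed position.
    # A real system does this in one batched forward pass; the loop
    # below only simulates that pass.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # Step 3: first mismatch. Drop the rest of the draft and
            # keep the target model's own token for this position.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft token matched, so the same pass also yields one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft: Model, target: Model, k: int, n_new: int) -> List[int]:
    """Generate n_new tokens after prefix using repeated speculative rounds."""
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + n_new]


def greedy(prefix: List[int], target: Model, n_new: int) -> List[int]:
    """Baseline: ordinary one-token-at-a-time decoding with the target model."""
    out = list(prefix)
    for _ in range(n_new):
        out.append(target(out))
    return out


# Hypothetical toy models over an 11-token vocabulary (assumes a non-empty prefix).
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] * 7 + 3) % 11


def toy_draft(prefix: List[int]) -> int:
    # Agrees with the target except when the prefix length is a multiple of 3.
    t = toy_target(prefix)
    return t if len(prefix) % 3 else (t + 1) % 11


prompt = [1]
# Speculative generation reproduces the baseline output exactly.
assert generate(prompt, toy_draft, toy_target, k=4, n_new=20) == greedy(prompt, toy_target, 20)
```

Each round emits at least one token from the verification model, so generation always makes progress even when every draft token is rejected.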

The higher the probability that the draft model's proposals are correct, the fewer times the verification model needs to be called, and the greater the speedup.
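
To make this relationship concrete, the sketch below estimates how many tokens a single verification pass yields on average. It rests on an assumption not stated in the text above: each draft token is accepted independently with the same probability `alpha`, and the draft model proposes `gamma` tokens per round. Both names are chosen for this example. Under that assumption the count is a truncated geometric sum, 1 + alpha + alpha^2 + ... + alpha^gamma.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens produced per verification pass.

    Assumes each of the gamma draft tokens is accepted independently
    with probability alpha (an idealisation; real acceptance rates
    vary with context).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    # The i-th term is the probability that the first i draft tokens were all
    # accepted; i = 0 accounts for the token the verifier always contributes.
    return sum(alpha ** i for i in range(gamma + 1))


for alpha in (0.5, 0.8, 0.95):
    print(alpha, round(expected_tokens_per_pass(alpha, gamma=4), 3))
```

With `alpha = 0` every pass yields exactly one token, which is no better than standard decoding, while with `alpha = 1` every pass yields `gamma + 1` tokens. The real speedup is smaller than this ratio because the draft model's own running time is not free.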

Impact on Output Quality

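An important point is that Speculative Decoding does not alter the output distribution of the verification model. With greedy decoding, the generated tokens are identical to those the verification model would produce on its own; with sampling, the accept/reject rule is constructed so that the generated tokens follow exactly the same probability distribution. Speed is therefore improved without any sacrifice in quality.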
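
The property can be checked numerically for a single token position. The sketch below assumes the accept/reject rule commonly used in speculative sampling, which the text above does not spell out: a draft token x is accepted with probability min(1, p(x)/q(x)), where p is the verification model's distribution and q the draft model's, and on rejection a token is drawn from max(0, p - q) renormalised. The function name and the example distributions are illustrative only. Exact fractions are used so that the comparison involves no rounding.

```python
from fractions import Fraction
from typing import List


def speculative_output_distribution(p: List[Fraction], q: List[Fraction]) -> List[Fraction]:
    """Exact distribution of the token emitted by one accept/reject step.

    p: verification model's distribution over the vocabulary.
    q: draft model's distribution over the same vocabulary.
    """
    # Mass that arrives by acceptance: q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1 - sum(accepted)

    # On rejection, sample from the residual max(0, p - q), renormalised.
    residual = [max(pi - qi, Fraction(0)) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0:
        # p and q are identical, so no draft token is ever rejected.
        return accepted
    return [a + reject_prob * r / total for a, r in zip(accepted, residual)]


# Illustrative distributions over a three-token vocabulary.
p = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
q = [Fraction(1, 5), Fraction(1, 5), Fraction(3, 5)]

# The emitted token follows p exactly, even though q is a poor match for p.
assert speculative_output_distribution(p, q) == p
```

A poor draft distribution lowers the acceptance rate, and with it the speedup, but under this rule it does not change what the verification model would have produced.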

Suitable Use Cases

This technique is particularly effective in scenarios where low latency is desired while maintaining the high accuracy of a large model, such as real-time chatbot responses and code completion. Because the large model runs fewer sequential passes per generated token, it can also reduce GPU cost, which makes it worth considering for production systems where inference cost is a concern.