Inference-time Scaling (Test-time Compute)

Inference-time scaling is a technique that adjusts the amount of computation a model spends during inference, allocating more "thinking steps" to difficult problems while answering simpler ones immediately.

Scale Up Training, or Extend Inference?

Traditional LLM performance improvements have centered on "training-time scaling": more data, larger models, longer training runs. The evolution from GPT-3 to GPT-4 is a prime example of this approach.

Inference-time scaling operates on a different premise. Rather than increasing model size, it varies the amount of computation used at inference time based on the difficulty of the problem. "What's the weather today?" gets answered in one step, while "verify this mathematical proof" triggers dozens of steps of internal reasoning. OpenAI's o1/o3 and Anthropic's Claude with extended thinking both adopt this approach.

How It Works

The model internally generates "thinking tokens," explicitly unfolding its reasoning before arriving at a final answer. The key difference from externally prompted Chain-of-Thought (CoT) is that the model itself decides to generate long reasoning chains when a problem calls for them, without being instructed to do so in the prompt.

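Methods for controlling the compute budget vary by model. Options include setting an upper limit on the number of thinking tokens, cutting off reasoning once confidence exceeds a threshold, or running multiple reasoning paths in parallel and combining them. The combination step can be a majority vote over the final answers (often called self-consistency) or Best-of-N selection, in which a verifier or reward model scores the candidates and the highest-scoring one is kept.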
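The control methods above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular model implements them: `sample` is a hypothetical callable standing in for one model call that returns a final answer, `score` is a hypothetical stand-in for a verifier or reward model, and the threshold and sample counts are arbitrary assumptions.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent final answer among parallel samples."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate that the (hypothetical) scorer rates highest."""
    if not answers:
        raise ValueError("need at least one answer")
    return max(answers, key=score)


def sample_until_confident(
    sample: Callable[[], str],   # hypothetical: one model call -> final answer
    max_samples: int = 8,        # hard upper limit on the compute budget
    min_samples: int = 3,        # do not stop before this many samples
    threshold: float = 0.6,      # agreement share treated as "confident"
) -> Tuple[str, int]:
    """Draw samples until the leading answer's share reaches the threshold
    or the budget runs out. Returns (answer, number of samples used)."""
    answers = []
    for _ in range(max_samples):
        answers.append(sample())
        _, count = Counter(answers).most_common(1)[0]
        if len(answers) >= min_samples and count / len(answers) >= threshold:
            break
    return majority_vote(answers), len(answers)
```

Here agreement among samples serves as a rough proxy for confidence. An easy query whose samples agree immediately stops at `min_samples`, while a contested one consumes the full `max_samples` budget, which is the difficulty-dependent spending the section describes.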

Why It's Attracting Attention

Training-time scaling faces a "data wall" and a "cost wall." High-quality training data is finite, and doubling a model's size typically costs far more than twice as much, because a larger model is usually also trained on more data and is more expensive to serve for every query. Inference-time scaling, by contrast, resembles a pay-as-you-go model where costs are incurred only when needed. In production environments where the majority of queries are straightforward, this approach keeps average costs low while improving the ability to handle difficult problems.
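The pay-as-you-go argument comes down to a weighted average. The sketch below works through it with made-up numbers: the query mix, token counts, and per-token price are all illustrative assumptions, not measurements from any real system.

```python
def average_cost_per_query(
    easy_fraction: float,    # share of queries answered without long reasoning
    easy_tokens: float,      # tokens generated for an easy query
    hard_tokens: float,      # tokens generated for a hard query (incl. thinking)
    price_per_token: float,  # assumed flat price per generated token
) -> float:
    """Expected cost of one query under a two-tier easy/hard mix."""
    if not 0.0 <= easy_fraction <= 1.0:
        raise ValueError("easy_fraction must be between 0 and 1")
    expected_tokens = (
        easy_fraction * easy_tokens + (1.0 - easy_fraction) * hard_tokens
    )
    return expected_tokens * price_per_token


# Hypothetical workload: 90% easy queries at 200 tokens,
# 10% hard queries at 5,000 tokens, at an assumed price of 0.00001 per token.
adaptive = average_cost_per_query(0.9, 200, 5_000, 1e-5)

# Baseline: spend the full reasoning budget on every query.
always_think = average_cost_per_query(0.0, 200, 5_000, 1e-5)

print(f"adaptive: {adaptive:.4f}, always-think: {always_think:.4f}")
```

Under these assumed numbers the adaptive policy averages about 680 tokens per query against 5,000 for the always-think baseline, so the gap widens as the share of easy queries grows.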

As of 2026, "hybrid scaling," which combines training-time and inference-time scaling, is becoming the mainstream approach.