Inference-time Scaling (Test-time Compute)

Inference-time scaling is a technique that adjusts the amount of computation a model spends during inference, allocating more "thinking steps" to difficult problems while answering simpler ones immediately.
Scale Up Training, or Extend Inference?
Traditional LLM performance improvements have centered on "training-time scaling": more data, larger models, longer training runs. The evolution from GPT-3 to GPT-4 is a prime example of this approach.
Inference-time scaling operates on a different premise. Rather than increasing model size, it varies the amount of computation used at inference time based on the difficulty of the problem. "What's the weather today?" gets answered in one step, while "verify this mathematical proof" triggers dozens of steps of internal reasoning. OpenAI's o1/o3 and Anthropic's Claude with extended thinking both adopt this approach.
How It Works
The model internally generates "thinking tokens," explicitly unfolding its reasoning process before arriving at a final answer. The key difference from Chain-of-Thought (CoT) prompting is that the model produces long reasoning chains on its own when a problem calls for them, rather than in response to an explicit instruction in the prompt.
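Methods for controlling the compute budget vary by model. Options include setting an upper limit on the number of reasoning tokens, cutting off reasoning once confidence exceeds a threshold, or running multiple reasoning paths in parallel and then either taking a majority vote over their answers (self-consistency) or keeping the candidate that a scorer or verifier rates highest (Best-of-N).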
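The budget controls described above can be illustrated with a minimal Python sketch. It is not how any particular model implements them: `sample_answer` and `score` are hypothetical callables standing in for a model call and a verifier, and the default values (`max_samples=16`, `min_samples=4`, `confidence_threshold=0.75`) are arbitrary assumptions. The sketch caps the number of sampled reasoning paths, stops early once the leading answer's vote share crosses the threshold, and separately shows selection of the highest-scoring candidate.

```python
from collections import Counter
from typing import Callable, List, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and its share of all votes."""
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)


def sample_with_budget(
    sample_answer: Callable[[], str],   # hypothetical: one model call -> final answer
    max_samples: int = 16,              # assumed hard cap on compute
    min_samples: int = 4,               # assumed minimum before trusting the vote
    confidence_threshold: float = 0.75, # assumed vote share that counts as "confident"
) -> Tuple[str, float, int]:
    """Sample answers until the vote is confident or the budget runs out.

    Returns (answer, vote share, number of samples actually used).
    """
    answers: List[str] = []
    for _ in range(max_samples):
        answers.append(sample_answer())
        if len(answers) >= min_samples:
            winner, share = majority_vote(answers)
            if share >= confidence_threshold:
                # Easy case: the samples agree, so stop spending compute.
                return winner, share, len(answers)
    # Budget exhausted: fall back to whatever answer leads the vote.
    winner, share = majority_vote(answers)
    return winner, share, len(answers)


def best_of_n(candidates: List[str], score: Callable[[str], float]) -> str:
    """Keep the candidate that the (hypothetical) scorer rates highest."""
    return max(candidates, key=score)
```

With a sampler that always returns the same answer, `sample_with_budget` stops after `min_samples` calls. A sampler whose answers disagree keeps running until the threshold is met or `max_samples` is reached, which is the "more compute for harder problems" behavior in miniature.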
Why It's Attracting Attention
Training-time scaling faces a "data wall" and a "cost wall." High-quality training data is finite, and doubling a model's size costs far more than twice as much, because the amount of training data is typically scaled up along with the parameter count. Inference-time scaling, by contrast, resembles a pay-as-you-go model in which extra cost is incurred only when a query needs it. In production environments where most queries are straightforward, this keeps the average cost per query low while improving the ability to handle difficult problems.
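The pay-as-you-go argument comes down to an expected-value calculation, sketched below in Python. Every number here is a made-up assumption for illustration (the share of hard queries, the token counts, and the per-token price), not a measurement or a real price list.

```python
def average_cost_per_query(
    hard_fraction: float,    # assumed share of queries that need long reasoning
    easy_tokens: float,      # assumed tokens generated for an easy query
    hard_tokens: float,      # assumed tokens generated for a hard query
    price_per_token: float,  # hypothetical price, in any currency unit
) -> float:
    """Expected cost of one query when only hard queries get extra compute."""
    if not 0.0 <= hard_fraction <= 1.0:
        raise ValueError("hard_fraction must be between 0 and 1")
    expected_tokens = (1.0 - hard_fraction) * easy_tokens + hard_fraction * hard_tokens
    return expected_tokens * price_per_token


# Illustrative scenario: 5% of queries are hard.
adaptive = average_cost_per_query(0.05, easy_tokens=200, hard_tokens=10_000,
                                  price_per_token=1e-5)
# Baseline: every query is given the full reasoning budget.
always_think = average_cost_per_query(1.0, easy_tokens=200, hard_tokens=10_000,
                                      price_per_token=1e-5)
print(f"adaptive: {adaptive:.4f}  always-think: {always_think:.4f}")
```

Under these assumed numbers the adaptive policy averages roughly 690 tokens per query against 10,000 for the always-think baseline. The gap narrows as the share of hard queries grows.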
As of 2026, "hybrid scaling," which combines training-time and inference-time scaling, is becoming the mainstream approach.