Inference-time scaling is a technique that dynamically increases or decreases the amount of computation used during a model's inference phase, allocating more "thinking steps" to difficult problems while providing immediate answers to simpler ones.
Traditional LLM performance improvements have centered on "training-time scaling": more data, larger models, longer training runs. The evolution from GPT-3 to GPT-4 is a prime example of this approach.
Inference-time scaling operates on a different premise. Rather than increasing model size, it varies the amount of computation used at inference time based on the difficulty of the problem. "What's the weather today?" gets answered in one step, while "verify this mathematical proof" triggers dozens of steps of internal reasoning. OpenAI's o1/o3 and Anthropic's Claude with extended thinking both adopt this approach.
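The allocation idea above can be sketched as a tiny router. Everything here is a hypothetical illustration (the marker list, the budget numbers, and the `thinking_budget` function are all made up); real systems estimate difficulty with the model itself, not keyword matching:

```python
def thinking_budget(prompt: str) -> int:
    """Hypothetical heuristic router: grant a large reasoning-token
    budget to prompts that look like hard, multi-step tasks, and
    none to simple lookups."""
    hard_markers = ("prove", "verify", "derive", "step by step")
    if any(marker in prompt.lower() for marker in hard_markers):
        return 8000  # allow long internal reasoning
    return 0         # answer directly, no extra thinking tokens

print(thinking_budget("What's the weather today?"))       # 0
print(thinking_budget("Verify this mathematical proof"))  # 8000
```

Production APIs expose the same knob more directly, e.g. as a configurable budget of thinking tokens per request.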
The model internally generates "thinking tokens," explicitly unfolding the reasoning process before arriving at a final answer. The key difference from externally prompting Chain-of-Thought (CoT) is that the model itself generates long reasoning chains as needed, without external instruction.
Methods for controlling the compute budget vary by model. Options include setting an upper limit on the number of thinking tokens, cutting off reasoning once confidence exceeds a threshold, sampling multiple reasoning paths in parallel and taking a majority vote over their final answers (self-consistency), or generating N candidate answers and selecting the best one with a verifier (Best-of-N).
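The majority-vote strategy can be sketched in a few lines. This is a minimal illustration, not a real pipeline: `sample_answer` is a stand-in for one sampled LLM reasoning path (a real system would call the model with temperature > 0), and its error pattern is contrived so the example is deterministic:

```python
from collections import Counter

def sample_answer(question: str, seed: int) -> str:
    """Stand-in for one sampled reasoning path. Simulates a noisy
    solver that returns a wrong answer on 1 in 4 samples."""
    if seed % 4 == 0:
        return str(seed % 10)  # occasional wrong answer
    return "42"                # correct answer

def majority_vote(question: str, n: int = 16) -> str:
    """Self-consistency: sample n independent reasoning paths and
    return the most frequent final answer."""
    answers = [sample_answer(question, seed) for seed in range(n)]
    return Counter(answers).most_common(1)[0][0]

print(majority_vote("What is 6 * 7?"))  # "42"
```

Raising `n` spends more inference compute in exchange for a more reliable answer, which is exactly the trade-off inference-time scaling exposes.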
Training-time scaling faces a "data wall" and a "cost wall." High-quality training data is finite, and doubling a model's size costs far more than simply twice as much. Inference-time scaling, by contrast, resembles a pay-as-you-go model where costs are incurred only when needed. In production environments where the majority of queries are straightforward, this approach allows average costs to be kept low while improving the ability to handle difficult problems.
As of 2026, "hybrid scaling"—combining both training-time and inference-time scaling—is becoming the mainstream approach.

