Test-Time Compute: Why Reasoning Models Scale Differently

Inference-time scaling rewrites the cost model for production AI. How o-series, DeepSeek-R1 and thinking models trade compute for accuracy.


For most of the deep learning era, the deal was simple: you paid for intelligence once, at training time, and inference was cheap and roughly constant. A bigger model cost more to train and more to serve, but each query was a fixed, predictable amount of compute. Reasoning models broke that deal. They scale a different axis — how much the model thinks at inference — and that single change rewrites the cost model for anything you put in production.

This is a piece about that shift and its bill. If your AI implementation budget assumes per-query cost is a constant, reasoning models will surprise you, and not always pleasantly.

Two places to spend compute

There have always been two distinct budgets in a model’s life. Training compute is paid once and amortized across every future query. Inference compute is paid per request, forever. The old scaling laws — Chinchilla and its lineage — were almost entirely about the first budget: how to spend training FLOPs to get the lowest loss.

Reasoning models move the lever to the second budget. Instead of producing an answer in one forward pass, they generate a long internal chain of thought — exploring, backtracking, checking — before committing. More thinking tokens means more inference compute means, on hard problems, better answers. You are no longer buying capability solely at training time. You are buying it again, per query, at the moment of asking.

The clean theoretical statement of this is Snell and colleagues’ Scaling LLM Test-Time Compute Optimally. Their headline result is sharp: on problems where a smaller base model already has a non-trivial success rate, optimally allocated test-time compute can outperform a model 14x larger in a FLOPs-matched comparison. Read that carefully — it says you can sometimes substitute inference compute for parameters. That is a genuinely different scaling regime, and it is why a 70B reasoning model can beat a much larger non-reasoning one on math while being cheaper to train.

The substitution has a hard limit, though, and the same paper draws it. The trade only works when the base model is already in the neighborhood of the answer — when it has a non-trivial chance of getting there with more attempts or more deliberation. On problems genuinely beyond the model’s reach, no amount of inference compute conjures the capability; you are sampling harder from a distribution that does not contain the solution. Test-time compute amplifies latent ability. It does not create ability that was never there. That boundary is exactly where the production decisions live: it tells you which workloads a cheaper reasoning model can rescue and which ones still demand a bigger base.
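The boundary is easiest to see in the simplest test-time strategy: draw several independent samples and let a verifier pick a correct one if any exists. The sketch below assumes independent attempts and a perfect verifier, which is a simplification rather than the setup of the paper. The names `pass_at_n`, `p` (per-attempt success rate) and `n` (number of samples) are illustrative.

```python
def pass_at_n(p: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct.

    Assumes each attempt succeeds with probability p and that a perfect
    verifier recognizes a correct sample when one exists.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Out of reach (p = 0), barely in reach (p = 0.01), comfortably in reach (p = 0.2).
    for p in (0.0, 0.01, 0.2):
        curve = [round(pass_at_n(p, n), 3) for n in (1, 4, 16, 64)]
        print(f"p={p}: {curve}")
```

With `p = 0` the curve stays at zero for every `n`, which is the "distribution that does not contain the solution" case. Any positive `p` climbs toward one as samples are added, and it climbs faster the higher the starting rate.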

The model families that proved it

Three lineages turned this from a paper into a product category.

OpenAI’s o-series. o1 was the first widely deployed model trained to use extended chains of thought, with accuracy that rises as you let it think longer. o3 pushed the same lever harder: more reinforcement learning on reasoning traces plus, reportedly, search over candidate solutions at inference. The o3 ARC-AGI result is the canonical demonstration: per the ARC Prize writeup, a high-compute configuration hit 87.5% on a benchmark where GPT-4o had managed about 5%. The catch is the cost: the low-compute runs alone were on the order of $17 to $20 per task. Capability bought at inference time has a per-question price tag, and here it was steep.

DeepSeek-R1. The DeepSeek-R1 release showed that this behavior can emerge from reinforcement learning without elaborate supervised scaffolding: the R1-Zero variant was trained with pure RL and no supervised fine-tuning, while the full R1 adds a small cold-start supervised stage and reaches performance on par with o1. Just as important for engineers: R1 exposes its full chain of thought. You can read the false starts, the self-corrections, the verification steps. For debugging a production reasoning pipeline, a visible trace is the difference between an auditable system and a black box, and for a regulated Hospital Management System, that auditability is not optional.

Open thinking models and s1. The s1 paper demonstrated “simple test-time scaling” through budget forcing: you extend a model’s thinking by appending “Wait” when it tries to stop, and you truncate it by forcing the end-of-thinking delimiter once the budget is spent. Both directions need minimal machinery. It is a reminder that a lot of the gain is mechanically simple: let the model keep going, and on the right problems it keeps improving.
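A minimal sketch of that control loop follows. It is not the s1 implementation. The `generate` callable, the `</think>` delimiter and word-based counting are all stand-ins chosen for illustration, and a real system would count tokens and call an actual model.

```python
from typing import Callable

END_THINK = "</think>"  # hypothetical end-of-thinking delimiter


def budget_force(
    prompt: str,
    generate: Callable[[str, int], str],
    min_words: int,
    max_words: int,
    max_extensions: int = 2,
) -> str:
    """Return a thinking trace bounded by max_words, nudged past min_words.

    generate(context, limit) is a hypothetical model call that returns more
    thinking text and stops early when the model would end on its own.
    """
    if min_words > max_words:
        raise ValueError("min_words must not exceed max_words")
    trace: list[str] = []
    extensions = 0
    while len(trace) < max_words:
        remaining = max_words - len(trace)
        chunk = generate(prompt + " " + " ".join(trace), remaining).split()
        trace.extend(chunk[:remaining])  # hard cap: truncate at the budget
        if len(trace) >= min_words or extensions >= max_extensions:
            break
        trace.append("Wait")  # model stopped too early: force more thinking
        extensions += 1
    return " ".join(trace) + " " + END_THINK


if __name__ == "__main__":
    # Toy model that always "stops" after three words.
    toy = lambda context, limit: "step step step"
    print(budget_force("question", toy, min_words=8, max_words=20))
```

The cap on `max_extensions` matters in practice: without it, a model that keeps stopping early would be pushed to deliberate indefinitely.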

More thinking is not always better

Here is the part the marketing skips. Test-time compute does not monotonically improve accuracy. Past a point, on the wrong problems, it actively hurts.

Two failure modes are well documented. The first is overthinking: on easy questions, reasoning models burn enormous token budgets second-guessing a correct early answer, and sometimes talk themselves out of it. The second is underthinking, characterized in Thoughts Are All Over the Place — the model flits between half-developed lines of reasoning, abandoning promising paths before finishing any of them. Either way, more inference compute buys you a worse answer and a bigger bill.

This breaks the comfortable intuition that thinking longer is strictly safer. It is not. The compute-optimal strategy is adaptive — spend tokens in proportion to actual difficulty — which is exactly what Snell et al. found and exactly what naive “always think hard” deployment gets wrong. A reasoning model pointed at a stream of trivial classification tasks is a way to set money on fire.
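A toy model makes the gap between uniform and adaptive spending visible. The sketch assumes each problem has a known per-attempt success rate and that attempts are independent with a perfect verifier. Neither assumption comes from Snell et al., and `greedy_allocate` is an illustrative name for a simple marginal-gain rule, not their method.

```python
import heapq


def expected_solved(ps: list[float], ns: list[int]) -> float:
    """Expected number of problems solved with ns[i] samples on problem i."""
    return sum(1.0 - (1.0 - p) ** n for p, n in zip(ps, ns))


def greedy_allocate(ps: list[float], budget: int) -> list[int]:
    """Give each sample to the problem where it adds the most expected value.

    The marginal gain of the next sample on a problem with success rate p
    and n samples already spent is p * (1 - p) ** n.
    """
    ns = [0] * len(ps)
    if not ps:
        return ns
    heap = [(-p, i) for i, p in enumerate(ps)]  # negate for a max-heap
    heapq.heapify(heap)
    for _ in range(budget):
        _, i = heapq.heappop(heap)
        ns[i] += 1
        heapq.heappush(heap, (-ps[i] * (1.0 - ps[i]) ** ns[i], i))
    return ns


if __name__ == "__main__":
    ps = [0.9, 0.3, 0.05, 0.0]  # easy, medium, hard, out of reach
    uniform = [4, 4, 4, 4]
    adaptive = greedy_allocate(ps, budget=16)
    print("uniform ", uniform, round(expected_solved(ps, uniform), 3))
    print("adaptive", adaptive, round(expected_solved(ps, adaptive), 3))
```

Under these assumptions the adaptive plan spends almost nothing on the easy problem, nothing on the impossible one, and concentrates samples where an extra attempt still changes the outcome.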

There is also a reliability cost that does not show up on a benchmark. A long chain of thought is a long chain of opportunities to go wrong. Each step the model takes is another place it can latch onto a bad assumption and then rationalize toward it, producing an answer that is confidently and elaborately incorrect. A short answer that is wrong is easy to catch; a wrong answer wrapped in two thousand tokens of plausible reasoning is not. For workloads where a confident-but-wrong output is worse than an honest “I am not sure” — clinical decision support, financial reconciliation — the length of the reasoning trace is itself a risk surface you have to bound, not just a cost to pay.

What this does to your production cost model

For teams running this in anger, the implications are concrete and mostly about money and latency.

Cost variance, not just cost. A reasoning model’s per-query cost is a distribution, not a number. A hard query might emit ten or fifty times the tokens of an easy one. Your unit economics now depend on the difficulty mix of incoming traffic, which is much harder to forecast than a flat per-call rate. Capacity planning that assumes constant per-request cost will be wrong in both directions.
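To see why a single average hides the problem, here is a small sketch that summarizes per-query cost as a distribution. The traffic mix (90 easy queries at 300 output tokens, 10 hard ones at 50 times that) and the price of $10 per million output tokens are made-up numbers for illustration, not any provider's rates.

```python
import statistics


def cost_summary(output_tokens: list[int], usd_per_million: float) -> dict[str, float]:
    """Mean, median and nearest-rank 95th percentile of per-query cost in USD."""
    if not output_tokens:
        raise ValueError("need at least one query")
    costs = sorted(t * usd_per_million / 1_000_000 for t in output_tokens)
    # Nearest-rank p95 with integer arithmetic: ceil(0.95 * n) - 1.
    p95_index = (95 * len(costs) + 99) // 100 - 1
    return {
        "mean": statistics.fmean(costs),
        "median": statistics.median(costs),
        "p95": costs[p95_index],
    }


if __name__ == "__main__":
    # Hypothetical mix: 90% easy queries, 10% hard queries emitting 50x the tokens.
    traffic = [300] * 90 + [15_000] * 10
    for name, value in cost_summary(traffic, usd_per_million=10.0).items():
        print(f"{name}: ${value:.4f}")
```

In this mix the mean sits several times above the median and the tail is far above both. A forecast built on either single number misprices the traffic as soon as the share of hard queries shifts.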

Latency becomes a product decision. Thinking takes wall-clock time. A model that reasons for thirty seconds is unusable in an interactive loop and fine in a batch pipeline. Most reasoning APIs now expose a thinking-effort control, and choosing it is a product call, not an infra detail. A School ERP generating overnight analytics can spend the compute; the same ERP’s autocomplete cannot.

Route, don’t default. The single biggest lever is not using a reasoning model for everything. The right architecture is a router: a cheap, fast model handles the bulk of easy traffic, and only genuinely hard queries escalate to the expensive reasoning path. This is core Operational Automation work — classify difficulty, route accordingly, cap the thinking budget per tier. Get it right and you capture the accuracy gains on the queries that need them while paying flat-rate cost on the ones that do not.
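The routing idea can be sketched in a few lines. Everything named here is hypothetical: the tier names, the model identifiers, the thresholds and the thinking budgets are placeholders, and the difficulty score is assumed to come from an upstream classifier that is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    model: str                # hypothetical model identifier
    max_thinking_tokens: int  # per-query cap on the thinking budget


# (upper difficulty bound, tier), ordered from cheapest to most expensive.
TIERS: list[tuple[float, Tier]] = [
    (0.3, Tier("fast", "small-fast-model", 0)),
    (0.7, Tier("standard", "mid-size-model", 2_000)),
    (1.0, Tier("reasoning", "reasoning-model", 16_000)),
]


def route(difficulty: float) -> Tier:
    """Pick the cheapest tier whose bound covers the difficulty score.

    difficulty is assumed to be a score in [0, 1] from an upstream classifier.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    for upper_bound, tier in TIERS:
        if difficulty <= upper_bound:
            return tier
    return TIERS[-1][1]


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.95):
        tier = route(score)
        print(score, tier.name, tier.model, tier.max_thinking_tokens)
```

Keeping the thresholds and caps in one table makes them easy to tune against observed traffic. The cap per tier bounds the worst-case cost of a misrouted query.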

Where this leaves the architecture

Test-time compute changes how you think about model choice at a structural level. The question is no longer just “which model is smartest.” It is “which model is smartest per dollar at the latency my product can tolerate” — and the answer depends on your traffic, because the same reasoning model is a bargain on hard queries and a ripoff on easy ones.

Two principles survive contact with production. First, instrument token consumption per query from day one. You cannot manage cost variance you cannot see, and reasoning models make per-query tokens the metric that actually drives your bill. Pipe it into the same dashboards you use for the rest of your Data Platforms. Second, build the difficulty router before you need it. Retrofitting routing onto a system that defaults every call to a reasoning model is painful; designing for tiered escalation from the start is cheap.
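As a starting point for the first principle, per-query token accounting can be as small as the sketch below. `TokenMeter` and its fields are illustrative names, not part of any SDK, and a real deployment would export these figures to its metrics system instead of holding them in memory.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TokenMeter:
    """In-memory record of tokens consumed per query, grouped by routing tier."""

    by_tier: dict[str, list[int]] = field(default_factory=lambda: defaultdict(list))

    def record(self, tier: str, thinking_tokens: int, answer_tokens: int) -> None:
        # Thinking tokens are billed like output, so they are tracked together.
        self.by_tier[tier].append(thinking_tokens + answer_tokens)

    def report(self) -> dict[str, dict[str, int]]:
        return {
            tier: {
                "queries": len(tokens),
                "total_tokens": sum(tokens),
                "max_tokens": max(tokens),
            }
            for tier, tokens in self.by_tier.items()
        }


if __name__ == "__main__":
    meter = TokenMeter()
    meter.record("fast", thinking_tokens=0, answer_tokens=120)
    meter.record("reasoning", thinking_tokens=9_500, answer_tokens=400)
    meter.record("reasoning", thinking_tokens=1_200, answer_tokens=300)
    print(meter.report())
```

Grouping by tier is a design choice that ties the two principles together: the same breakdown that shows where tokens go also shows whether the router is sending traffic where intended.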

The deeper shift is conceptual. We spent a decade treating inference as a solved, constant-cost step at the end of the pipeline. Reasoning models reopened it as a place where capability is bought and sold, dynamically, per request. That is more powerful and more dangerous — it means a sloppy deployment can quietly 50x its own compute bill, and a careful one can match a much larger model’s quality at a fraction of the training cost. The teams that win the next eighteen months are the ones treating inference compute as a budget to manage, not a footnote.

