Scaling Laws and AGI Timelines

Every few months someone plots a loss curve, extends the line to the right, and declares a date for artificial general intelligence. The line always looks convincing. It is also the wrong object to extrapolate. Scaling laws are some of the most reliable empirical results in machine learning — and almost everything people infer from them about AGI timelines is a category error.

This is a piece for engineers who have to make build-or-wait decisions under that uncertainty. If your AI implementation roadmap assumes a capability cliff in eighteen months, you are betting your architecture on a misreading of what the curves measure.

What the laws actually claim

The modern story starts with Kaplan and collaborators at OpenAI in 2020. Their scaling laws paper showed that test loss falls as a smooth power law in three quantities — parameters, dataset size, and compute — across many orders of magnitude. No phase changes, no surprises: more of any input buys you a predictable decrement in loss.

Two years later DeepMind sharpened the picture. The Chinchilla paper by Hoffmann et al. fit the loss surface as L(N, D) = A·N^(-α) + B·D^(-β) + E and asked a sharper question: given a fixed compute budget, how should you split it between model size N and training tokens D? The answer overturned prevailing practice. Most large models of the era were badly undertrained. GPT-3 saw roughly 1.7 tokens per parameter; the compute-optimal ratio is closer to 20. To prove it, DeepMind trained a 70B model, Chinchilla, on 1.4T tokens — same compute as the 280B Gopher — and it won across the board, including a more than seven-point jump on MMLU.
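
To make the tokens-per-parameter arithmetic concrete, here is a minimal sketch. It assumes the common approximation that training compute is about 6·N·D FLOPs, which the text above does not state, and it takes the 175B-parameter, 300B-token figures for GPT-3 and the 300B-token figure for Gopher from the respective papers rather than from this article. The function names are illustrative.

```python
import math

# Assumption: training FLOPs ~ 6 * parameters * tokens (a rule of thumb, not exact).
FLOPS_PER_PARAM_TOKEN = 6
# Rough compute-optimal ratio quoted in the text.
TOKENS_PER_PARAM = 20


def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


def training_flops(n_params, n_tokens):
    """Approximate training compute under the 6*N*D rule of thumb."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal_split(flops, ratio=TOKENS_PER_PARAM):
    """Split a compute budget into (parameters, tokens) at a fixed ratio.

    From C = 6 * N * D and D = ratio * N it follows N = sqrt(C / (6 * ratio)).
    """
    n_params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * ratio))
    return n_params, ratio * n_params


# (parameters, training tokens); GPT-3 and Gopher token counts are assumptions
# taken from their papers, not from this article.
models = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

for name, (n, d) in models.items():
    print(f"{name}: {tokens_per_param(n, d):.1f} tokens/param, "
          f"{training_flops(n, d):.2e} FLOPs")

# Feeding Chinchilla's own budget back in should recover roughly 70B / 1.4T.
n_opt, d_opt = compute_optimal_split(training_flops(70e9, 1.4e12))
print(f"optimal split: {n_opt:.2e} params, {d_opt:.2e} tokens")
```

Under this approximation the Gopher and Chinchilla budgets land within the same ballpark rather than matching exactly, which is about as much precision as the rule of thumb deserves.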

That result has held up under scrutiny. Epoch AI’s replication attempt found errors in the original paper’s confidence intervals but confirmed the central finding: parameters and data should scale in roughly equal proportion. This is the single most actionable fact in the literature. It is also the one most often ignored when people talk about AGI.

It is worth being precise about what kind of claim this is. The power-law form is empirical, not derived from first principles. Nobody has a theory that predicts the exponents α and β — they are fit to data, and they hold across many orders of magnitude with unsettling regularity. That regularity is what makes the curves so seductive to extrapolate. But an empirical fit is only valid inside the regime where it was measured. Push far enough outside it — to compute budgets nobody has yet spent, or architectures nobody has yet trained — and you are no longer reading the law. You are guessing that the law still applies, which is a different and much weaker thing to assert.
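
A toy illustration of why the measured regime matters, assuming entirely synthetic numbers: the "true" loss below has an irreducible floor, a pure power law is fit to five points well above that floor, and the fit is then pushed eight orders of magnitude past the data. The floor, scale, and exponent are made up for the sketch and do not come from any paper.

```python
import math


def true_loss(compute, floor=1.7, scale=10.0, exponent=0.3):
    """Synthetic ground truth: a power law sitting on an irreducible floor."""
    return floor + scale * compute ** (-exponent)


def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y). Returns (log_a, slope)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict(log_a, slope, x):
    """Evaluate the fitted pure power law a * x**slope."""
    return math.exp(log_a + slope * math.log(x))


# "Measured" regime: compute from 1 to 10^4 in arbitrary units.
measured = [1e0, 1e1, 1e2, 1e3, 1e4]
log_a, slope = fit_power_law(measured, [true_loss(c) for c in measured])

# Extrapolate far outside the data.
far = 1e12
extrapolated = predict(log_a, slope, far)
actual = true_loss(far)

print(f"fitted slope: {slope:.3f}")
print(f"extrapolated loss at {far:.0e}: {extrapolated:.3f}")
print(f"actual loss at {far:.0e}:       {actual:.3f}")
```

The fit looks respectable inside the measured range and still predicts a loss below the floor once it is extended far enough. Nothing in the five data points warns you that this will happen.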


Loss is not capability

Here is the move that breaks every timeline argument. Scaling laws predict loss — the average surprise of the model on held-out text. They say nothing direct about whether the model can prove a theorem, debug a race condition, or run a multi-step agent without losing the thread.

The relationship between loss and downstream capability is not a law. It is a tangle of thresholds, emergent jumps, and benchmark-specific quirks. A model can shave another tenth of a nat off its cross-entropy and gain nothing you can sell, or it can cross some representational threshold and suddenly do arithmetic it could not do at the previous checkpoint. We have decent theories for why loss falls. We have almost no predictive theory for when a given capability appears as loss falls.

This matters for anyone shipping AI into production. A hospital management system that needs reliable medication-interaction reasoning does not care about perplexity. It cares about a specific behavior crossing a reliability bar. The scaling laws will not tell you the date that happens. They will tell you, at best, that throwing more compute at the base model is not obviously wasted, which is a much weaker claim than the timelines imply.

The emergence debate makes this worse before it makes it better. Some researchers argue that the apparent “emergent” capability jumps are partly an artifact of the metrics we choose — switch from a harsh all-or-nothing score to a smoother one and the cliff softens into a ramp. Others hold that genuine discontinuities exist. Both camps agree on the operational consequence: you cannot read off, from a loss number alone, whether the next training run crosses a threshold your product depends on. The mapping is empirical, discovered after the fact, one capability at a time. That is the opposite of a timeline.
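
The metric-artifact argument can be sketched in a few lines. The assumption here, which is a simplification, is that a model gets each token of a ten-token answer right independently with the same probability. Per-token accuracy then rises smoothly while the all-or-nothing exact-match score looks like a cliff.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance of getting every token right, assuming independent token errors."""
    return per_token_accuracy ** answer_length


# Hypothetical per-token accuracies at successive model scales.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]

for p in per_token:
    print(f"per-token {p:.2f} -> exact match {exact_match_rate(p, 10):.3f}")
```

The same underlying improvement reads as a gentle ramp on one metric and as a sudden jump on the other. That is the point of the artifact argument, and it says nothing about whether genuine discontinuities also exist.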

The honest framing: extrapolating a loss curve gives you a confident prediction about a quantity nobody actually wants, and a vague hand-wave about the quantity everybody does.

The data wall is the real constraint

If you want a timeline-relevant fact from the scaling literature, it is not the slope of the loss curve. It is the supply of training data.

Chinchilla-optimal scaling has an uncomfortable implication: to keep riding the compute curve, you need data and parameters to grow together. But high-quality human text is finite. Epoch AI’s analysis, “Will we run out of data?”, projects that the stock of human-generated public text will be fully used somewhere between 2026 and 2032, depending on how aggressively labs overtrain. Their earlier, gloomier estimate has since been pushed later, but the wall is real and it is close.

This is why the compute-versus-data tension is the most load-bearing part of any serious timeline. Compute has been growing at roughly 4x per year. Data has not. When the two decouple, compute-optimal scaling stops being free: with no fresh tokens to add, extra compute has to go into repeated passes over the same data or into parameters that outrun the token supply, and the second of those is exactly the undertrained regime Chinchilla warned against. Epoch’s “Can AI scaling continue through 2030?” lays out the binding constraints: power, chips, data, and latency, each with its own ceiling.
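
A back-of-envelope version of the decoupling, under stated assumptions: if compute is about 6·N·D and tokens scale in proportion to parameters, token demand grows with the square root of compute, so 4x compute per year means roughly 2x tokens per year. The dataset size and stock size below are placeholders chosen for round arithmetic, not estimates from this article or from Epoch.

```python
import math


def token_demand_growth(compute_growth_per_year):
    """Yearly growth in compute-optimal tokens.

    With C ~ 6 * N * D and D proportional to N, D scales as sqrt(C).
    """
    return math.sqrt(compute_growth_per_year)


def years_until_stock_exhausted(current_tokens, stock_tokens,
                                compute_growth_per_year):
    """Years until a single compute-optimal run needs the whole stock."""
    growth = token_demand_growth(compute_growth_per_year)
    return math.log(stock_tokens / current_tokens) / math.log(growth)


# Placeholder inputs: a 15T-token run today, a 300T-token usable stock.
years = years_until_stock_exhausted(
    current_tokens=15e12,
    stock_tokens=300e12,
    compute_growth_per_year=4.0,
)
print(f"token demand grows {token_demand_growth(4.0):.1f}x per year")
print(f"stock exhausted in about {years:.1f} years")
```

The useful part is the sensitivity, not the answer: because demand doubles yearly, a stock twice as large buys only one extra year.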

Three escape routes are on the table, and engineers should treat all three as unproven rather than inevitable:

Synthetic data. Models generating their own training corpus. It works in narrow, verifiable domains — code that compiles, math that checks — and risks model collapse where it does not. The verifier is the bottleneck, not the generator.

Multimodal data. Video and images expand the effective token stock by orders of magnitude, but it is unclear how much that helps text reasoning specifically.

Better sample efficiency. Squeezing more capability per token. This is where the real research frontier sits, and it is precisely the part that scaling laws, by construction, do not predict.
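
The "verifier is the bottleneck" point about synthetic data can be sketched with a deliberately trivial domain. Everything here is hypothetical: the `Candidate` type, the hard-coded "generated" samples, and the addition check stand in for a real generator and a real verifier such as a compiler or proof checker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A generated training example: two operands and the claimed answer."""
    a: int
    b: int
    claimed_sum: int


def verify(candidate):
    """Cheap, exact verifier. Real domains rarely have one this good."""
    return candidate.a + candidate.b == candidate.claimed_sum


def filter_verified(candidates):
    """Keep only the samples the verifier accepts."""
    return [c for c in candidates if verify(c)]


# Stand-in for model output; the middle sample is wrong on purpose.
generated = [
    Candidate(2, 3, 5),
    Candidate(10, 7, 18),
    Candidate(41, 1, 42),
]

kept = filter_verified(generated)
print(f"kept {len(kept)} of {len(generated)} generated samples")
```

Generating more candidates is the cheap part. The quality of the resulting corpus is capped by what `verify` can actually check, which is why open-ended text is so much harder than code or arithmetic.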

Why the curves keep bending

There is a deeper reason to distrust straight-line extrapolation: the laws themselves are not fixed. Every change to the training recipe (better optimizers, mixture-of-experts routing, improved data curation) shifts the fitted constants, and sometimes the exponents, in the loss equation. The curve you measured last year was for last year’s recipe.

This cuts both ways. Pessimists who say “scaling is dead” because one axis saturated are ignoring that the field routinely discovers a new axis to scale. Optimists who draw a clean line to AGI are ignoring that the axis they are extrapolating may not be the one that matters. The most defensible position is that loss will keep falling along whatever frontier is cheapest to push, and that the mapping from that frontier to general capability remains the open problem.


What this means for build decisions

Strip away the AGI framing and the scaling laws give you a few genuinely useful operating principles.

First, do not over-index on parameter count. A well-trained smaller model in the Chinchilla regime will beat a bloated, undertrained one at equal training compute, and it is cheaper to serve. For most data platform and operational automation workloads, the right model is smaller than the marketing suggests.

Second, treat capability as something you measure, not forecast. Build an internal eval suite that tracks the specific behaviors your product depends on: a school ERP needs reliable structured extraction far more than it needs another point of MMLU. Re-run it on every new release. The release cadence of frontier models is now your real timeline, not any extrapolated curve.
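
One possible shape for such an eval suite, as a minimal sketch. The `EvalCase` type, the `run_suite` and `regressions` helpers, and the stub model are all hypothetical names invented for this example. A real suite would call an actual model and use far stricter checks than substring matching.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One prompt plus a pass/fail check, tagged with the behavior it probes."""
    behavior: str
    prompt: str
    check: Callable[[str], bool]


def run_suite(model, cases):
    """Return the pass rate per behavior for a model callable (prompt -> text)."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for case in cases:
        total[case.behavior] += 1
        if case.check(model(case.prompt)):
            passed[case.behavior] += 1
    return {behavior: passed[behavior] / total[behavior] for behavior in total}


def regressions(old_scores, new_scores, tolerance=0.0):
    """Behaviors whose pass rate dropped by more than the tolerance."""
    return [b for b in old_scores
            if new_scores.get(b, 0.0) < old_scores[b] - tolerance]


def stub_model(prompt):
    """Stand-in for a real model call; only handles the 'grade' prompt."""
    return '{"grade": "7"}' if "grade" in prompt else ""


cases = [
    EvalCase("structured_extraction",
             "Extract the grade as JSON: 'Year 7, Room 2'",
             lambda out: '"grade"' in out),
    EvalCase("structured_extraction",
             "Extract the room as JSON: 'Year 7, Room 2'",
             lambda out: '"room"' in out),
]

scores = run_suite(stub_model, cases)
print(scores)
print(regressions({"structured_extraction": 1.0}, scores))
```

Storing the per-behavior scores for each release turns "did the new model get better?" into a diff you can read, instead of a vendor claim you have to trust.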

Third, architect for substitution. The single safest bet the scaling literature supports is that capability per dollar keeps improving along some axis. Systems that can swap the underlying model without a rewrite capture that improvement for free. Systems welded to one model’s quirks pay to migrate every time the frontier moves.
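
A minimal sketch of what architecting for substitution can look like in code, assuming a single-method text interface. `TextModel`, the two stub adapters, `Summarizer`, and the registry are hypothetical names for illustration. Real adapters would wrap a vendor SDK behind the same method.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only surface application code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class UppercaseStub:
    """Stand-in adapter for one provider."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


class ReverseStub:
    """Stand-in adapter for a different provider."""

    def complete(self, prompt: str) -> str:
        return prompt[::-1]


class Summarizer:
    """Application logic written against the interface, not a vendor."""

    def __init__(self, model: TextModel) -> None:
        self._model = model

    def summarize(self, text: str) -> str:
        return self._model.complete(f"Summarize: {text}")


# Swapping models becomes a configuration change rather than a rewrite.
REGISTRY = {"stub-a": UppercaseStub, "stub-b": ReverseStub}


def build_summarizer(model_name: str) -> Summarizer:
    return Summarizer(REGISTRY[model_name]())


print(build_summarizer("stub-a").summarize("hi"))
print(build_summarizer("stub-b").summarize("hi"))
```

The interface is kept deliberately narrow. Every model-specific quirk that leaks past `complete` is a migration cost the next time the frontier moves.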

The scaling laws are real, durable, and useful. They are a guide to spending a compute budget efficiently, not a calendar for the arrival of general intelligence. The labs that built the curves know the difference. The forecasts that go viral usually do not. Extrapolating a loss curve to a date for AGI is like extrapolating Moore’s law to predict the year computers become conscious — the trend is genuine, the inference is not.

