ARC-AGI and Benchmarks That Matter

A benchmark is a claim about what intelligence is, smuggled in as a number. When a lab reports 90 on some eval, they are not just reporting a measurement — they are asserting that the thing the eval tests is worth caring about. The history of AI benchmarks is mostly the history of that assertion turning out to be wrong, and a new benchmark being built to fix it.

If you run AI in production, this is not academic. The eval you trust decides which model you ship. Pick a saturated one and every frontier model looks identical. Pick a contaminated one and you are measuring memorization. This is a field guide to the benchmarks that still discriminate in 2026, and why measuring generality remains genuinely hard.

The saturation treadmill

The pattern repeats with grim regularity. A benchmark launches, frontier models score near random, the community calls it hard, and within a couple of years models cluster against the ceiling and the benchmark dies. MMLU, HumanEval, and MBPP are all there now — frontier models sit so close to the top that the benchmark can no longer tell two of them apart. A saturated benchmark is not solved; it is exhausted. It has lost the ability to discriminate, which is the only thing a benchmark is for.

This is why the action has moved to a harder cohort: MMLU-Pro, GPQA-Diamond, ARC-AGI-2, Humanity’s Last Exam, FrontierMath, and SWE-bench. Each was built specifically because its predecessor stopped separating the top of the pack. The treadmill is not a bug in how we evaluate — it is the cost of measuring a moving target.

There is a subtler failure hiding inside saturation, too. As scores climb toward the ceiling, the remaining questions a benchmark gets wrong are often not the hard ones — they are the broken ones: mislabeled answers, ambiguous phrasing, items with no defensible ground truth. Past about 90%, a benchmark frequently stops measuring capability and starts measuring a model’s willingness to agree with the test’s own errors. Two models a point apart at the top of a saturated leaderboard may differ only in how they handle the benchmark’s mistakes, which is not a signal anyone should ship on.
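To put a number on "cannot tell two of them apart", here is a minimal sketch of a two-proportion z-test on a leaderboard gap. The function name and the 1,000-item benchmark are hypothetical, and treating the two runs as independent samples is a simplification: both models answer the same items, so a paired test such as McNemar's would be the stricter choice.

```python
import math

def accuracy_gap_z(correct_a: int, correct_b: int, n_items: int) -> float:
    """z-score for the accuracy gap between two models scored on n_items each.

    Assumption: the two runs are treated as independent binomial samples.
    """
    p_a = correct_a / n_items
    p_b = correct_b / n_items
    # Pooled accuracy under the null hypothesis that the models are equally good.
    pooled = (correct_a + correct_b) / (2 * n_items)
    se = math.sqrt(2 * pooled * (1 - pooled) / n_items)
    if se == 0:
        # Both models got everything right (or everything wrong): no gap to test.
        return 0.0
    return (p_a - p_b) / se

# Hypothetical leaderboard: 1,000 items, models at 93% and 92%.
z = accuracy_gap_z(930, 920, 1000)
print(f"z = {z:.2f} (|z| < 1.96 means the gap is within noise at the 5% level)")
```

On these made-up numbers z comes out around 0.85, well short of the conventional 1.96 threshold, so a one-point lead near the ceiling is statistically indistinguishable from a tie even before label errors are considered.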

GPQA: hard for humans on purpose

GPQA, from David Rein and colleagues, is a clean example of building difficulty deliberately. It is 448 graduate-level questions in biology, physics, and chemistry, written by domain experts and filtered to be “Google-proof.” The validation is what makes it credible: PhD-level experts in the matching domain reach about 65% accuracy, while skilled non-experts with unrestricted web access and over thirty minutes per question manage only 34%. The gap is the signal. When a model clears the non-expert band and pushes into expert territory, you are watching something that is not retrievable by search. The Diamond subset is the hardest, cleanest slice, and it is one of the few knowledge evals still reported on frontier model cards.

ARC-AGI: the one built against memorization

Most benchmarks test knowledge. François Chollet’s ARC-AGI tests something deliberately different: the ability to infer a novel transformation rule from a handful of examples and apply it to an unseen grid. Every task is designed to be resistant to memorization — the rule is new, so there is no answer to recall. This is the closest the field has to a direct probe of fluid, on-the-fly reasoning rather than stored pattern matching.

For years it barely moved. ARC-AGI-1 went from 0% with GPT-3 in 2020 to roughly 5% with GPT-4o in 2024. Then in December 2024 OpenAI’s o3 posted a step change: per the ARC Prize writeup, a high-compute configuration hit 87.5%, with the low-compute setting around 75.7%. That was the moment the “LLMs can only interpolate” thesis took real damage. It was also expensive — the low-compute runs alone cost on the order of $17 to $20 per task, a detail that matters enormously and that the headlines dropped.

The response was ARC-AGI-2, and the response to that tells you how brittle these gains are. When ARC-AGI-2 launched in March 2025, pure LLMs scored 0% and the public reasoning systems managed only single digits. The ARC Prize 2025 analysis reports that all three families of approach (program synthesis, neuro-symbolic, and pure neural) took a 2-3x performance drop moving from v1 to v2. A benchmark that goes from “solved” to near zero with a single design iteration is telling you that the previous score measured something narrower than general reasoning. By late 2025 the strongest verified entries had climbed back into the 30-50% range, but at per-task costs that make the economics, not the capability, the binding question.

Contamination: when the test leaks into training

Saturation is the visible failure mode. Contamination is the invisible one, and it is worse. A benchmark is contaminated when its questions — or text closely derived from them — end up in the training corpus, so the model recalls the answer instead of reasoning to it. You cannot see this from the score. A contaminated 95 and an earned 95 look identical on a leaderboard.

The scale of the problem became undeniable with SWE-bench. SWE-bench by Jimenez et al. is one of the best-designed evals we have — real GitHub issues from open-source Python repos, where the model must produce a patch that passes the project’s actual unit tests. It is grounded, verifiable, and close to a task engineers actually pay for. And it still leaked: audits found that frontier models could reproduce verbatim gold patches and problem-statement specifics for some Verified tasks. The maintainers responded by moving toward harder, fresher variants. That is the right reflex — the only durable defense against contamination is novelty, which means a benchmark’s useful life is bounded by how fast it leaks.

This is the uncomfortable core of evaluation. The better and more cited a benchmark becomes, the more its data spreads across the web, and the faster it contaminates the very models it is meant to test. Popularity is self-defeating.

Detecting contamination is its own arms race, and none of the methods are clean. You can hold out a private test set, but then nobody else can reproduce your numbers. You can check whether a model assigns suspiciously high probability to verbatim benchmark text, but modern training pipelines deduplicate and paraphrase, blurring the signal. You can build deliberately fresh variants — questions written after a model’s training cutoff — but that only buys you one cycle before the new set leaks too. The practical upshot is that any third-party benchmark score should be read with a discount that grows with the benchmark’s age and fame. A two-year-old, heavily cited eval is almost certainly partially memorized by every frontier model, whatever the leaderboard says.
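One cheap check in this family is easy to sketch. It is a simpler cousin of the probability-based test described above: rather than inspecting token probabilities, it measures how much of a held-back reference text (a gold patch, say) a model reproduces word for word. The function names and the 8-token window are assumptions chosen for illustration, and paraphrased leakage will slip straight past it.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All n-token windows in the text, using lowercase whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(model_output: str, reference: str, n: int = 8) -> float:
    """Fraction of the reference's n-grams that appear verbatim in the model output."""
    ref = ngrams(reference, n)
    if not ref:
        # Reference is shorter than n tokens: nothing to compare.
        return 0.0
    return len(ref & ngrams(model_output, n)) / len(ref)

# Hypothetical reference text and two hypothetical model completions.
reference = "return the cached value if the key exists otherwise compute and store it"
copied = "Sure: return the cached value if the key exists otherwise compute and store it"
reworded = "look up the key first and only compute the value when it is missing"

print(verbatim_overlap(copied, reference))    # full reproduction of the reference
print(verbatim_overlap(reworded, reference))  # no shared 8-token window
```

A high score on text the model was never shown in the prompt is a reason to investigate, not proof of contamination; short or formulaic references can collide by chance.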

Why generality resists measurement

Step back and the deeper problem comes into focus. Any single benchmark measures one slice of behavior. Generality — the thing “AGI” is supposed to name — is precisely the ability to perform on tasks not in your distribution. By definition, you cannot put it in a test set, because the moment it is in a test set it is in-distribution and someone optimizes against it.

This is why ARC-AGI is interesting and also why it is not a finish line. It probes one facet of fluid reasoning. Clearing it is necessary, not sufficient. The field has implicitly accepted this by moving to suites: HLE, FrontierMath, ARC-AGI-2, GPQA-Diamond, SWE-bench, agentic harnesses like τ-bench, and tool-use evals like BFCL. No one of them is generality. The portfolio, plus the rate at which models clear new entries without being trained on them, is the best proxy we have — and it is still a proxy.

What to actually do in production

The lesson for teams shipping AI is not to chase the leaderboard. It is to build the eval the leaderboard cannot.

Build a private holdout. The single most reliable thing you can do is maintain an internal eval drawn from your own data and tasks, never published, never sent anywhere it can leak. For a Hospital Management System that is real de-identified clinical reasoning cases; for a School ERP it is your actual structured-extraction and scheduling tasks. A private holdout is contamination-proof by construction, and it measures the thing you actually sell.
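One practical risk with a private holdout is cases drifting into prompts, few-shot examples, or fine-tuning data over time. A possible guard, sketched here under assumptions, is to assign cases to the holdout by hashing a stable case ID, so membership never depends on who ran the split or when. The `is_holdout` name, the salt, and the 20% fraction are all hypothetical.

```python
import hashlib

def is_holdout(case_id: str, holdout_fraction: float = 0.2, salt: str = "eval-v1") -> bool:
    """Deterministically decide whether a case belongs to the private holdout.

    The same case_id and salt always give the same answer, so a held-out case
    cannot silently migrate into development or training data on a re-split.
    """
    digest = hashlib.sha256(f"{salt}:{case_id}".encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < holdout_fraction

# Hypothetical internal case IDs.
case_ids = [f"case-{i}" for i in range(1000)]
holdout = [cid for cid in case_ids if is_holdout(cid)]
print(f"{len(holdout)} of {len(case_ids)} cases held out")
```

Anything that builds prompts or training sets can then call `is_holdout` and refuse held-out IDs, which keeps the boundary enforceable in code rather than by convention.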

Weight verifiable evals. GPQA-Diamond and SWE-bench tend to track production performance better than free-form knowledge-recall benchmarks because they demand a checkable output: a keyed answer in one case, a patch that passes the project’s tests in the other. Prefer evals where correctness is mechanically decidable.
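"Mechanically decidable" can be made concrete with a small harness in which every case carries its own pass/fail check. This is a minimal sketch, not a reference design: `EvalCase`, `run_eval`, and the stub model are hypothetical names, and a real harness would add logging, retries, and sandboxing for anything that executes model output.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    case_id: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is correct

def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the model and return the fraction that pass their check."""
    if not cases:
        return 0.0
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)

def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; answers one arithmetic prompt only."""
    return "4" if "2 + 2" in prompt else "unknown"

cases = [
    EvalCase("arith-1", "What is 2 + 2? Reply with the number only.",
             lambda out: out.strip() == "4"),
    EvalCase("geo-1", "Capital of France? Reply with the city only.",
             lambda out: out.strip() == "Paris"),
]

accuracy = run_eval(stub_model, cases)
print(f"accuracy = {accuracy:.2f}")
```

Because the check is a function rather than a human judgment, the same cases can be rerun against every candidate model with no grader drift.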

Track cost per solved task, not just accuracy. The o3 ARC-AGI result is the whole argument in one data point: a number that is meaningless without its price. Any benchmark figure you act on should carry its compute cost, because in production the economics decide what ships.
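The arithmetic is simple enough to write down. This sketch divides cost per attempted task by accuracy to get cost per solved task, using the o3 low-compute figures quoted earlier (roughly $17 to $20 per task at 75.7%); the function name is an illustration, not an established metric definition.

```python
def cost_per_solved_task(cost_per_task: float, accuracy: float) -> float:
    """Expected spend per correct answer: cost per attempt divided by the success rate."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_task / accuracy

# o3 low-compute on ARC-AGI-1, per the figures cited in this article.
low = cost_per_solved_task(17.0, 0.757)
high = cost_per_solved_task(20.0, 0.757)
print(f"${low:.2f} to ${high:.2f} per solved task")
```

On those inputs the effective price lands at roughly $22 to $26 per solved task, which is the figure to compare against whatever a solved task is worth in your product.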

Benchmarks are how the field argues about what intelligence is. Treat them as arguments, not measurements — read the methodology, check for leakage, and never let someone else’s test set stand in for your own.
