A field that cannot measure itself does not, in any meaningful sense, know whether it is progressing. The artificial-intelligence community’s principal mechanism for measuring itself — its evaluation infrastructure, the body of benchmarks against which models are scored and the methodology by which the scoring is conducted — has, in the twelve months between mid-2025 and mid-2026, undergone a structural transition that is more consequential than any single model release of the period. The transition is the move from a leaderboard model of evaluation, which dominated the field from roughly 2018 to 2024, through an intermediate agentic model, which dominated 2024 and 2025, to an emerging economic model that is, by mid-2026, the regime the major labs are reorganising their evaluation around.
This post is a chronicle of that twelve-month transition. It is not a corrective to the trajectory series of earlier this month or to the labor-displacement essay of last week; it is a different kind of post, treating one specific technical sub-discipline of the field across a specific twelve-month window. The argument it makes is that the eval situation the field has been managing for the past two years is a bigger story than any individual model release or any individual benchmark’s saturation, because eval is how the field knows — or fails to know — whether the capability gains the labs report are real.
The Leaderboard Inheritance
The evaluation infrastructure the field inherited from the pre-language-model era was, by any reasonable accounting, a leaderboard infrastructure. A benchmark was a fixed dataset of inputs paired with reference outputs; a model’s score was the proportion of inputs on which it produced a reference-matching output; and the field’s collective sense of progress was the proportion of leaderboards on which the latest model had pushed the state of the art forward.
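In code, the whole regime fits in a few lines. Here is a minimal sketch of the scoring loop, using exact-match grading and illustrative field names rather than any particular harness's API:

```python
def leaderboard_score(model, benchmark: list[dict]) -> float:
    """Leaderboard-era evaluation: the fraction of items on which the
    model's output matches the reference answer.

    `model` is any callable from a prompt string to an answer string;
    the "prompt" / "reference" field names are illustrative.
    """
    correct = 0
    for item in benchmark:
        prediction = model(item["prompt"])
        if prediction.strip() == item["reference"].strip():
            correct += 1
    return correct / len(benchmark)
```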
The principal benchmarks of the period — MMLU (the Massive Multitask Language Understanding benchmark of 2020), HellaSwag (2019), the AI2 Reasoning Challenge (2018), TruthfulQA (2021), GSM8K (the eight-thousand grade-school math word problems of 2021), HumanEval (the OpenAI code-completion benchmark of 2021), the BIG-Bench suite (Google’s 2022 collection of two-hundred-plus narrow tasks) — were academic artefacts of considerable internal logic. Each was a multiple-choice, short-answer, or unit-test-graded exercise in one or another cognitive ability that had previously been considered the property of human beings; each could be scored automatically against reference answers or reference tests; each provided the field with a leaderboard against which the major labs could compare their latest models.
The leaderboard regime worked, by any defensible measure, through the period 2018 to roughly 2023. Models advanced steadily up the benchmarks; the differences between models were largely commensurable; the field’s collective story about progress could be told in numbers that anyone in the conversation could verify. GPT-3.5 was approximately 70% on MMLU; GPT-4 was approximately 86%; the estimated expert-human baseline was approximately 90%. The Claude and Gemini series climbed similar curves. The story was legible, and the field told it without much controversy.
The trouble was that the regime ceased to work, in the way it had worked through 2023, at roughly the moment the better frontier models began to saturate the principal benchmarks. By the middle of 2024, GPT-4-class models and their Claude and Gemini equivalents were within a percentage point or two of the expert-human baseline on MMLU, were saturating HumanEval and GSM8K, were at the ceiling of HellaSwag, and were producing benchmark scores on the broader suites that the field’s old habits of comparison could no longer make sense of. A 0.4-percentage-point gain on MMLU between two model versions, when both were within striking distance of the human ceiling, was no longer the legible signal of progress it had been when the field was at the 70-percent mark.
A second and related problem was the contamination problem — the discovery, made progressively across 2023 and 2024 and substantially documented in the academic literature by the end of 2024, that the benchmark datasets had often leaked into the training data of the major models, with the result that high benchmark scores were sometimes partly an artefact of memorisation rather than reasoning. The field’s response, when it could be organised, was to produce held-out and verified versions of the benchmarks (the SWE-bench Verified set published by OpenAI in August 2024 being the prototype: a subset of SWE-bench problems that had been human-confirmed as both solvable and correctly specified, allowing controlled measurement against a clean reference). But the contamination concern, once raised, was hard to contain, and it eroded the field’s confidence in the leaderboard regime to a degree that the regime did not recover from.
By the end of 2024, the broader academic AI evaluation community had reached an effective consensus: the leaderboard model was no longer adequate for measuring frontier-model progress, and a different evaluation regime would be needed. The question was what regime, and the answer that emerged through 2024 and consolidated through 2025 was the agentic one.
The Agentic Turn
The agentic evaluation regime took as its principal innovation the move from single-turn answer-matching (the leaderboard regime) to multi-turn task completion in a realistic environment. Rather than asking a model to choose A, B, C, or D, the new benchmarks asked the model to do something — to fix a bug in a real GitHub repository, to navigate an operating system to accomplish a task, to browse a synthetic web environment to find a piece of information, to operate a tool chain to complete an end-to-end workflow.
The flagship benchmark of the agentic regime was SWE-bench, introduced in October 2023 and substantially refined as SWE-bench Verified in August 2024. The setup is concrete and easily described: the model is given the full source code of a real open-source Python repository at a specific commit, plus the text of a real GitHub issue describing a bug or a feature request, and is asked to produce a code patch that, when applied to the repository, causes the issue’s associated test suite to pass. The reference is the actual human-authored patch that resolved the issue in the actual repository history. The metric is the resolved rate: the percentage of issues for which the model’s patch passes the test suite. The 500-task SWE-bench Verified subset became, by the middle of 2025, the field’s most-cited single capability benchmark, and the trajectory of model scores on it became the field’s most-discussed progress story.
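Concretely, the SWE-bench evaluation loop is a patch-and-test harness. The sketch below shows the shape of it; the official harness runs each task in an isolated container against the task's designated tests, so treat the function signature and arguments here as illustrative rather than the published interface:

```python
import subprocess

def issue_resolved(repo_dir: str, model_patch: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated patch to a checked-out repository and run
    the tests associated with the issue. Returns True if they pass."""
    # A patch that does not apply cleanly counts as a failure.
    apply = subprocess.run(
        ["git", "apply", "-"], cwd=repo_dir,
        input=model_patch, text=True, capture_output=True,
    )
    if apply.returncode != 0:
        return False
    # Run the issue's test command (e.g. a pytest invocation).
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0

def resolved_rate(outcomes: list[bool]) -> float:
    """The benchmark's headline metric: the fraction of issues resolved."""
    return sum(outcomes) / len(outcomes)
```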
The numbers on SWE-bench Verified are instructive about the speed at which the agentic regime moved. Claude 3.5 Sonnet, on its release in June 2024, scored approximately 33% on SWE-bench Verified — already a substantial advance on the field’s prior state of the art (the original SWE-bench best-of-2023 numbers were in the low single digits). Claude 3.5 Sonnet’s October 2024 update brought it to approximately 49%. Claude 3.7 Sonnet of February 2025 reached approximately 63%. Claude 4 Opus of May 2025 was at approximately 72%. By the end of 2025, with the Claude 4.5 generation and the GPT-5 / o4 series, the frontier was over 80%, and by the middle of 2026 the verified subset was within a few percentage points of saturation — at which point the benchmark’s discriminative power began to fail in the same way the leaderboard benchmarks had failed before it.
The other agentic benchmarks of the period followed similar arcs. OSWorld — the operating-system-interaction benchmark introduced in April 2024 by a Hong Kong / Salesforce / Carnegie Mellon consortium, in which the model is given a virtual machine and asked to complete tasks like installing software, configuring settings, and manipulating files — moved from approximately 12% (Claude 3.5 Sonnet, mid-2024) to approximately 50%+ (the better Claude 4 generation, late 2025) over an eighteen-month period. GAIA (the General AI Assistant benchmark of November 2023) saturated even faster, moving from approximately 9% (Llama 2-class baseline) to over 70% across the same eighteen months. τ-bench (the tool-use-agent benchmark introduced by Sierra in June 2024), Cybench (the cybersecurity capture-the-flag benchmark), MLE-bench (the OpenAI ML-engineering benchmark of October 2024), BrowseComp (OpenAI’s web-browsing benchmark), HLE (Humanity’s Last Exam, January 2025) — each was, in its own way, a discrete benchmark that the major labs ran against each new model release and that the field followed as a signal of capability progress.
The agentic regime was, on its face, a substantial improvement on the leaderboard regime. It measured task completion in a realistic environment rather than answer-matching in an academic format. It was harder to contaminate (the model had to actually do the task, not retrieve a memorised answer). It produced numbers that legibly tracked the capabilities of interest to the buyer of an AI agent. And for a roughly eighteen-month period running from late 2024 through early 2026, the agentic regime did real work for the field as the consensus measure of frontier progress.
The trouble with the agentic regime — the same trouble the leaderboard regime had had before it, but on a substantially faster clock — was that the agentic benchmarks saturated faster than the leaderboard benchmarks before them. MMLU had taken roughly four years to go from its 2020 introduction to its 2024 ceiling; SWE-bench Verified took roughly twenty-two months from its August 2024 verified release to its mid-2026 saturation. OSWorld took roughly twenty-four months from its April 2024 introduction to a similar saturation point. The agentic regime did real work for the field, but it was a regime with a substantially shorter half-life than the leaderboard regime it replaced — and by mid-2026 the field was, for the second time in as many years, in a position of needing a new evaluation regime to replace one that had ceased to discriminate.
The Economic Pivot
The evaluation regime that has emerged in the past six months, and that is, by mid-2026, the dominant frame at the major labs for measuring frontier capability, is what one might call the economic regime. It takes as its principal innovation the move from task-completion-as-binary-success (the agentic regime) to task-completion-with-an-attached-economic-quantity (cost of compute, time horizon, dollar value of the output, success rate at a market wage, and a half-dozen related quantities). The shift is structural: the agentic regime measured whether the model could do the task; the economic regime measures how much it costs and how much it is worth.
The principal exemplar of the new regime is the work of METR (Model Evaluation and Threat Research), the independent evaluation organisation that has, since 2024, produced the most-cited body of work on what is now called the time-horizon of model capability. The METR methodology is concrete: for each model and each domain, METR measures the length of time it would take a competent human professional to complete tasks that the model itself can complete with 50% reliability. The result is expressed as a task-length in minutes or hours. Claude 3 Sonnet (early 2024) sat at approximately two minutes of task length; Claude 3.5 Sonnet (October 2024) at approximately five minutes; Claude 3.7 Sonnet at approximately fifteen minutes; Claude 4 Opus at approximately one hour; and the early-2026 frontier models (Claude 4.5, GPT-5, the equivalent Gemini-Ultra-class) are at approximately three to four hours of task length, with extrapolation curves suggesting that the doubling time of the metric is somewhere between four and seven months. By the end of 2026, on the extrapolation curve, the frontier models will be at roughly twelve to fifteen hours of task length — which is to say, will be reliably completing tasks that take a human professional more than a working day.
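The extrapolation in that last sentence is ordinary exponential arithmetic. A small sketch, using round illustrative numbers of my own choosing (a 3.5-hour horizon at the start of 2026 and a doubling time near the midpoint of the quoted four-to-seven-month range) rather than METR's published figures:

```python
def projected_horizon(h0_hours: float, months_elapsed: float,
                      doubling_months: float) -> float:
    """Extrapolate a 50%-reliability time horizon forward in time,
    assuming the exponential trend holds over the interval."""
    return h0_hours * 2 ** (months_elapsed / doubling_months)

# ~3.5-hour horizon at the start of 2026, 5.5-month doubling time,
# eleven further months of progress through the end of the year:
print(projected_horizon(3.5, 11, 5.5))  # -> 14.0 hours
```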
The METR work is the cleanest articulation of the economic regime’s core argument. It does not measure scores on benchmarks; it measures the time-equivalent value of model output, which is a quantity directly translatable to wage-replacement value. A model that can reliably complete one-hour tasks is one that can replace a fraction of an hour of a paid worker’s time per query; a model that can reliably complete one-day tasks is one that can replace a fraction of a day’s wages. The metric tracks what the buyer of an AI agent actually cares about, and it generalises across domains in a way that the agentic benchmarks did not.
Alongside METR, several other economic-regime evaluation tracks have emerged in the past twelve months. SWE-Lancer, an OpenAI benchmark introduced in early 2025, takes real Upwork software-engineering tasks (with the actual dollar amounts the human freelancers were paid) and measures the model’s success rate weighted by task value. REPLBench and the related AI in the Workplace studies measure model performance on the actual workflows of named professional categories (paralegal document review, junior associate contract markup, financial analyst pitch-book construction, code review for a substantial open-source project). The Anthropic economic agent evaluation track, launched in late 2025, focuses specifically on the productivity ratio between AI-assisted and unassisted human professionals across a battery of measured tasks. And OpenAI’s GDPval methodology, published in late 2025, attempts to express model capability in terms of performance on real work deliverables drawn from the occupations that contribute most heavily to US GDP.
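The arithmetic behind a value-weighted metric of the SWE-Lancer kind is simple to state. A minimal sketch, with hypothetical task records rather than the benchmark's actual data format:

```python
def value_weighted_success(tasks: list[dict]) -> float:
    """Dollar-weighted success rate: the share of the total posted payout
    attached to tasks the model resolved. Field names are illustrative."""
    earned = sum(t["payout_usd"] for t in tasks if t["resolved"])
    total = sum(t["payout_usd"] for t in tasks)
    return earned / total

tasks = [
    {"payout_usd": 250.0, "resolved": True},    # small bug fix
    {"payout_usd": 4000.0, "resolved": False},  # multi-week feature build
    {"payout_usd": 750.0, "resolved": True},    # mid-sized refactor
]
print(value_weighted_success(tasks))  # 0.2: the model earned 20% of the value on offer
```

The weighting is the point: an unweighted resolved rate would score this hypothetical model at two-thirds on the same three tasks, and the gap between the two numbers is precisely the gap between agentic and economic accounting.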
The economic regime is, in its current early-2026 form, considerably more fragmented than the leaderboard or agentic regimes were. There is no single dominant economic benchmark in the way SWE-bench Verified was the dominant agentic benchmark or MMLU was the dominant leaderboard benchmark. Different labs use different economic measures; different academic groups produce different economic-evaluation studies; the field’s collective story about progress is, by mid-2026, less legible than it was in either of the prior regimes. This fragmentation is partly a feature (the economic regime measures things the buyer actually cares about, and different buyers care about different things) and partly a bug (the field’s ability to compare models against each other has degraded substantially in the past year). The next twelve months will, on the present trajectory, see the field converge on one or two economic benchmarks as the dominant references — but as of mid-2026 the convergence has not yet happened.
The Speed Problem
There is a structural problem visible across the three regimes that deserves to be named directly. Each evaluation regime is saturating faster than its predecessor.
The leaderboard regime ran for roughly six years (2018 to 2024) before its principal benchmarks saturated. The agentic regime ran for roughly two years (2024 to 2026) before SWE-bench Verified and the comparable benchmarks approached saturation. The economic regime, on present trajectory, looks likely to run for roughly twelve to eighteen months before METR’s time-horizon metric approaches the practical ceiling at which it can no longer discriminate between frontier models (which will happen when models can reliably complete arbitrary-length tasks). At each regime change, the field has gained a new tool for measuring progress; at each regime change, the new tool has been useful for a shorter period than its predecessor.
The structural cause of this acceleration is the same structural cause of the broader AI capability acceleration the field has been managing for the past four years. When capability doubles every four to seven months (the METR finding for the recent period), any fixed benchmark gets exhausted on a timescale comparable to its development. A benchmark that takes eighteen months to develop and validate is, by the time it is ready for field use, halfway to saturation by frontier models. A benchmark that takes six months to develop is, by the time of release, already partially saturated. The field’s traditional academic-paper-and-leaderboard cycle, in which a benchmark is developed over a year or two, published, and then used for several years of comparative evaluation, has been compressed to a degree that makes the cycle’s central work — sustained comparison across model generations — substantially harder than it was five years ago.
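To make that arithmetic concrete, under the strong simplifying assumption that the solvable fraction of a fixed benchmark doubles on the same clock as the underlying capability (an illustration, not an empirical claim), a benchmark's useful life is only a couple of doublings:

```python
import math

def months_to_saturation(score_at_release: float, ceiling: float,
                         doubling_months: float) -> float:
    """Months until the frontier score reaches the benchmark's practical
    ceiling, assuming the solvable fraction doubles with capability
    (a strong simplification used purely for illustration)."""
    doublings = math.log2(ceiling / score_at_release)
    return doublings * doubling_months

# A benchmark released when the frontier solves 20% of it, with a
# practical ceiling around 85% and a five-month doubling time:
print(months_to_saturation(0.20, 0.85, 5))  # ~10.4 months of useful life
```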
The field’s response to the speed problem, in 2025 and 2026, has been threefold. First, it has shifted toward continuously-updated benchmarks (live leaderboards that incorporate new tasks regularly, rather than fixed datasets that age) — the LiveBench project is the prototype. Second, it has shifted toward capability-elicitation methods that test models against the upper end of their known capability rather than against a fixed reference — the capability elicitation literature emerging in 2025 around frontier safety evaluation is the most rigorous example. Third, it has accepted, with varying levels of explicitness across the major labs, that the field’s collective ability to make precise comparisons across model generations has degraded, and has substituted vibes-based qualitative assessment (the field’s term, not mine) for some fraction of the work that the saturated benchmarks used to do.
The vibes-based-assessment shift is the most uncomfortable of the three responses to acknowledge. The major labs, in their model-release announcements of the past six months, have increasingly resorted to qualitative claims about model capability that the field has no rigorous mechanism for verifying — “feels like a real junior engineer,” “approaching the capability of an experienced researcher,” “the first model that can sustain a multi-hour debugging session without losing the thread.” These claims are, in the absence of benchmarks that still discriminate at the frontier, the best the field can currently offer; they are also, by any reasonable methodological standard, considerably less rigorous than the leaderboard claims of three years ago. The field is, in mid-2026, in the position of having to choose between rigorous-but-saturated benchmarks and qualitative-but-unverifiable assessment, and the choice is not satisfactory in either direction.
What This Means
The observations I should like to leave are three.
The first is that evaluation is, more than any single other technical activity, the way the AI field knows whether progress is real. Without rigorous evaluation, the field’s collective story about capability progress becomes the marketing department’s story about capability progress, and the resulting confusion is impossible to escape without spending considerable time in conversations with the people actually using the models. The evaluation crisis of the past two years has, in important ways, already pushed the field into the marketing-department’s-story regime, and the consequences are visible in the increasing divergence between the published model-release claims and the experienced quality of the deployed models.
The second is that the speed problem is unlikely to resolve on the present trajectory. As long as capability continues to double on a four-to-seven-month timescale, any benchmark designed in 2026 will be saturated by the frontier models of 2027 or 2028; any benchmark designed in 2027 will be saturated by 2028 or 2029. The field’s traditional cycle of develop-validate-publish-compare cannot keep pace with the cycle of model release, and the gap is, on present trajectory, widening rather than narrowing. The shift to economic evaluation is partly a response to this — economic quantities scale continuously rather than topping out at a benchmark ceiling — but the economic-evaluation regime is itself early and fragmented, and its convergence to a stable set of references is at least a year away.
The third is the one I should like to leave plainly. The most consequential development in artificial intelligence in the twelve months between mid-2025 and mid-2026 was not any individual model release. It was the structural transition in how the field measures itself, and the structural problem that the transition has not yet solved. A field that knows what it can build but cannot agree on how to measure what it has built is a field in which the loudest voices — the labs with the largest marketing budgets, the academics with the most-publicised claims, the commentators with the most-followed feeds — are at a substantial advantage over the more careful voices. The eval crisis is, in this sense, a crisis of the field’s epistemic discipline as much as of its technical methodology, and the resolution of one will not be possible without the resolution of the other.
The economic-evaluation regime is the field’s current best attempt at building back the rigorous comparison that the leaderboard and agentic regimes have lost. Whether it will hold up to the same acceleration that broke its predecessors is, on present evidence, an open question — and a question whose answer will, in some real sense, determine whether the field’s next year of capability gains can be reliably described or whether it will have to be, in the diminishing-but-still-present sense, taken on the labs’ word.
