These days, it is popular to say that the tests we use to assess AI systems are bad. Multiple recent headlines capture the feeling: “Maybe we should ignore AI benchmarks for now”, “Are Better Models Better?”, “Everyone Is Judging AI by These Tests. But Experts Say They’re Close to Meaningless”.
I think this sentiment is justified. Having worked in the (human-) assessment industry for the last few years, I think the current state of AI evaluation is not great, and I have a hunch as to why. From my standpoint, our publicly-available AI tests are lacking because no actor has the incentive to make them very good. Put another way: who cares if your eval is a little bit better? Does that matter to anyone, enough that they will pay you more for it?
In this post I briefly outline how the traditional assessment industry operates, describe how AI evals are different, and try to identify key factors that could change things. I will take for granted that there is a problem with the quality of AI evaluations, and will set aside the specific concerns that others have raised about them, like technical flaws and gaps between what AI developers claim to be measuring and what their tests actually tell us. I think these problems are likely real, and that they won’t be fixed unless the upstream incentive design problem is addressed.
Exam Economics
When we want to assess what someone knows, or what they can do, or what traits they have, we sometimes turn to standardized tests. Testing costs money. At a high level, those costs break down into costs to develop the test and costs to administer it. Typically, examinees or their sponsors cover these costs by paying a fee for each attempt at the test, since they stand to benefit from the results. Both high volume and high prices are important for recouping these costs. As an example, roughly 2 million examinees take the SAT every year, each paying around $70 in fees per sitting. Although low-volume and/or inexpensive tests also exist, the unit economics of standardized testing generally rely on a given test having many thousands of examinees per year, each paying a high fee ($50–$1,000 in the USA) per sitting.
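To make those unit economics concrete, here is a back-of-the-envelope sketch in Python using the SAT figures above. The fixed development budget in it is a made-up placeholder, not an actual College Board number; the point is just that high volume spreads fixed costs thin.

```python
# Back-of-the-envelope test economics, using the SAT figures cited above.
# The development budget below is an illustrative placeholder, not an
# actual College Board figure.

examinees_per_year = 2_000_000   # roughly 2 million SAT sittings per year
fee_per_sitting = 70             # roughly $70 in fees per sitting

annual_fee_revenue = examinees_per_year * fee_per_sitting
print(f"Annual fee revenue: ${annual_fee_revenue:,}")                # $140,000,000

# Fixed costs are spread across every sitting, so volume matters a lot:
assumed_annual_dev_budget = 20_000_000   # hypothetical development budget
dev_cost_per_sitting = assumed_annual_dev_budget / examinees_per_year
print(f"Development cost per sitting: ${dev_cost_per_sitting:.2f}")  # $10.00
```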
In spite of this, or perhaps because of it, tests like these tend to come from tax-exempt nonprofits or not-for-profits. Up until 2024, the SAT, ACT, GRE, GMAT, and LSAT were all developed and administered by 501(c)(3)s. This kind of status allows organizations like these to visibly signal that they will try to act as impartial assessors, untainted by profit motive. Impartiality is a double-edged sword: it is good for credibility and rigor, but bad for market segmentation. While airlines can improve their business by charging corporate travelers more for an upgraded experience, test developers and administrators must offer a “standardized” service to all examinees. Fundamentally, an impartial producer in the testing industry is unable to sell what its customers want most, which is a high test score.
Since these organizations are nonprofits, we actually have a decent sense of how they operate, including their cost structures. If you are interested in details about what these costs look like, the rest of this section will talk more about them. If you just want the AI-specific stuff, jump down to the next section, “Software Examinees are Special”.
Test Development
Test development costs scale with how important the test is, with the number of items (test questions, essay prompts, etc.) on it, and with how frequently the items must be updated. High-stakes tests like those used in university admissions and professional licensure involve a very intensive development process, since the test developer can be sued if poor test design causes examinees to receive low scores. For instance, an LSAT examinee might bring a case alleging, “Because of these bad items, I’ll miss out on becoming a lawyer!” When accounting for initial drafting, revision, and quality control, each item on a high-stakes test costs hundreds or thousands of dollars to create. As one datapoint, back in 2007 an employee working on the GMAT stated that developing a single official item cost on the order of $1,500–$2,500.
The process is roughly as follows. Subject matter experts—either full-time staff or outside experts working on a contract basis—write the bank of items based on a test specification. Test developers will typically also have dedicated psychometricians on staff to ensure that each item is well-constructed, to help measure how the items perform in the field, to analyze the results, and finally to arrange the bank into one or more tests. Items go through rounds of editing, statistical analysis, and revision before they ever appear as actual scored items. For every item that survives unchanged to this point, many others are discarded or rewritten during the intermediate phases.
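As a rough illustration of why each surviving item ends up costing so much, here is a minimal sketch of the pipeline above; every number in it is a hypothetical placeholder, chosen only so that the result lands near the per-item range cited earlier.

```python
# Sketch of item development costs: many drafted items never survive
# editing and field testing, so their cost is carried by the ones that do.
# All figures are hypothetical placeholders.

drafting_cost_per_draft = 600      # hypothetical cost to write and edit one draft item
survival_rate = 0.35               # hypothetical fraction of drafts that become scored items
analysis_cost_per_survivor = 300   # hypothetical field-testing and statistical-analysis cost

cost_per_scored_item = drafting_cost_per_draft / survival_rate + analysis_cost_per_survivor
print(f"Effective cost per scored item: ${cost_per_scored_item:,.0f}")  # ~$2,014
```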
Test Administration
Test administration is the process of facilitating (or “delivering”) a test, scoring responses, and sending out results. It involves significantly different costs and operations, enough so that for decades, major test developers have outsourced this work to other companies that specialize in test administration.
Test delivery costs can be large and scale with the number of examinees, since that determines the scale of the logistics needed to produce the test materials, get examinees into and out of the testing environment, and proctor the test itself.
Costs from test scoring depend on the kind of item to be scored, and scale with the number of examinee responses. There are two broad kinds of items, each of which requires a different scoring process: selected-response items, such as multiple-choice questions, and constructed-response items, such as essay questions. Multiple choice is the most common format because it is easy to score: paper tests that use multiple-choice items have long been easy to score with the right technology, and digital multiple-choice items are easier still. Constructed-response items most commonly need to be graded by humans, in a process that involves training a large temporary staff in how to apply a specific grading rubric, having multiple graders independently score each response with that rubric, and then additional human review to resolve any discrepancies. All of that is very labor intensive.
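Here is a minimal sketch of that constructed-response workflow: two trained graders score each response independently against the rubric, and any response where their scores diverge beyond a threshold is routed to an additional reviewer. The function name, score scale, and discrepancy threshold are illustrative, not taken from any real scoring program.

```python
# Minimal sketch of a double-grading workflow with human adjudication.
# Names, the score scale, and the threshold are illustrative only.

def score_response(response_id, grader_a_score, grader_b_score, max_discrepancy=1):
    """Return a final score, or flag the response for adjudication."""
    if abs(grader_a_score - grader_b_score) <= max_discrepancy:
        # The independent scores agree closely enough: average them.
        return {"response_id": response_id,
                "final_score": (grader_a_score + grader_b_score) / 2,
                "needs_adjudication": False}
    # The scores diverge: a senior reviewer must resolve the discrepancy.
    return {"response_id": response_id,
            "final_score": None,
            "needs_adjudication": True}

print(score_response("essay-001", 4, 5))   # close agreement -> final score 4.5
print(score_response("essay-002", 2, 5))   # discrepancy -> flagged for review
```

Every step in that loop involves paid human time, which is why constructed-response scoring costs scale so steeply with the number of responses.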
There are also costs involved in securely reporting results of a test to its intended recipient, but those are relatively small contributors to test administration costs, so I will ignore them.
Software Examinees are Special
How does AI testing compare? A key difference is that the number of players at the leading edge of AI is very small, and may remain so in the near future. While there are millions of prospective college students trying to score high on the SAT each year, the number of leading developers is more like a dozen, with each of them only producing a handful of new AI systems each year. If this holds, there may only be a few dozen different “software examinees” competing on high-stakes tests.
Reduced Test Administration Costs
All else equal, most of the traditional administration costs should be drastically reduced in the case of AI tests. Having fewer independent test takers pushes down the main driver of test administration costs, which is examinee headcount. Moreover, there is no need to coordinate physical logistics, which makes delivering the test more comparable to delivering a remote test.
Administration costs are not entirely eliminated, though. Shifting the test from a physical environment that the administrator controls to a virtual environment that the AI company controls makes it harder to ensure the testing conditions are standardized, which is what valid measurement requires.
There are a few ways in which examinees being software makes it hard to standardize test conditions. Software can be sped up and run in parallel, so “time limits” measured in wall-clock time are less meaningful. Human examinees largely have similar perceptual and motor abilities within a population and can receive instructions in the same formats, although tests are sometimes adapted for accessibility; software systems, by contrast, can differ widely in their modalities and in how they receive instructions. Software running on remote servers (as opposed to on admin-controlled machines, as in Kaggle code competitions) may use the Internet or retrieval databases in the process of responding, so enforcing anything like a “closed-book, no devices” test would require more expensive, intrusive forms of monitoring. Dealing with these issues should push up the per-examinee cost of administering AI tests.
Similar or Worse Test Development Costs
When AI systems were not very capable and progress was slow, there were many different simple tasks that AI researchers could pick to develop tests for, refreshing them only occasionally as the field plodded along. But this low-hanging fruit could only be picked for so long. It turns out that we humans are actually quite bad at many of the things we do. Tests measuring AI abilities for which we overestimated our own skill level are now rapidly being saturated and becoming less useful.
Now that AI systems are quite capable, there are fewer tasks along which to differentiate them. The kinds of items that the best AI systems are still bad at are often more expensive to create, because many represent tasks that require rare expertise. For a recent AI evaluation called “Humanity’s Last Exam”, Scale AI and the Center for AI Safety offered between $500 and $5,000 apiece for the best multiple-choice questions submitted by human experts around the world. That’s approaching the cost of “real” test items! These difficulties may be what is motivating AI test developers to shift their focus toward puzzles and convoluted synthetic tasks that are easy to create and easy to administer, even if they are not economically valuable.
The pace of change in AI is also quite a bit faster, which means that new items are needed more frequently, and thus annual test development costs are higher, even for a fixed test variety. One cause is that AI systems can acquire new affordances that motivate new ways of evaluating them. Another is that the contents of AI tests are typically exposed on the Web, free for the next generation of AI systems to train on, so benchmarks need to be refreshed regularly just to keep pace. Improvements in fundamental training methods compound with both, pushing AI through capability levels more quickly and shortening the shelf lives of tests.
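One way to see the cost impact of shorter shelf lives is to amortize a fixed item-bank cost over the period before the test must be replaced. The bank cost and shelf-life values below are hypothetical, chosen only to show the direction of the effect.

```python
# Amortized annual development cost rises as a test's shelf life shrinks.
# The item-bank cost and shelf-life values are hypothetical.

bank_development_cost = 2_000_000     # hypothetical cost to build one test's item bank

for shelf_life_years in (5, 2, 0.5):  # e.g. a stable human test vs. a fast-saturating AI benchmark
    annual_cost = bank_development_cost / shelf_life_years
    print(f"Shelf life of {shelf_life_years} years -> ${annual_cost:,.0f} per year")
```

With only a few dozen software examinees to spread that cost across, the per-examinee economics look very different from the SAT’s.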
Who Would Pay?
With these costs in mind, the key question is “Who would pay to develop better AI tests, particularly if better tests would likely be more expensive?”
(EDIT: Although this section focuses on market-based demand forces, it's worth noting that a significant share of funding for AI test development comes from government and philanthropic sources.)
On the supply side, most notable AI evaluations are produced by academic ML researchers, who post them freely and openly for others to use. In the testing industry, by contrast, the contents of a test are generally kept private before it is administered and remain non-disclosed even after, so that developers can maximize the usage they get out of their costly items. Academic ML researchers go against this practice because they are incentivized to have their work cited, and it is harder to get citations for a private test. This dynamic has been the main driver behind the production of AI tests to date, and I am unsure whether continuing down this path will lead to a much better AI ecosystem.
On the demand side, beyond academia, AI companies themselves do substantial in-house testing of their models, including via tests that others produce. Bringing test development in-house as well has the benefit of allowing tighter feedback loops between a company’s tests and the rest of its technologies and processes. If an AI company develops a homegrown test that is very informative and integrated with its internal IP, why would it expose that test to the outside world? Instead, AI companies can simply withhold the proprietary tests they develop, use their scores on externally-developed open tests for marketing, and import as many externally-developed proprietary tests as they can get their hands on for internal use. This is plausibly already happening.
Individuals who want to build better, more useful, or more valid tests rely to a certain extent on AI companies to permit and support their work. In one ongoing example, the developer of a popular model comparison website works with AI companies to allow them to add their upcoming models onto the site under codenames. That developer benefits from early access to models, but this also allows AI companies to use the benchmark to A/B test their models. Test developers may even take funding from the AI companies whose models they are testing, in exchange for increased access to the test contents. Ties like these can compromise the credible impartiality of their tests.
As mentioned earlier, there are far fewer different advanced AI systems in need of testing than different human examinees in need of testing. Even if AI companies began paying for tests, the level of customer concentration would complicate the incentives to run an independent test developer. An AI test developer is more dependent on each individual AI company it serves than traditional test developers are on each individual human examinee they serve.
Improving the Incentives
Will this incentive landscape change? I’m not really sure, but it seems unlikely to me without an intervention or a shock to the system.
Software examinees lend themselves to concentration, relative to wetware examinees. There is a huge demand for, say, legal talent that the best-qualified law graduates cannot satisfy, which means that other law graduates can make a living. But the best software for a task can be scaled up to meet whatever the market demands. There is not much of a market for AI also-rans.
Although concentration in the AI market reduces the number of potential buyers for tests, that doesn’t mean the market size must be smaller: the buyers’ small number is compensated for by their massive budgets. These AI companies will likely pour much more into training and deployment than is spent on all forms of human standardized testing globally. Even 1% of that investment could dramatically raise the quality of available AI tests, if spent well.
Commercial aviation offers a point of comparison. Aircraft safety relies on testing, yet there are only a few manufacturers, which limits the market for independent testing. Governments instead require that every new aircraft design undergo “type certification”, where the manufacturer conducts their own tests to demonstrate compliance and regulators oversee/review their work. Crucially, a scheme like this requires the government to secure a source of funding: in aviation, this funding comes largely from a mix of industry-derived fees and taxpayer dollars.
If we want a competitive, independent market of AI tests, there needs to be robust demand for them. Right now, aside from academic publishing, the main forces creating demand for third-party AI tests are corporate commitments. Because of the promises that frontier AI companies have made, firms like Gryphon Scientific are working with them to develop better tests for key CBRN weapons R&D abilities. In addition to more commitments that pull forward funding for better independent AI tests, I think the ecosystem would benefit from better organizational structures for test development. In the assessment industry, test developers often have a corporate firewall between any divisions that make/sell practice materials for their tests and divisions that make/sell their actual tests, to avoid conflicts of interest in the development process. Remember that in testing, impartiality is the name of the game.