UK's AI Security Institute finds standard benchmarks systematically underestimate what AI agents can actually do
A UK AI Security Institute study found standard AI benchmarks underestimate agent capabilities by capping compute budgets, with success rates rising 25% on software tasks when token budgets increased tenfold.

- Standard AI benchmarks underestimate agent capabilities by capping compute budgets, leading to misleading performance measurements.
- Increasing token budgets tenfold raised success rates on software engineering tasks by approximately 25%, with newer models benefiting the most.
- Previous estimates of AI progress at the frontier may be understated by around 60% because benchmarks constrain compute.
- AISI’s findings call for revised evaluation frameworks that better reflect real-world computational conditions.
The UK's AI Security Institute (AISI) published a study of seven widely used AI benchmarks and identified a systematic flaw: by imposing strict compute limits, these evaluations underestimate what agents can do. When researchers raised token budgets tenfold, success rates on software engineering tasks increased by approximately 25%. The effect was most pronounced for newer models, which showed the steepest gains.
The findings suggest that previous measurements of AI progress at the frontier may have underestimated actual advancements by around 60%. This discrepancy arises because standard benchmarks often cap computational resources, failing to reflect real-world conditions where agents can leverage significantly more compute. The study highlights the need for revised evaluation frameworks that account for variable compute budgets to provide a more accurate picture of AI capabilities.
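As a rough illustration of why a capped budget can hide capability, the sketch below recomputes a success rate at several token budgets. The `TaskRun` record, the function names, and the numbers are all hypothetical and are not taken from the AISI study. It assumes each run logs how many tokens the agent had consumed when it first solved the task.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class TaskRun:
    """One agent attempt at one benchmark task (hypothetical record)."""
    task_id: str
    # Tokens consumed when the task was first solved; None if never solved.
    tokens_to_solve: Optional[int]


def success_rate(runs: List[TaskRun], token_budget: int) -> float:
    """Fraction of runs that would count as solved under the given budget."""
    if not runs:
        raise ValueError("runs must not be empty")
    solved = sum(
        1
        for run in runs
        if run.tokens_to_solve is not None and run.tokens_to_solve <= token_budget
    )
    return solved / len(runs)


def budget_sweep(runs: List[TaskRun], budgets: Iterable[int]) -> Dict[int, float]:
    """Success rate at each budget, smallest budget first."""
    return {budget: success_rate(runs, budget) for budget in sorted(budgets)}


# Made-up data for illustration only.
example_runs = [
    TaskRun("task-a", 50_000),
    TaskRun("task-b", 400_000),
    TaskRun("task-c", 900_000),
    TaskRun("task-d", None),
]

if __name__ == "__main__":
    for budget, rate in budget_sweep(example_runs, [100_000, 1_000_000]).items():
        print(f"budget={budget:>9,} tokens  success_rate={rate:.2f}")
```

In this toy data, two of the four tasks are solvable but only above the smaller cap, so a benchmark that reports a single number at the low budget would count them as failures. Reporting the whole sweep rather than one point is one way an evaluation could account for variable compute budgets.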
AISI’s research underscores a growing concern in AI evaluation: benchmarks designed to measure progress may inadvertently misrepresent it by ignoring the role of compute scaling. This revelation comes as AI agents become more prevalent in practical applications, where computational resources are less constrained than in controlled testing environments.
Source: UK's AI Security Institute finds standard benchmarks systematically underestimate what AI agents can actually do.
Developers should reconsider how they benchmark AI agents, ensuring evaluations account for variable compute budgets to avoid underestimating performance.
Companies deploying AI agents must understand that benchmark results may not reflect real-world capabilities, potentially leading to misaligned expectations.
Investors in AI companies should scrutinize benchmark claims, as understated performance could mask true competitive advantages.
The public may be misled by benchmark results that don’t reflect the full potential of AI agents in practical applications.
- token budget: The maximum number of tokens (words or subword units) an AI agent may consume while attempting a task, used in benchmarks to cap the compute spent per attempt.
- AI agents: Autonomous or semi-autonomous AI systems designed to perform tasks, often with the ability to use tools or interact with software environments.
