Most AI benchmarks evaluate what models know. METR, an AI safety research organization in Berkeley, instead examines what models can do and for how long.
METR introduced the task-completion time horizon, a metric that measures the length of tasks (based on the time a human expert would need) that a frontier AI agent can complete with 50% reliability. If a model has a 50% time horizon of two hours, it succeeds about half the time on tasks that take a skilled human two hours; it tends to do better on shorter tasks and worse on longer ones (https://metr.org/).
METR’s headline finding is the pace of growth: over the past six years, the time horizon has doubled approximately every seven months. If this trend continues through the decade, AI agents could autonomously execute projects that would take a human expert a month. At that stage, a model would function as an autonomous agent, capable of navigating ambiguity and managing complex, multi-step decisions without human intervention.
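The doubling arithmetic can be checked with a minimal sketch. It assumes steady exponential growth at the seven-month doubling period quoted above. The two-hour starting horizon is only the illustrative figure from the definition, not a measured value for any model, and the 160-hour work month (4 weeks of 40 hours) is an assumption made here for illustration. The function names are hypothetical.

```python
import math

DOUBLING_MONTHS = 7.0      # doubling period quoted in the text
START_HOURS = 2.0          # illustrative starting horizon (assumption, not a measurement)
WORK_MONTH_HOURS = 160.0   # assumption: one human work month = 4 weeks x 40 hours


def horizon_after(months, start_hours=START_HOURS, doubling_months=DOUBLING_MONTHS):
    """Projected time horizon in hours after `months` of steady doubling."""
    return start_hours * 2 ** (months / doubling_months)


def months_to_reach(target_hours, start_hours=START_HOURS, doubling_months=DOUBLING_MONTHS):
    """Months of steady doubling needed to grow from start_hours to target_hours."""
    return doubling_months * math.log2(target_hours / start_hours)


if __name__ == "__main__":
    months = months_to_reach(WORK_MONTH_HOURS)
    print(f"Months to reach a one-month horizon: {months:.1f}")
    print(f"Horizon after one doubling period: {horizon_after(DOUBLING_MONTHS):.1f} hours")
```

Under these assumptions, going from two hours to a 160-hour work month takes a little over six doublings, or roughly 44 months. A different starting horizon or definition of a work month shifts the result, so the figure is an illustration of the extrapolation, not a forecast.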
METR’s tasks, however, are well-defined and controlled, and performance declines on messier, more complex real-world tasks. The time horizon should therefore be read as a measurement under favorable conditions, not as a guarantee of real-world capability.
Nonetheless, the direction of the trend is clear. For technology services firms, this capability curve presents both challenges and opportunities: the need to evolve delivery models, and the potential to integrate agents into workflows to improve margins and create new service offerings.
Elaxtra Advisors is an M&A and value-creation advisory firm that helps institutional investors, private equity-owned platforms, and strategic acquirers invest in and create value at technology services companies worldwide. Please contact us to explore potential partnerships.