TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

  • 2024-12-18 18:55:40
  • Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, Graham Neubig

Abstract

We interact with computers every day, whether in daily life or at work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been rapid development in AI agents that interact with and effect change in their surrounding environments. But how performant are AI agents at helping to accelerate or even autonomously perform work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into its workflows and for economic policy, which must account for the effects that adoption of AI may have on the labor market. To measure how well these LLM agents perform real-world professional tasks, we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in ways similar to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that with the most competitive agent, 24% of the tasks can be completed autonomously. This paints a nuanced picture of task automation with LM agents: in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.