WebLists: Extracting Structured Information From Complex Interactive Websites Using Executable LLM Agents

Abstract

Most recent web agent research has focused on navigation and transaction tasks, with little emphasis on extracting structured data at scale. We present WebLists, a benchmark of 200 data-extraction tasks across four common business and enterprise use-cases. Each task requires an agent to navigate to a webpage, configure it appropriately, and extract complete datasets with well-defined schemas. We show that both LLMs with search capabilities and SOTA web agents struggle with these tasks, with a recall of 3% and 31%, respectively, despite higher performance on question-answering tasks. To address this challenge, we propose BardeenAgent, a novel framework that enables web agents to convert their execution into repeatable programs, and replay them at scale across pages with similar structure. BardeenAgent is also the first LLM agent to take advantage of the regular structure of HTML. In particular, BardeenAgent constructs a generalizable CSS selector to capture all relevant items on the page, then fits the operations to extract the data. On the WebLists benchmark, BardeenAgent achieves 66% recall overall, more than doubling the performance of SOTA web agents, and reducing cost per output row by 3x.
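
To make the idea of a generalizable CSS selector concrete, the following is a minimal sketch, not BardeenAgent's actual algorithm. It assumes two example items on a page are each described by a path of steps (tag, classes, sibling position) and merges them by keeping only what they share. The `Step` type and the `generalize` and `to_css` helpers are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    """One step of a hypothetical element path (illustrative, not from the paper)."""
    tag: str
    classes: frozenset
    index: Optional[int]  # 1-based nth-child position; None means "any position"


def generalize(path_a: List[Step], path_b: List[Step]) -> List[Step]:
    """Merge two example paths into one path that matches both examples."""
    if len(path_a) != len(path_b):
        raise ValueError("this sketch only handles paths of equal depth")
    merged = []
    for a, b in zip(path_a, path_b):
        if a.tag != b.tag:
            raise ValueError("tags differ; the examples are not structurally alike")
        merged.append(Step(
            tag=a.tag,
            classes=a.classes & b.classes,  # keep only the shared classes
            index=a.index if a.index == b.index else None,  # drop differing positions
        ))
    return merged


def to_css(path: List[Step]) -> str:
    """Render a path as a CSS selector using child combinators."""
    parts = []
    for step in path:
        part = step.tag + "".join("." + c for c in sorted(step.classes))
        if step.index is not None:
            part += f":nth-child({step.index})"
        parts.append(part)
    return " > ".join(parts)


# Two example rows of the same list, as an agent might have located them.
first = [
    Step("ul", frozenset({"results"}), 1),
    Step("li", frozenset({"row", "odd"}), 1),
    Step("a", frozenset({"name"}), 1),
]
second = [
    Step("ul", frozenset({"results"}), 1),
    Step("li", frozenset({"row", "even"}), 2),
    Step("a", frozenset({"name"}), 1),
]

selector = to_css(generalize(first, second))
print(selector)  # ul.results:nth-child(1) > li.row > a.name:nth-child(1)
```

Because the position and the row-specific classes on the `li` step are dropped, the resulting selector would match every row of the list rather than only the two examples, which is what allows the same extraction step to be replayed on other pages with the same structure.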