Abstract
Advancements in large language models (LLMs) have paved the way for LLM-based agent systems that offer enhanced accuracy and interpretability across various domains. Radiology, with its complex analytical requirements, is an ideal field for the application of these agents. This paper investigates the prerequisite question for building concrete radiology agents: "Can modern LLMs act as agent cores in radiology environments?" To answer it, we introduce RadABench, with three contributions. First, we present RadABench-Data, a comprehensive synthetic evaluation dataset for LLM-based agents, generated from an extensive taxonomy encompassing 6 anatomies, 5 imaging modalities, 10 tool categories, and 11 radiology tasks. Second, we propose RadABench-EvalPlat, a novel evaluation platform for agents featuring a prompt-driven workflow and the capability to simulate a wide range of radiology toolsets. Third, we assess the performance of 7 leading LLMs on our benchmark from 5 perspectives with multiple metrics. Our findings indicate that while current LLMs demonstrate strong capabilities in many areas, they are still not sufficiently advanced to serve as the central agent core in a fully operational radiology agent system. Additionally, we identify key factors influencing the performance of LLM-based agent cores, offering insights for clinicians on how to apply agent systems effectively in real-world radiology practice. All of our code and data are open-sourced at https://github.com/MAGIC-AI4Med/RadABench.