Multi-Mission Tool Bench: Assessing the Robustness of LLM based Agents through Related and Dynamic Missions

Abstract

Large language models (LLMs) demonstrate strong potential as agents for tool invocation due to their advanced comprehension and planning capabilities. Users increasingly rely on LLM-based agents to solve complex missions through iterative interactions. However, existing benchmarks predominantly assess agents in single-mission scenarios, failing to capture real-world complexity. To bridge this gap, we propose the Multi-Mission Tool Bench. In the benchmark, each test case comprises multiple interrelated missions. This design requires agents to dynamically adapt to evolving demands. Moreover, the proposed benchmark explores all possible mission-switching patterns for a fixed number of missions. To construct the benchmark, we propose a multi-agent data generation framework. We also propose a novel method to evaluate the accuracy and efficiency of agent decisions with dynamic decision trees. Experiments on diverse open-source and closed-source LLMs reveal critical factors influencing agent robustness and provide actionable insights to the tool invocation community.