Abstract
Large language models (LLMs) demonstrate strong potential as agents for tool invocation due to their advanced comprehension and planning capabilities. Users increasingly rely on LLM-based agents to solve complex missions through iterative interactions. However, existing benchmarks predominantly assess agents in single-mission scenarios, failing to capture real-world complexity. To bridge this gap, we propose the Multi-Mission Tool Bench. In the benchmark, each test case comprises multiple interrelated missions. This design requires agents to dynamically adapt to evolving demands. Moreover, the proposed benchmark explores all possible mission-switching patterns for a fixed number of missions. To construct the benchmark, we propose a multi-agent data generation framework. We also propose a novel method to evaluate the accuracy and efficiency of agent decisions with dynamic decision trees. Experiments on diverse open-source and closed-source LLMs reveal critical factors influencing agent robustness and provide actionable insights to the tool invocation community.