Boosting Virtual Agent Learning and Reasoning: A Step-wise, Multi-dimensional, and Generalist Reward Model with Benchmark

Abstract

The development of Generalist Virtual Agents (GVAs) powered by Multimodal Large Language Models (MLLMs) has shown significant promise in autonomous task execution. However, current training paradigms face critical limitations, including reliance on outcome supervision and labor-intensive human annotations. To address these challenges, we propose Similar, a Step-wise Multi-dimensional Generalist Reward Model, which offers fine-grained signals for agent training and can select better actions for inference-time scaling. Specifically, we begin by systematically defining five dimensions for evaluating agent actions. Building on this framework, we design an MCTS-P algorithm to automatically collect and annotate step-wise, five-dimensional agent execution data. Using this data, we train Similar with the Triple-M strategy. Furthermore, we introduce the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation, named SRM. This benchmark consists of two components: SRMTrain, which serves as the training set for Similar, and SRMEval, a manually selected test set for evaluating the reward model. Experimental results demonstrate that Similar, through its step-wise, multi-dimensional assessment and synergistic gain, provides GVAs with effective intermediate signals during both training and inference-time scaling. The code is available at https://github.com/Galery23/Similar-v1.