CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games

Abstract

Recent advances in Vision-Language-Action models (VLAs) have expanded the capabilities of embodied intelligence. However, significant challenges remain in real-time decision-making in complex 3D environments, which demand second-level responses, high-resolution perception, and tactical reasoning under dynamic conditions. To advance the field, we introduce CombatVLA, an efficient VLA model optimized for combat tasks in 3D action role-playing games (ARPGs). Specifically, our CombatVLA is a 3B model trained on video-action pairs collected by an action tracker, where the data is formatted as action-of-thought (AoT) sequences. Thereafter, CombatVLA seamlessly integrates into an action execution framework, allowing efficient inference through our truncated AoT strategy. Experimental results demonstrate that CombatVLA not only outperforms all existing models on the combat understanding benchmark but also achieves a 50-fold acceleration in game combat. Moreover, it has a higher task success rate than human players. We will open-source all resources, including the action tracker, dataset, benchmark, model weights, training code, and the implementation of the framework at https://combatvla.github.io/.