Abstract
The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Despite its success in language models, its application in multimodal domains, particularly in graphical user interface (GUI) agent tasks, remains under-explored. To address this issue, we propose UI-R1, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks. Specifically, UI-R1 introduces a novel rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). For efficient training, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices. Experimental results demonstrate that our proposed UI-R1-3B achieves significant improvements over the base model (i.e., Qwen2.5-VL-3B) on both in-domain (ID) and out-of-domain (OOD) tasks, with average accuracy gains of 22.1% on ScreenSpot, 6.0% on ScreenSpot-Pro, and 12.7% on ANDROIDCONTROL. Furthermore, UI-R1-3B delivers competitive performance compared to larger models (e.g., OS-Atlas-7B) trained via supervised fine-tuning (SFT) on 76K samples. These results underscore the potential of rule-based reinforcement learning to advance GUI understanding and control, paving the way for future research in this domain. Code website: https://github.com/lll6gg/UI-R1.
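To make the two ingredients named above more concrete (a rule-based action reward and GRPO's group-relative advantage), the following is a minimal sketch rather than the UI-R1 implementation. The reward components (action-type match plus a click landing inside a ground-truth bounding box), their equal weighting, the dictionary layout, and the function names `action_reward` and `group_relative_advantages` are all assumptions made for illustration. The abstract does not specify them.

```python
from statistics import mean, pstdev


def action_reward(pred, truth):
    """Hypothetical rule-based reward (an assumption, not the paper's definition).

    +1 if the predicted action type matches the ground truth,
    +1 more if the action is a click whose point lies inside the target box.
    """
    reward = 0.0
    if pred["action"] == truth["action"]:
        reward += 1.0
        if truth["action"] == "click":
            x, y = pred["point"]
            x1, y1, x2, y2 = truth["bbox"]
            if x1 <= x <= x2 and y1 <= y <= y2:
                reward += 1.0
    return reward


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group, as in GRPO.

    Uses the population standard deviation; the exact estimator and eps
    are assumptions of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One task, three sampled responses from the policy (made-up values).
truth = {"action": "click", "bbox": (10, 10, 50, 50)}
samples = [
    {"action": "click", "point": (20, 20)},    # right type, inside the box
    {"action": "click", "point": (100, 100)},  # right type, outside the box
    {"action": "scroll"},                      # wrong action type
]
rewards = [action_reward(s, truth) for s in samples]
advantages = group_relative_advantages(rewards)
```

Because the advantage is computed relative to the other responses sampled for the same task, no learned value model or reward model is needed; the rule-based score alone determines which responses are reinforced.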