Dynamic Planning for LLM-based Graphical User Interface Automation

Abstract

The advent of large language models (LLMs) has spurred considerable interest in advancing autonomous LLM-based agents, particularly in intriguing applications within smartphone graphical user interfaces (GUIs). When presented with a task goal, these agents typically emulate human actions within a GUI environment until the task is completed. However, a key challenge lies in devising effective plans to guide action prediction in GUI tasks, even though planning has been widely recognized as effective for decomposing complex tasks into a series of steps. Specifically, given the dynamic nature of the GUI environment following action execution, it is crucial to dynamically adapt plans based on environmental feedback and action history. We show that the widely used ReAct approach fails due to excessively long historical dialogues. To address this challenge, we propose a novel approach called Dynamic Planning of Thoughts (D-PoT) for LLM-based GUI agents. D-PoT involves the dynamic adjustment of planning based on environmental feedback and execution history. Experimental results reveal that the proposed D-PoT significantly surpasses the strong GPT-4V baseline by 12.7 percentage points in accuracy (34.66% $\rightarrow$ 47.36%). The analysis highlights the generality of dynamic planning across different backbone LLMs, as well as its benefits in mitigating hallucinations and adapting to unseen tasks. Code is available at https://github.com/sqzhang-lazy/D-PoT.