RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy

Abstract

Reasoning before action and imagining potential outcomes (i.e., world models) are essential for embodied agents operating in complex open-world environments. Yet, prior work either incorporates only one of these abilities in an end-to-end agent or integrates multiple specialized models into an agent system, limiting the learning efficiency and generalization of the policy. Thus, this paper makes the first attempt to synergize Reasoning and Imagination in an end-to-end Generalist policy, termed RIG. To train RIG in an end-to-end manner, we construct a data pipeline that progressively integrates and enriches the content of imagination and reasoning in the trajectories collected from existing agents. The joint learning of reasoning and next image generation explicitly models the inherent correlation between reasoning, action, and the dynamics of environments, and thus exhibits more than $17\times$ sample efficiency improvements and generalization in comparison with previous works. During inference, RIG first reasons about the next action, produces a potential action, and then predicts the action outcomes, which offers the agent a chance to review and self-correct based on the imagination before taking real actions. Experimental results show that the synergy of reasoning and imagination not only improves the robustness, generalization, and interoperability of the generalist policy but also enables test-time scaling to enhance overall performance.
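
The inference procedure described above (reason, propose an action, imagine its outcome, then review and self-correct before acting) can be illustrated with a minimal sketch. This is not RIG's implementation: the abstract describes a single end-to-end policy, whereas the sketch below splits the steps into three hypothetical callables (`propose`, `imagine`, `review`) and replaces image prediction with a toy 1-D corridor so that the control flow is runnable. The name `act_with_review` and the `max_revisions` budget are likewise assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    reasoning: str   # textual rationale produced before acting
    action: str      # candidate action
    imagined: int    # toy stand-in for a predicted next image


def act_with_review(
    observation: int,
    propose: Callable[[int, str], Tuple[str, str]],
    imagine: Callable[[int, str], int],
    review: Callable[[int, str, int], Tuple[bool, str]],
    max_revisions: int = 2,
) -> Tuple[str, List[Step]]:
    """Reason -> propose -> imagine -> review; revise while the review rejects."""
    history: List[Step] = []
    feedback = ""
    for _ in range(max_revisions + 1):
        reasoning, action = propose(observation, feedback)
        imagined = imagine(observation, action)
        history.append(Step(reasoning, action, imagined))
        accepted, feedback = review(observation, action, imagined)
        if accepted:
            break
    # If the budget runs out, the last candidate is executed anyway.
    return history[-1].action, history


# --- Toy stand-ins (hypothetical; they only exercise the loop) ---
GOAL = 3  # target position in a 1-D corridor


def toy_propose(position: int, feedback: str) -> Tuple[str, str]:
    # Deliberately poor first guess, corrected once feedback is available.
    if feedback:
        return ("the imagined outcome was worse, so try the other way", "right")
    return ("no information yet, try left", "left")


def toy_imagine(position: int, action: str) -> int:
    # Toy dynamics model: predicts the next position.
    return position + (1 if action == "right" else -1)


def toy_review(position: int, action: str, imagined: int) -> Tuple[bool, str]:
    # Accept only if the imagined outcome is closer to the goal.
    if abs(GOAL - imagined) < abs(GOAL - position):
        return True, ""
    return False, f"'{action}' moves away from the goal"


action, history = act_with_review(0, toy_propose, toy_imagine, toy_review)
print(action, [step.action for step in history])
```

In this sketch, `max_revisions` bounds how many imagined rollouts are reviewed before a real action is taken. Raising it spends more inference compute per step, which is one possible reading of the test-time scaling mentioned in the abstract.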