CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction

  • 2024-12-09 18:59:18
  • Zhefei Gong, Pengxiang Ding, Shangke Lyu, Siteng Huang, Mingyang Sun, Wei Zhao, Zhaoxin Fan, Donglin Wang

Abstract

In robotic visuomotor policy learning, diffusion-based models have achieved significant success in improving the accuracy of action trajectory generation compared to traditional autoregressive models. However, they suffer from inefficiency due to multiple denoising steps and limited flexibility from complex constraints. In this paper, we introduce Coarse-to-Fine AutoRegressive Policy (CARP), a novel paradigm for visuomotor policy learning that redefines the autoregressive action generation process as a coarse-to-fine, next-scale approach. CARP decouples action generation into two stages: first, an action autoencoder learns multi-scale representations of the entire action sequence; then, a GPT-style transformer refines the sequence prediction through a coarse-to-fine autoregressive process. This straightforward and intuitive approach produces highly accurate and smooth actions, matching or even surpassing the performance of diffusion-based policies while maintaining efficiency on par with autoregressive policies. We conduct extensive evaluations across diverse settings, including single-task and multi-task scenarios on state-based and image-based simulation benchmarks, as well as real-world tasks. CARP achieves competitive success rates, with up to a 10% improvement, and delivers 10x faster inference compared to state-of-the-art policies, establishing a high-performance, efficient, and flexible paradigm for action generation in robotic tasks.
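
To make the coarse-to-fine, multi-scale idea in the abstract more concrete, here is a minimal Python sketch of how an action sequence can be split into per-scale maps and rebuilt from coarse to fine. It is not the CARP implementation: the paper's action autoencoder is learned, whereas this sketch swaps in fixed linear interpolation, and it omits the GPT-style transformer entirely. The helper names (`resample`, `encode_multiscale`, `decode_multiscale`), the toy trajectory, and the scale schedule `(2, 4, 8, 16)` are assumptions made for illustration only.

```python
import numpy as np


def resample(seq, length):
    """Linearly resample a (T, D) sequence to (length, D) along the time axis."""
    t_old = np.linspace(0.0, 1.0, seq.shape[0])
    t_new = np.linspace(0.0, 1.0, length)
    cols = [np.interp(t_new, t_old, seq[:, d]) for d in range(seq.shape[1])]
    return np.stack(cols, axis=1)


def encode_multiscale(actions, scales):
    """Split an action sequence into residual maps, coarsest scale first.

    Illustrative stand-in for a learned multi-scale encoder (assumption).
    """
    horizon = actions.shape[0]
    residual = actions.copy()
    maps = []
    for s in scales:
        coarse = resample(residual, s)  # summarize what is still unexplained
        maps.append(coarse)
        residual = residual - resample(coarse, horizon)
    return maps


def decode_multiscale(maps, horizon):
    """Rebuild the sequence by summing upsampled maps from coarse to fine."""
    out = np.zeros((horizon, maps[0].shape[1]))
    for m in maps:
        out = out + resample(m, horizon)
    return out


# Toy 2-D action trajectory over a horizon of 16 steps (hypothetical data).
horizon = 16
t = np.linspace(0.0, 1.0, horizon)
actions = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)

scales = (2, 4, 8, 16)  # assumed schedule; the finest scale equals the horizon
maps = encode_multiscale(actions, scales)
recon = decode_multiscale(maps, horizon)
```

Because the finest scale matches the horizon, summing all maps recovers the original trajectory up to floating-point error, while summing only the first few maps gives a coarser approximation. In a next-scale autoregressive setup, a model would predict each map conditioned on the coarser ones before it, rather than reading them from an encoder as done here.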