Diffusion Policy Policy Optimization

Abstract

We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they had been conjectured to be less efficient for diffusion-based policies. Surprisingly, we show that DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks compared to other RL methods for diffusion-based policies and also compared to PG fine-tuning of other policy parameterizations. Through experimental investigation, we find that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization, leading to structured and on-manifold exploration, stable training, and strong policy robustness. We further demonstrate the strengths of DPPO in a range of realistic settings, including simulated robotic tasks with pixel observations, and via zero-shot deployment of simulation-trained policies on robot hardware in a long-horizon, multi-stage manipulation task. Website with code: diffusion-ppo.github.io