DIPPER: Direct Preference Optimization to Accelerate Primitive-Enabled Hierarchical Reinforcement Learning

Abstract

Learning control policies to perform complex robotics tasks from human preference data presents significant challenges. On the one hand, the complexity of such tasks typically requires learning policies to perform a variety of subtasks, then combining them to achieve the overall goal. At the same time, comprehensive, well-engineered reward functions are typically unavailable in such problems, while limited human preference data often is; making efficient use of such data to guide learning is therefore essential. Methods for learning to perform complex robotics tasks from human preference data must overcome both these challenges simultaneously. In this work, we introduce DIPPER: Direct Preference Optimization to Accelerate Primitive-Enabled Hierarchical Reinforcement Learning, an efficient hierarchical approach that leverages direct preference optimization to learn a higher-level policy and reinforcement learning to learn a lower-level policy. DIPPER enjoys improved computational efficiency due to its use of direct preference optimization instead of standard preference-based approaches such as reinforcement learning from human feedback, while it also mitigates the well-known hierarchical reinforcement learning issues of non-stationarity and infeasible subgoal generation due to our use of primitive-informed regularization inspired by a novel bi-level optimization formulation of the hierarchical reinforcement learning problem. To validate our approach, we perform extensive experimental analysis on a variety of challenging robotics tasks, demonstrating that DIPPER outperforms hierarchical and non-hierarchical baselines, while ameliorating the non-stationarity and infeasible subgoal generation issues of hierarchical reinforcement learning.
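
To make the direct preference optimization ingredient concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not DIPPER's objective: the abstract does not specify the form of the primitive-informed regularization, so that term is omitted here. The function name `dpo_loss`, its argument names, and the default `beta` are hypothetical and chosen only for illustration. In a hierarchical setting, the "chosen" and "rejected" items could be read as preferred and dispreferred higher-level behaviours, but that mapping is an assumption.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch only).

    All arguments are log-probabilities: the first two under the policy being
    trained, the next two under a fixed reference policy. `beta` scales the
    implicit reward. None of these names come from the DIPPER paper.
    """
    # Implicit reward margin: how much more the trained policy favours the
    # chosen item over the rejected one, relative to the reference policy.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the trained policy matches the reference, the margin is zero and the loss equals log 2; the loss falls below that value as the policy shifts probability toward the preferred item. No separate reward model is fitted, which is the source of the computational saving over reinforcement learning from human feedback that the abstract refers to.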