Abstract
Recent advances in learning decision-making policies can largely be attributed to training expressive policy models, largely via imitation learning. While imitation learning discards non-expert data, reinforcement learning (RL) can still learn from suboptimal data. However, instantiating RL training of a new policy class often presents a different challenge: most deep RL machinery is co-developed with assumptions on the policy class and backbone, resulting in poor performance when the policy class changes. For instance, SAC utilizes a low-variance reparameterization policy gradient for Gaussian policies, but this is unstable for diffusion policies and intractable for autoregressive categorical policies. To address this issue, we develop an offline RL and online fine-tuning approach called policy-agnostic RL (PA-RL) that can effectively train multiple policy classes, with varying architectures and sizes. We build off the basic idea that a universal supervised learning loss can replace the policy improvement step in RL, as long as it is applied on "optimized" actions. To obtain these optimized actions, we first sample multiple actions from a base policy, and run global optimization (i.e., re-ranking multiple action samples using the Q-function) and local optimization (i.e., running gradient steps on an action sample) to maximize the critic on these candidates. PA-RL enables fine-tuning diffusion and transformer policies with either autoregressive tokens or continuous action outputs, at different sizes, entirely via actor-critic RL. Moreover, PA-RL improves the performance and sample-efficiency by up to 2 times compared to existing offline RL and online fine-tuning methods. We show the first result that successfully fine-tunes OpenVLA, a 7B generalist robot policy, autonomously with Cal-QL, an online RL fine-tuning algorithm, improving from 40% to 70% in the real world in 40 minutes.
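
The action-optimization step described above (sample candidates from a base policy, re-rank them with the Q-function, then take gradient steps on the critic) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy quadratic critic `q_value`, its analytic gradient `q_grad`, the helper `optimize_actions`, the sampler `sample_fn`, and all hyperparameter values are hypothetical choices made for illustration, and the order of re-ranking before gradient ascent is an assumption.

```python
import numpy as np

def q_value(state, actions):
    # Toy critic (assumption): highest value when the action equals the state.
    return -np.sum((actions - state) ** 2, axis=-1)

def q_grad(state, actions):
    # Analytic gradient of the toy critic with respect to the actions.
    return -2.0 * (actions - state)

def optimize_actions(state, sample_fn, num_samples=16, top_k=4,
                     grad_steps=10, step_size=0.1):
    # 1) Sample several candidate actions from the base policy.
    candidates = sample_fn(state, num_samples)   # (num_samples, action_dim)
    # 2) Global optimization: keep the top_k candidates ranked by Q.
    scores = q_value(state, candidates)
    best = candidates[np.argsort(scores)[-top_k:]]
    # 3) Local optimization: gradient ascent on Q for each kept candidate.
    for _ in range(grad_steps):
        best = best + step_size * q_grad(state, best)
    return best

# Hypothetical base policy: Gaussian noise around the state.
rng = np.random.default_rng(0)

def sample_fn(state, n):
    return state + rng.normal(scale=1.0, size=(n, state.shape[0]))

state = np.array([0.5, -0.2])
targets = optimize_actions(state, sample_fn)
print(targets.shape)
```

In such a scheme, the returned `targets` would serve as labels for whatever supervised loss the policy class already supports (for example, a likelihood or denoising objective), so the critic is only queried and never differentiated through the policy itself.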