REBEL: Reinforcement Learning via Regressing Relative Rewards

  • 2024-12-10 03:17:30
  • Zhaolin Gao, Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, Wen Sun

Abstract

While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g. value networks, clipping), and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative reward between two completions to a prompt in terms of the policy, enabling strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. REBEL can also cleanly incorporate offline data and be extended to handle the intransitive preferences we frequently see in practice. Empirically, we find that REBEL provides a unified approach to language modeling and image generation, with performance stronger than or similar to that of PPO and DPO, all while being simpler to implement and more computationally efficient than PPO. When fine-tuning Llama-3-8B-Instruct, REBEL achieves strong performance on AlpacaEval 2.0, MT-Bench, and the Open LLM Leaderboard.
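
To make "regressing the relative reward between two completions in terms of the policy" concrete, here is a minimal sketch of one way such an objective could look. The abstract does not state the loss, so everything here is an assumption for illustration: the squared-error form, the use of log-probability ratios between the current and previous policy, the scale parameter `eta`, and the function names `relative_reward_loss` and `batch_loss`.

```python
def relative_reward_loss(logp_new_a, logp_old_a,
                         logp_new_b, logp_old_b,
                         reward_a, reward_b, eta=1.0):
    """Squared error between a policy-based prediction of the reward gap
    and the observed reward gap for two completions (a, b) of one prompt.

    Assumed form (not taken from the abstract): the prediction is the
    difference of log-probability ratios (new policy vs. old policy),
    divided by a hypothetical scale `eta`.
    """
    # Log-ratio of new to old policy for each completion.
    log_ratio_a = logp_new_a - logp_old_a
    log_ratio_b = logp_new_b - logp_old_b

    # Policy-implied relative reward vs. observed relative reward.
    predicted_gap = (log_ratio_a - log_ratio_b) / eta
    target_gap = reward_a - reward_b

    return (predicted_gap - target_gap) ** 2


def batch_loss(pairs, eta=1.0):
    """Mean loss over (logp_new_a, logp_old_a, logp_new_b, logp_old_b,
    reward_a, reward_b) tuples."""
    return sum(relative_reward_loss(*p, eta=eta) for p in pairs) / len(pairs)


# Illustrative numbers only: an unchanged policy on a pair whose rewards
# differ by 2 incurs a squared error of 4.
example = (-1.0, -1.0, -2.0, -2.0, 2.0, 0.0)
print(relative_reward_loss(*example))
```

Under these assumptions, the sketch needs only log-probabilities and scalar rewards for a pair of completions; it involves no value network and no clipping, which is the kind of simplification the abstract contrasts with PPO.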