One Framework to Rule Them All: Unifying RL-Based and RL-Free Methods in RLHF

Abstract

In this article, we primarily examine a variety of RL-based and RL-free methods designed to address Reinforcement Learning from Human Feedback (RLHF) and Large Reasoning Models (LRMs). We begin with a concise overview of the typical steps involved in RLHF and LRMs. Next, we reinterpret several RL-based and RL-free algorithms through the perspective of neural structured bandit prediction, providing a clear conceptual framework that uncovers a deeper connection between these seemingly distinct approaches. Following this, we briefly review some core principles of reinforcement learning, drawing attention to an often-overlooked aspect in existing RLHF studies. This leads to a detailed derivation of the standard RLHF objective within a full RL context, demonstrating its equivalence to neural structured bandit prediction. Finally, by reinvestigating the principles behind Proximal Policy Optimization (PPO), we pinpoint areas needing adjustment, which culminates in the introduction of the Generalized Reinforce Optimization (GRO) framework, seamlessly integrating RL-based and RL-free methods in RLHF. We look forward to the community's efforts to empirically validate GRO and invite constructive feedback.