Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization

Abstract

Reinforcement learning from human feedback (RLHF) is an effective method for aligning large language models (LLMs) with human values. However, reward over-optimization remains an open challenge, leading to discrepancies between the performance of LLMs under the reward model and the true human objectives. A primary contributor to reward over-optimization is the extrapolation error that arises when the reward model evaluates out-of-distribution (OOD) responses. Current methods still fail to prevent the increasing frequency of OOD response generation during the reinforcement learning (RL) process and are not effective at handling extrapolation errors from OOD responses. In this work, we propose the Behavior-Supported Policy Optimization (BSPO) method to mitigate the reward over-optimization issue. Specifically, we define the behavior policy as the next-token distribution of the reward training dataset to model the in-distribution (ID) region of the reward model. Building on this, we introduce the behavior-supported Bellman operator to regularize the value function, penalizing all OOD values without impacting the ID ones. Consequently, BSPO reduces the generation of OOD responses during the RL process, thereby avoiding overestimation caused by the reward model's extrapolation errors. Theoretically, we prove that BSPO guarantees a monotonic improvement of the supported policy until convergence to the optimal behavior-supported policy. Empirical results from extensive experiments show that BSPO outperforms baselines in preventing reward over-optimization due to OOD evaluation and finding the optimal ID policy.
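
To make the idea of penalizing OOD values while leaving ID values untouched concrete, the following is a minimal tabular sketch, not the method's actual implementation. It assumes a tiny deterministic environment, treats an action as OOD when its behavior-policy probability is at or below a hypothetical threshold `eps`, and assigns such actions a hypothetical floor value `q_min`. The function name `behavior_supported_backup` and all numbers are illustrative assumptions.

```python
# Toy sketch of a behavior-supported Bellman backup (illustrative only).
# Assumptions: tabular states/actions, deterministic transitions, and an
# action counts as OOD when its behavior probability is <= eps.

def behavior_supported_backup(q, reward, next_state, policy, behavior,
                              gamma, q_min, eps=0.0):
    """Apply one backup: OOD actions get q_min, ID actions get the
    ordinary expected Bellman target under `policy`."""
    new_q = []
    for s in range(len(q)):
        row = []
        for a in range(len(q[s])):
            if behavior[s][a] <= eps:
                # Unsupported action: ignore the (possibly extrapolated)
                # reward and pin the value to the floor.
                row.append(q_min)
                continue
            ns = next_state[s][a]
            future = 0.0
            if ns is not None:  # None marks the end of the episode
                future = sum(policy[ns][b] * q[ns][b]
                             for b in range(len(q[ns])))
            row.append(reward[s][a] + gamma * future)
        new_q.append(row)
    return new_q


# Two states, two actions. Action 1 in state 0 never appears in the
# reward training data, yet the reward model scores it highly (10.0).
reward = [[0.5, 10.0], [1.0, 0.0]]
next_state = [[1, 1], [None, None]]
behavior = [[1.0, 0.0], [0.5, 0.5]]
policy = [[0.5, 0.5], [0.5, 0.5]]
Q_MIN = -1.0

q = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(5):
    q = behavior_supported_backup(q, reward, next_state, policy,
                                  behavior, gamma=1.0, q_min=Q_MIN)
```

In this toy setting, the unsupported action in state 0 stays at the floor value regardless of its inflated reward, while the supported actions converge to their ordinary Bellman values.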