Abstract
Reinforcement learning from human feedback (RLHF) is one of the key techniques that help large language models (LLMs) follow instructions and provide helpful and harmless responses. While direct policy optimization methods exist, state-of-the-art LLMs adopt RL-based methods (usually PPO) in RLHF to train the policy to generate good responses guided by a reward model learned from preference data. The main challenge of these methods is the inaccuracy of the intermediate reward model, especially in code generation tasks that require long and complex reasoning to score a response. We find that the reliability of the reward model varies across responses assigned different rewards. This motivates us to filter out the samples whose rewards may be unreliable in order to improve the signal-to-noise ratio during policy learning, resulting in Policy Filtration for Proximal Policy Optimization (PF-PPO). To choose a proper policy filtration strategy for a given reward model, the coefficient of determination ($R^2$) between rewards and actual scores on filtered samples serves as a good metric and helps us find several promising strategies. We provide extensive experiments to validate the effectiveness of PF-PPO in code generation tasks, and find that some variants of PF-PPO are highly effective and achieve new state-of-the-art performance across 7-billion-parameter models on HumanEval, MBPP, and a new and more challenging LeetCode Contest benchmark.
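To make the role of $R^2$ concrete, the following is a minimal sketch, not the implementation used in this work. It assumes each sample is a (reward, actual score) pair, takes $R^2$ to be that of a least-squares line fit of scores on rewards (equivalently, the squared Pearson correlation), and uses one hypothetical filter that keeps only the highest- and lowest-reward samples. The function names `r_squared` and `keep_extremes` and the toy data are illustrative assumptions.

```python
from statistics import mean


def r_squared(rewards, scores):
    """R^2 of a least-squares line fit of scores on rewards.

    For a single predictor this equals the squared Pearson correlation.
    """
    mr, ms = mean(rewards), mean(scores)
    cov = sum((r - mr) * (s - ms) for r, s in zip(rewards, scores))
    var_r = sum((r - mr) ** 2 for r in rewards)
    var_s = sum((s - ms) ** 2 for s in scores)
    if var_r == 0 or var_s == 0:
        return 0.0  # degenerate case: no variation to explain
    return cov * cov / (var_r * var_s)


def keep_extremes(pairs, n_top, n_bottom):
    """Hypothetical filter: keep the n_top highest- and n_bottom
    lowest-reward samples (assumes n_top + n_bottom <= len(pairs))."""
    ranked = sorted(pairs, key=lambda p: p[0])  # ascending by reward
    return ranked[:n_bottom] + ranked[len(ranked) - n_top:]


# Toy (reward, actual score) pairs, made up for illustration only.
pairs = [(0.1, 0.0), (0.4, 1.0), (0.5, 0.0), (0.6, 1.0), (0.9, 1.0)]

# Compare R^2 on all samples against R^2 on the filtered subset.
filtered = keep_extremes(pairs, n_top=2, n_bottom=1)
r2_all = r_squared([r for r, _ in pairs], [s for _, s in pairs])
r2_filtered = r_squared([r for r, _ in filtered], [s for _, s in filtered])
```

Under these assumptions, candidate filtration strategies could be compared by computing `r_squared` on the samples each one retains and preferring strategies with a higher value.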