Abstract
One way to enhance the reasoning capability of Large Language Models (LLMs) is to conduct Supervised Fine-Tuning (SFT) using Chain-of-Thought (CoT) annotations. This approach does not show sufficiently strong generalization ability, however, because the training relies only on the given CoT data. In math problem-solving, for example, there is usually only one annotated reasoning path for each question in the training data. Intuitively, it would be better for the algorithm to learn from multiple annotated reasoning paths given a question. To address this issue, we propose a simple yet effective approach called Reinforced Fine-Tuning (ReFT) to enhance the generalizability of learning LLMs for reasoning, with math problem-solving as an example. ReFT first warms up the model with SFT, and then employs on-line reinforcement learning, specifically the PPO algorithm in this paper, to further fine-tune the model, where an abundance of reasoning paths are automatically sampled given the question and the rewards are naturally derived from the ground-truth answers. Extensive experiments on the GSM8K, MathQA, and SVAMP datasets show that ReFT significantly outperforms SFT, and that the performance can potentially be further boosted by combining it with inference-time strategies such as majority voting and re-ranking. Note that ReFT obtains the improvement by learning from the same training questions as SFT, without relying on extra or augmented training questions. This indicates a superior generalization ability for ReFT.
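As a rough illustration of how a reward can be derived from a ground-truth answer, and of how majority voting selects an answer at inference time, consider the minimal sketch below. It is not the paper's implementation: the binary reward values, the `Answer:` marker convention, and the helper names `extract_answer`, `answer_reward`, and `majority_vote` are all assumptions made for this example.

```python
from collections import Counter
from typing import List, Optional


def extract_answer(reasoning_path: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker (a hypothetical convention)."""
    marker = "Answer:"
    idx = reasoning_path.rfind(marker)
    if idx == -1:
        return None  # no final answer could be extracted
    answer = reasoning_path[idx + len(marker):].strip()
    return answer or None


def answer_reward(reasoning_path: str, ground_truth: str) -> float:
    """Assumed binary reward: 1.0 if the extracted answer matches, else 0.0."""
    predicted = extract_answer(reasoning_path)
    if predicted is not None and predicted == ground_truth.strip():
        return 1.0
    return 0.0


def majority_vote(reasoning_paths: List[str]) -> Optional[str]:
    """Pick the most frequent extracted answer among the sampled paths."""
    answers = [a for a in map(extract_answer, reasoning_paths) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Several reasoning paths sampled for one question whose ground truth is "7".
sampled_paths = [
    "3 apples + 4 apples = 7 apples. Answer: 7",
    "3 * 4 = 12. Answer: 12",
    "4 + 3 = 7. Answer: 7",
]
rewards = [answer_reward(path, "7") for path in sampled_paths]
voted = majority_vote(sampled_paths)
```

In a reinforcement learning loop, rewards of this kind would score each sampled path for the policy update, so no additional annotated reasoning paths are required beyond the ground-truth answers.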