Learning to Reason under Off-Policy Guidance

Abstract

Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning (RL) with simple rule-based rewards. However, existing zero-RL approaches are inherently "on-policy", limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. We introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Notably, we propose policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Remarkably, LUFFY achieves an average gain of over +7.0 across six math benchmarks and an advantage of over +6.2 points in out-of-distribution tasks. It also substantially surpasses imitation-based supervised fine-tuning (SFT), particularly in generalization. Analysis shows that LUFFY not only imitates effectively but also explores beyond demonstrations, offering a scalable path to train generalizable reasoning models with off-policy guidance.