Entropy-guided sequence weighting for efficient exploration in RL-based LLM fine-tuning

Abstract

We introduce Entropy-Guided Sequence Weighting (EGSW), a novel approach that enhances the exploration-exploitation tradeoff by dynamically assigning weights to generated outputs based on their advantage and entropy for Reinforcement Learning-based Large Language Model fine-tuning. EGSW integrates entropy regularization with advantage-based weighting to balance policy updates, enabling efficient exploration in high-dimensional state spaces. By employing temperature-scaled softmax weighting over sequences, EGSW prioritizes high-reward, high-uncertainty steps while maintaining training stability. Although originally developed to improve Group Relative Policy Optimization (GRPO) during large language model (LLM) fine-tuning, EGSW is generalizable to other reinforcement learning (RL) algorithms and can be implemented in both step-wise and trajectory-wise settings. Empirical evaluations demonstrate that EGSW enhances GRPO reasoning ability, yielding improvements in sample efficiency. Future work will explore the application of EGSW to advanced RL methodologies.
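
To make the weighting idea concrete, the following is a minimal sketch of a temperature-scaled softmax over a group of sampled sequences. It is an illustration under stated assumptions, not the reference implementation. The abstract does not give the exact form of the score, so the additive combination of advantage and entropy is assumed here. The function name `egsw_weights` and the parameters `alpha` (entropy coefficient) and `temperature` are hypothetical.

```python
import math


def egsw_weights(advantages, entropies, alpha=0.5, temperature=1.0):
    """Return softmax weights over a group of sampled sequences.

    Assumption (not stated in the abstract): the raw score of sequence i
    is advantage_i + alpha * entropy_i, and the weights are
    softmax(score / temperature) across the group.
    """
    if not advantages or len(advantages) != len(entropies):
        raise ValueError("advantages and entropies must be non-empty and equal in length")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Combine reward signal (advantage) and uncertainty (entropy), then scale.
    scores = [(a + alpha * h) / temperature for a, h in zip(advantages, entropies)]

    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Three sampled sequences with hypothetical advantages and mean entropies.
    weights = egsw_weights([1.0, 0.0, -1.0], [0.2, 0.9, 0.2])
    print(weights)
```

In this sketch a lower `temperature` concentrates weight on the highest-scoring sequences, while a higher one flattens the distribution toward uniform weighting. This is one way the balance between sharper prioritization and training stability could be tuned. The same function could be applied per step or per trajectory, depending on how the advantages and entropies are aggregated.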