Improving Generalization in Intent Detection: GRPO with Reward-Based Curriculum Sampling

Abstract

Intent detection, a critical component of task-oriented dialogue (TOD) systems, faces significant challenges in adapting to the rapid influx of integrable tools with complex interrelationships. Existing approaches, such as zero-shot reformulations and LLM-based dynamic recognition, suffer performance degradation when encountering unseen intents, leading to erroneous task routing. To enhance the model's generalization to unseen tasks, we employ Reinforcement Learning (RL) combined with Reward-based Curriculum Sampling (RCS) during Group Relative Policy Optimization (GRPO) training for intent detection. Experiments demonstrate that RL-trained models substantially outperform supervised fine-tuning (SFT) baselines in generalization. In addition, introducing RCS significantly bolsters the effectiveness of RL in intent detection by focusing the model on challenging cases during training. Moreover, incorporating Chain-of-Thought (CoT) processes in RL notably improves generalization in complex intent detection tasks, underscoring the importance of thought in challenging scenarios. This work advances the generalization of intent detection, offering practical insights for deploying adaptable dialogue systems.
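
As a rough illustration of the two ideas named above, the sketch below computes group-relative advantages in the style commonly associated with GRPO (each sampled response's reward is normalized against the mean and standard deviation of its own group) and then applies a simple reward-based filter that keeps only low-reward training examples. The abstract does not specify how RCS selects challenging cases, so the function `select_challenging`, its `threshold` parameter, the use of the population standard deviation, and the `eps` constant are all assumptions made for illustration, not the method evaluated in this work.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each response's reward against its own sampling group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. Using the population standard deviation and `eps` is an
    assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_challenging(mean_rewards, threshold=0.5):
    """Hypothetical curriculum filter (not the paper's exact procedure).

    Returns the indices of training examples whose mean reward falls
    below `threshold`, i.e. the cases the current policy still gets wrong.
    """
    return [i for i, r in enumerate(mean_rewards) if r < threshold]


# Four responses sampled for one prompt: two correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Mean reward per training example from an earlier pass over the data.
hard_indices = select_challenging([0.9, 0.2, 0.5, 0.0])
```

In this toy setting, correct responses receive positive advantages and incorrect ones negative advantages, while a group in which every response earns the same reward yields advantages of zero and therefore contributes no learning signal. That observation is one plausible motivation for concentrating later training on examples the model has not yet mastered.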