LR$^2$Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems

Abstract

Recent progress in o1-like models has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflection capabilities, such as making assumptions, backtracking, and self-refinement. However, effectively evaluating such reflection capabilities remains challenging due to the lack of appropriate benchmarks. To bridge this gap, we introduce LR$^2$Bench, a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR$^2$Bench comprises 850 samples across six Constraint Satisfaction Problems (CSPs) where reflective reasoning is crucial for deriving solutions that meet all given constraints. Each type of task focuses on distinct constraint patterns, such as knowledge-based, logical, and spatial constraints, providing a comprehensive evaluation of diverse problem-solving scenarios. We conduct extensive evaluation on both conventional models and o1-like models. Our experimental results reveal that even the most advanced reasoning-specific models, such as DeepSeek-R1 and OpenAI o1-preview, struggle with tasks in LR$^2$Bench, achieving average Exact Match scores of only 20.0% and 23.6%, respectively. These findings underscore the significant room for improvement in the reflective reasoning capabilities of current LLMs. The leaderboard of our benchmark is available at https://huggingface.co/spaces/UltraRonin/LR2Bench