Nondeterministic Polynomial-time Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs

  • 2025-04-15 14:40:29
  • Chang Yang, Ruiyu Wang, Junzhe Jiang, Qi Jiang, Qinggang Zhang, Yanchen Deng, Shuxin Li, Shuyue Hu, Bo Li, Florian T. Pokorny, Xiao Huang, Xinrun Wang
  • 0

Abstract

Reasoning is the fundamental capability of large language models (LLMs). Dueto the rapid progress of LLMs, there are two main issues of current benchmarks:i) these benchmarks can be crushed in a short time (less than 1 year), and ii)these benchmarks may be easily hacked. To handle these issues, we propose theever-scalingness for building the benchmarks which are uncrushable, unhackable,auto-verifiable and general. This paper presents NondeterministicPolynomial-time Problem Challenge (NPPC), an ever-scaling reasoning benchmarkfor LLMs. Specifically, the NPPC has three main modules: i) npgym, whichprovides a unified interface of 25 well-known NP-complete problems and cangenerate any number of instances with any levels of complexities, ii) npsolver:which provides a unified interface to evaluate the problem instances with bothonline and offline models via APIs and local deployments, respectively, andiii) npeval: which provides the comprehensive and ready-to-use tools to analyzethe performances of LLMs over different problems, the number of tokens, the ahamoments, the reasoning errors and the solution errors. Extensive experimentsover widely-used LLMs demonstrate: i) NPPC can successfully decrease theperformances of advanced LLMs' performances to below 10%, demonstrating thatNPPC is uncrushable, ii) DeepSeek-R1, Claude-3.7-Sonnet, and o1/o3-mini are themost powerful LLMs, where DeepSeek-R1 outperforms Claude-3.7-Sonnet ando1/o3-mini in most NP-complete problems considered, and iii) the numbers oftokens, aha moments in the advanced LLMs, e.g., Claude-3.7-Sonnet andDeepSeek-R1, are observed first to increase and then decrease when the probleminstances become more and more difficult. We believe that NPPC is the firstever-scaling reasoning benchmark, serving as the uncrushable and unhackabletestbed for LLMs toward artificial general intelligence (AGI).