GAOKAO-Eval: Do high scores truly reflect strong capabilities in LLMs?

Abstract

Large Language Models (LLMs) are commonly evaluated using human-crafted benchmarks, under the premise that higher scores implicitly reflect stronger human-like performance. However, there is growing concern that LLMs may "game" these benchmarks due to data leakage, achieving high scores while struggling with tasks that are simple for humans. To substantively address the problem, we create GAOKAO-Eval, a comprehensive benchmark based on China's National College Entrance Examination (Gaokao), and conduct "closed-book" evaluations of representative models released prior to the Gaokao. Contrary to the prevailing consensus, even after addressing data leakage and comprehensiveness, GAOKAO-Eval reveals that high scores still fail to truly reflect human-aligned capabilities. To better understand this mismatch, we introduce the Rasch model from cognitive psychology to analyze LLM scoring patterns and identify two key discrepancies: 1) anomalously consistent performance across various question difficulties, and 2) high variance in performance on questions of similar difficulty. In addition, we identify inconsistent grading of LLM-generated answers among teachers, as well as recurring mistake patterns. We find that these phenomena are well-grounded in the motivations behind OpenAI o1, and that o1's reasoning-as-difficulties can mitigate the mismatch. These results show that GAOKAO-Eval can reveal limitations in LLM capabilities not captured by current benchmarks, and they highlight the need for more LLM-aligned difficulty analysis.