MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs

  • 2025-03-06 04:41:56
  • Ved Sirdeshmukh, Kaustubh Deshpande, Johannes Mols, Lifeng Jin, Ed-Yeremai Cardona, Dean Lee, Jeremy Kritz, Willow Primack, Summer Yue, Chen Xing

Abstract

We present MultiChallenge, a pioneering benchmark evaluating large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies four categories of challenges in multi-turn conversations that are not only common and realistic among current human-LLM interactions, but are also challenging to all current frontier LLMs. All 4 challenges require accurate instruction-following, context allocation, and in-context reasoning at the same time. We also develop LLM as judge with instance-level rubrics to facilitate an automatic evaluation method with fair agreement with experienced human raters. Despite achieving near-perfect scores on existing multi-turn evaluation benchmarks, all frontier models have less than 50% accuracy on MultiChallenge, with the top-performing Claude 3.5 Sonnet (June 2024) achieving just a 41.4% average accuracy.