Abstract
We present MultiChallenge, a pioneering benchmark evaluating large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies four categories of challenges in multi-turn conversations that are not only common and realistic in current human-LLM interactions, but are also challenging for all current frontier LLMs. All four challenges require accurate instruction-following, context allocation, and in-context reasoning at the same time. We also develop an LLM-as-judge approach with instance-level rubrics, enabling automatic evaluation that reaches fair agreement with experienced human raters. Despite achieving near-perfect scores on existing multi-turn evaluation benchmarks, all frontier models score below 50% accuracy on MultiChallenge, with the top-performing Claude 3.5 Sonnet (June 2024) achieving just a 41.4% average accuracy.