Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey

Abstract

This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. Using a PRISMA-inspired framework, we systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication, and establishing a solid foundation for our analysis. Our study offers a structured approach by developing two interrelated taxonomy systems: one that defines \emph{what to evaluate} and another that explains \emph{how to evaluate}. The first taxonomy identifies key components of LLM-based agents for multi-turn conversations and their evaluation dimensions, including task completion, response quality, user experience, memory and context retention, as well as planning and tool integration. These components ensure that the performance of conversational agents is assessed in a holistic and meaningful manner. The second taxonomy system focuses on the evaluation methodologies. It categorizes approaches into annotation-based evaluations, automated metrics, hybrid strategies that combine human assessments with quantitative measures, and self-judging methods utilizing LLMs. This framework not only captures traditional metrics derived from language understanding, such as BLEU and ROUGE scores, but also incorporates advanced techniques that reflect the dynamic, interactive nature of multi-turn dialogues.