A Survey on LLM-as-a-Judge

  • 2024-12-16 15:00:53
  • Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Yuanzhuo Wang, Jian Guo

Abstract

Accurate and consistent evaluation is crucial for decision-making across numerous fields, yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large Language Models (LLMs) have achieved remarkable success across diverse domains, leading to the emergence of "LLM-as-a-Judge," where LLMs are employed as evaluators for complex tasks. With their ability to process diverse data types and provide scalable, cost-effective, and consistent assessments, LLMs present a compelling alternative to traditional expert-driven evaluations. However, ensuring the reliability of LLM-as-a-Judge systems remains a significant challenge that requires careful design and standardization. This paper provides a comprehensive survey of LLM-as-a-Judge, addressing the core question: How can reliable LLM-as-a-Judge systems be built? We explore strategies to enhance reliability, including improving consistency, mitigating biases, and adapting to diverse assessment scenarios. Additionally, we propose methodologies for evaluating the reliability of LLM-as-a-Judge systems, supported by a novel benchmark designed for this purpose. To advance the development and real-world deployment of LLM-as-a-Judge systems, we also discuss practical applications, challenges, and future directions. This survey serves as a foundational reference for researchers and practitioners in this rapidly evolving field.