Abstract
The rapid advancement of Large Language Models (LLMs) has driven their expanding application across various fields. One of the most promising applications is their role as evaluators based on natural language responses, referred to as "LLMs-as-judges". This framework has attracted growing attention from both academia and industry due to its strong effectiveness, ability to generalize across tasks, and interpretability in the form of natural language. This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations. We begin by providing a systematic definition of LLMs-as-judges and introducing their functionality (Why use LLM judges?). Then we address the methodology for constructing an evaluation system with LLMs (How to use LLM judges?). Additionally, we investigate the potential domains for their application (Where to use LLM judges?) and discuss methods for evaluating them in various contexts (How to evaluate LLM judges?). Finally, we provide a detailed analysis of the limitations of LLM judges and discuss potential future directions. Through a structured and comprehensive analysis, we aim to provide insights into the development and application of LLMs-as-judges in both research and practice. We will continue to maintain the relevant resource list at https://github.com/CSHaitao/Awesome-LLMs-as-Judges.