From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

Abstract

Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). However, traditional methods, whether matching-based or embedding-based, often fall short when judging subtle attributes and fail to deliver satisfactory results. Recent advancements in Large Language Models (LLMs) have inspired the "LLM-as-a-judge" paradigm, where LLMs are leveraged to perform scoring, ranking, or selection across various tasks and applications. This paper provides a comprehensive survey of LLM-based judgment and assessment, offering an in-depth overview to advance this emerging field. We begin by giving detailed definitions from both input and output perspectives. Then we introduce a comprehensive taxonomy to explore LLM-as-a-judge from three dimensions: what to judge, how to judge, and where to judge. Finally, we compile benchmarks for evaluating LLM-as-a-judge and highlight key challenges and promising directions, aiming to provide valuable insights and inspire future research in this area. A paper list and more resources about LLM-as-a-judge can be found at \url{https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge} and \url{https://llm-as-a-judge.github.io}.