Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates

Abstract

LLM-as-a-Judge has been widely applied to evaluate and compare different LLM alignment approaches (e.g., RLHF and DPO). However, concerns regarding its reliability have emerged, due to LLM judges' biases and inconsistent decision-making. Previous research has developed evaluation frameworks to assess the reliability of LLM judges and their alignment with human preferences. However, the employed evaluation metrics often lack adequate explainability and fail to address LLM internal inconsistency. Additionally, existing studies inadequately explore the impact of various prompt templates when applying LLM-as-a-Judge methods, leading to potentially inconsistent comparisons between different alignment algorithms. In this work, we systematically evaluate LLM-as-a-Judge on alignment tasks by defining more theoretically interpretable evaluation metrics and explicitly mitigating LLM internal inconsistency from reliability metrics. We develop an open-source framework to evaluate, compare, and visualize the reliability and alignment of LLM judges, which helps practitioners choose LLM judges for alignment tasks. In the experiments, we examine the effects of diverse prompt templates on LLM-judge reliability and also demonstrate our framework by comparing various LLM judges on two common alignment datasets (i.e., TL;DR Summarization and HH-RLHF-Helpfulness). Our results indicate a significant impact of prompt templates on LLM judge performance, as well as a mediocre level of alignment between the tested LLM judges and human evaluators.