LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA

Abstract

Extractive reading comprehension question answering (QA) datasets are typically evaluated using Exact Match (EM) and F1-score, but these metrics often fail to fully capture model performance. Following their success, large language models (LLMs) have been employed in various tasks, including serving as judges (LLM-as-a-judge). In this paper, we reassess the performance of QA models using LLM-as-a-judge across four reading comprehension QA datasets. We examine different families of LLMs and various answer types to evaluate the effectiveness of LLM-as-a-judge in these tasks. Our results show that LLM-as-a-judge is highly correlated with human judgments and can replace traditional EM/F1 metrics. Using LLM-as-a-judge improves the correlation with human judgments significantly, from 0.22 (EM) and 0.40 (F1-score) to 0.85. These findings confirm that EM and F1 metrics underestimate the true performance of the QA models. While LLM-as-a-judge is not perfect for more difficult answer types (e.g., job), it still outperforms EM/F1, and we observe no bias issues, such as self-preference, when the same model is used for both the QA and judgment tasks.
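
To make the limitation of EM and F1 concrete, the following is a minimal sketch of the two metrics, assuming the commonly used SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles). The function names and the example strings are hypothetical illustrations and are not taken from the paper's evaluation code or datasets.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the paper may use a variant.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical examples: a correct answer with an extra word,
    # and a paraphrased job title with no token overlap.
    for pred, ref in [
        ("President Barack Obama", "Barack Obama"),
        ("programmer", "software engineer"),
    ]:
        print(pred, "|", ref, "|", exact_match(pred, ref), f1_score(pred, ref))
```

In the first hypothetical pair, the prediction would likely be accepted by a human reader, yet EM scores it 0 and F1 only partially credits it. In the second, a paraphrase shares no tokens with the reference, so both metrics score 0 regardless of whether the answer is semantically acceptable. Cases of this kind are where a string-overlap metric and a human (or LLM) judge can disagree.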