Abstract
Large Language Models (LLMs) are increasingly used to automate relevance judgments for information retrieval (IR) tasks, often demonstrating agreement with human labels that approaches inter-human agreement. To assess the robustness and reliability of LLM-based relevance judgments, we systematically investigate the impact of prompt sensitivity on the task. We collected prompts for relevance assessment from 15 human experts and 15 LLMs across three tasks (binary, graded, and pairwise), yielding 90 prompts in total. After filtering out unusable prompts from three humans and three LLMs, we employed the remaining 72 prompts with three different LLMs as judges to label document/query pairs from two TREC Deep Learning datasets (2020 and 2021). We compare LLM-generated labels with the official TREC human labels using Cohen's $\kappa$ and pairwise agreement measures. In addition to investigating the impact of prompt variations on agreement with human labels, we compare human- and LLM-generated prompts and analyze differences among the LLMs used as judges. We also compare human- and LLM-generated prompts with the standard UMBRELA prompt used for relevance assessment by Bing and the TREC 2024 Retrieval Augmented Generation (RAG) Track. To support future research in LLM-based evaluation, we release all data and prompts at https://github.com/Narabzad/prompt-sensitivity-relevance-judgements/.