SePer: Measure Retrieval Utility Through The Lens Of Semantic Perplexity Reduction

Abstract

Large Language Models (LLMs) have demonstrated improved generation performance by incorporating externally retrieved knowledge, a process known as retrieval-augmented generation (RAG). Despite the potential of this approach, existing studies evaluate RAG effectiveness by 1) assessing retrieval and generation components jointly, which obscures retrieval's distinct contribution, or 2) examining retrievers using traditional metrics such as NDCG, which creates a gap in understanding retrieval's true utility in the overall generation process. To address the above limitations, in this work, we introduce an automatic evaluation method that measures retrieval quality through the lens of information gain within the RAG framework. Specifically, we propose Semantic Perplexity (SePer), a metric that captures the LLM's internal belief about the correctness of the retrieved information. We quantify the utility of retrieval by the extent to which it reduces semantic perplexity post-retrieval. Extensive experiments demonstrate that SePer not only aligns closely with human preferences but also offers a more precise and efficient evaluation of retrieval utility across diverse RAG scenarios.