Abstract
Large Language Models (LLMs) have significantly enhanced the capabilities of information access systems, especially with retrieval-augmented generation (RAG). Nevertheless, the evaluation of RAG systems remains a barrier to continued progress, a challenge we tackle in this work by proposing an automatic evaluation framework that is validated against human annotations. We believe that the nugget evaluation methodology provides a solid foundation for evaluating RAG systems. This approach, originally developed for the TREC Question Answering (QA) Track in 2003, evaluates systems based on atomic facts that should be present in good answers. Our efforts focus on "refactoring" this methodology, where we describe the AutoNuggetizer framework that specifically applies LLMs to both automatically create nuggets and automatically assign nuggets to system answers. In the context of the TREC 2024 RAG Track, we calibrate a fully automatic approach against strategies where nuggets are created manually or semi-manually by human assessors and then assigned manually to system answers. Based on results from a community-wide evaluation, we observe strong agreement at the run level between scores derived from fully automatic nugget evaluation and human-based variants. The agreement is stronger when individual framework components such as nugget assignment are automated independently. This suggests that our evaluation framework provides tradeoffs between effort and quality that can be used to guide the development of future RAG systems. However, further research is necessary to refine our approach, particularly in establishing robust per-topic agreement to diagnose system failures effectively.
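To make the idea of nugget scoring and run-level agreement concrete, the following is a minimal sketch rather than the AutoNuggetizer implementation. It assumes that each nugget assignment is a binary judgment, that a run's score is the mean per-topic fraction of nuggets found, and that agreement between two evaluation variants is measured by Kendall's tau over run scores. None of these choices are specified in the abstract, and all function names are hypothetical.

```python
from itertools import combinations


def nugget_recall(assigned: list[bool]) -> float:
    """Fraction of one topic's nuggets judged present in a system answer.

    Assumption: each assignment is binary (present / not present).
    """
    if not assigned:
        return 0.0
    return sum(assigned) / len(assigned)


def run_score(per_topic: list[list[bool]]) -> float:
    """Mean nugget recall over all topics for a single run (assumed aggregation)."""
    return sum(nugget_recall(topic) for topic in per_topic) / len(per_topic)


def kendall_tau(x: list[float], y: list[float]) -> float:
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs.

    Used here as an assumed measure of run-level agreement between two
    evaluation variants that score the same runs in the same order.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for three runs under two evaluation variants.
automatic_scores = [0.42, 0.55, 0.61]
manual_scores = [0.38, 0.50, 0.66]
agreement = kendall_tau(automatic_scores, manual_scores)
```

With the illustrative values above, both variants rank the three runs identically, so the rank correlation is at its maximum. Per-topic agreement would apply the same comparison to individual topic scores instead of run-level means.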