Towards Lighter and Robust Evaluation for Retrieval Augmented Generation

Abstract

Large Language Models are prompting us to view more NLP tasks from a generative perspective. At the same time, they offer a new way of accessing information, mainly through the RAG framework. While there have been notable improvements for the autoregressive models, overcoming hallucination in the generated answers remains a persistent problem. A standard solution is to use commercial LLMs, such as GPT4, to evaluate these algorithms. However, such frameworks are expensive and not very transparent. Therefore, we propose a study which demonstrates the value of open-weight models for evaluating RAG hallucination. We develop a lightweight approach using smaller, quantized LLMs to provide an accessible and interpretable metric that gives continuous scores for the generated answers with respect to their correctness and faithfulness. This score allows us to question the reliability of decisions and to explore thresholds, leading to a new AUC metric as an alternative to correlation with human judgment.
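
To illustrate how a continuous score can be turned into a threshold-free AUC, the following is a minimal sketch, not the paper's implementation. It assumes the AUC is the standard area under the ROC curve, computed between the evaluator's continuous scores and binary human labels (1 = judged correct or faithful, 0 = not). The abstract does not specify this, so the definition, the function names, and the data below are all hypothetical.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one.
    Ties count as half a win. Assumes binary labels in {0, 1}."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy_at(scores: Sequence[float], labels: Sequence[int],
                threshold: float) -> float:
    """Accuracy of the binary decision obtained by thresholding the score."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


# Hypothetical evaluator scores and human labels, for illustration only.
scores = [0.92, 0.81, 0.40, 0.50, 0.15, 0.65]
labels = [1, 1, 0, 1, 0, 0]

print(f"AUC = {roc_auc(scores, labels):.3f}")
for t in (0.3, 0.5, 0.7):
    print(f"threshold {t:.1f}: accuracy = {accuracy_at(scores, labels, t):.3f}")
```

Under these assumptions, the AUC summarizes how well the score separates good answers from bad ones across all thresholds, whereas accuracy depends on the single threshold chosen.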