The HalluRAG Dataset: Detecting Closed-Domain Hallucinations in RAG Applications Using an LLM's Internal States

Abstract

Detecting hallucinations in large language models (LLMs) is critical for enhancing their reliability and trustworthiness. Most research focuses on hallucinations as deviations from information seen during training. However, the opaque nature of an LLM's parametric knowledge makes it hard to understand why generated texts appear ungrounded: the LLM might not have picked up the necessary knowledge from large and often inaccessible datasets, or the information might have been changed or contradicted during further training. Our focus is on hallucinations involving information not used in training, which we determine by using recency to ensure the information emerged after a cut-off date. This study investigates these hallucinations by detecting them at the sentence level using different internal states of various LLMs. We present HalluRAG, a dataset designed to train classifiers on these hallucinations. Depending on the model and quantization, MLPs trained on HalluRAG detect hallucinations with test accuracies of up to 75%, with Mistral-7B-Instruct-v0.1 achieving the highest test accuracies. Our results show that IAVs detect hallucinations as effectively as CEVs and reveal that answerable and unanswerable prompts are encoded differently, as separate classifiers for these categories improved accuracy. However, HalluRAG showed limited generalizability, which argues for more diversity in hallucination datasets.
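
The recency criterion mentioned above can be illustrated with a minimal sketch. It assumes each candidate passage carries a date on which its information first appeared. The `Passage` schema, the `emerged_after_cutoff` helper, and the cut-off date are hypothetical and chosen only for illustration; they are not taken from the HalluRAG pipeline.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Passage:
    """A candidate context passage (hypothetical schema, not from the paper)."""
    text: str
    created: date  # date on which the information first appeared


def emerged_after_cutoff(passages, cutoff):
    """Keep only passages whose information appeared strictly after the cutoff."""
    return [p for p in passages if p.created > cutoff]


# Assumed cut-off date, used here purely as an example value.
CUTOFF = date(2024, 1, 1)

passages = [
    Passage("Fact known before the cut-off.", date(2023, 6, 15)),
    Passage("Fact published on the cut-off day.", date(2024, 1, 1)),
    Passage("Fact that emerged after the cut-off.", date(2024, 3, 2)),
]

# Only the last passage is strictly newer than the cut-off.
recent = emerged_after_cutoff(passages, CUTOFF)
print([p.text for p in recent])
```

The comparison is strict, so a passage dated exactly on the cut-off day is excluded. This is a conservative choice in the sketch: it avoids counting information the model could plausibly have seen during training.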