Abstract
As a typical and practical application of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) techniques have gained extensive attention, particularly in vertical domains where LLMs may lack domain-specific knowledge. In this paper, we introduce OmniEval, an omnidirectional and automatic RAG benchmark for the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework, which comprises (1) a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios; (2) a multi-dimensional evaluation data generation approach that combines GPT-4-based automatic generation with human annotation, achieving an 87.47\% acceptance ratio in human evaluations of generated instances; (3) a multi-stage evaluation system that evaluates both retrieval and generation performance, resulting in a comprehensive evaluation of the RAG pipeline; and (4) robust evaluation metrics derived from rule-based and LLM-based ones, with the reliability of assessments enhanced through manual annotations and supervised fine-tuning of an LLM evaluator. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets, and highlight the performance variations of RAG systems across diverse topics and tasks, revealing significant opportunities for RAG models to improve their capabilities in vertical domains. We open source the code of our benchmark at \href{https://github.com/RUC-NLPIR/OmniEval}{https://github.com/RUC-NLPIR/OmniEval}.