Abstract
Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in terms of diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, such as academic papers, textbooks, and slides. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies. The code and dataset are available at https://github.com/opendatalab/OmniDocBench.