OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

  • 2024-12-10 16:05:56
  • Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He

Abstract

Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in terms of diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, such as academic papers, textbooks, and slides. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies. The code and dataset are available at https://github.com/opendatalab/OmniDocBench.