XFormParser: A Simple and Effective Multimodal Multilingual Semi-structured Form Parser

Abstract

In the domain of Document AI, parsing semi-structured form images is a crucial Key Information Extraction (KIE) task. The advent of pre-trained multimodal models has significantly empowered Document AI frameworks to extract key information from form documents in different formats such as PDF, Word, and images. Nonetheless, form parsing is still encumbered by notable challenges, such as subpar capabilities in multilingual parsing and diminished recall in industrial contexts with rich text and rich visuals. In this work, we introduce a simple but effective \textbf{M}ultimodal and \textbf{M}ultilingual semi-structured \textbf{FORM} \textbf{PARSER} (\textbf{XFormParser}), which is anchored on a comprehensive Transformer-based pre-trained language model and integrates semantic entity recognition (SER) and relation extraction (RE) into a unified framework. Combining this framework with a Bi-LSTM significantly improves multilingual parsing performance. Furthermore, we develop InDFormSFT, a supervised fine-tuning (SFT) industrial dataset that specifically addresses the parsing needs of forms in various industrial contexts. XFormParser demonstrates its effectiveness and robustness through rigorous testing on established benchmarks. Compared to existing state-of-the-art (SOTA) models, XFormParser achieves up to a 1.79\% F1 score improvement on RE tasks in language-specific settings. It also exhibits cross-task performance improvements in multilingual and zero-shot settings. The code, datasets, and pre-trained models are publicly available at https://github.com/zhbuaa0/xformparser.