Grounded Chain-of-Thought for Multimodal Large Language Models

Abstract

Despite great progress, existing multimodal large language models (MLLMs) are prone to visual hallucination, which greatly impedes their trustworthy application. In this paper, we study this problem from the perspective of visual-spatial reasoning and propose a new learning task for MLLMs, termed Grounded Chain-of-Thought (GCoT). Unlike recent visual CoT studies, which focus more on visual knowledge reasoning, GCoT aims to help MLLMs recognize and ground the relevant visual cues step by step, thereby predicting the correct answer with grounding coordinates as the intuitive basis. To facilitate this task, we also carefully design and construct a dataset called multimodal grounded chain-of-thought (MM-GCoT), consisting of 24,022 GCoT examples for 5,033 images. In addition, we introduce a comprehensive consistency evaluation system, which includes the metrics of answer accuracy, grounding accuracy, and answer-grounding consistency. We further design and conduct a series of experiments on 12 advanced MLLMs and reveal some notable findings: (i) most MLLMs perform poorly on the consistency evaluation, indicating obvious visual hallucination; (ii) visual hallucination is not directly related to parameter size or general multimodal performance, i.e., a larger and stronger MLLM is not less affected by this issue. Lastly, we demonstrate that the proposed dataset can help existing MLLMs effectively cultivate their GCoT capability and significantly reduce inconsistent answering. Moreover, their GCoT capability also generalizes to existing multimodal tasks, such as open-world QA and referring expression comprehension (REC).
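To make the three evaluation metrics concrete, the following is a minimal illustrative sketch, not the paper's evaluation code. The abstract does not give the exact definitions, so everything here is an assumption: answers are compared by case-insensitive exact match, a grounding counts as correct when its box IoU with the reference box is at least 0.5 (the threshold commonly used in REC), and a sample counts as consistent when its answer and its grounding are either both correct or both wrong. The names `Sample`, `iou`, and `evaluate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Sample:
    """Hypothetical record for one evaluated question."""
    pred_answer: str
    gold_answer: str
    pred_box: Box
    gold_box: Box


def evaluate(samples: List[Sample], iou_threshold: float = 0.5) -> Dict[str, float]:
    """Compute assumed versions of the three metrics over a list of samples."""
    if not samples:
        raise ValueError("samples must be non-empty")
    # Assumption: an answer is correct on case-insensitive exact match.
    answer_ok = [
        s.pred_answer.strip().lower() == s.gold_answer.strip().lower()
        for s in samples
    ]
    # Assumption: a grounding is correct when IoU >= iou_threshold.
    ground_ok = [iou(s.pred_box, s.gold_box) >= iou_threshold for s in samples]
    # Assumption: consistent means answer and grounding agree in correctness.
    consistent = [a == g for a, g in zip(answer_ok, ground_ok)]
    n = len(samples)
    return {
        "answer_accuracy": sum(answer_ok) / n,
        "grounding_accuracy": sum(ground_ok) / n,
        "consistency": sum(consistent) / n,
    }
```

Under these assumed definitions, a model that answers correctly while grounding the wrong region lowers the consistency score even though its answer accuracy is unaffected, which is the kind of mismatch the consistency evaluation is meant to expose.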