Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Abstract

By extending the human-like, step-by-step advantages of chain-of-thought (CoT) reasoning to multimodal contexts, multimodal CoT (MCoT) reasoning has recently garnered significant research attention, especially in its integration with multimodal large language models (MLLMs). Existing MCoT studies design various methodologies and innovative reasoning paradigms to address the unique challenges of different modalities, including image, video, speech, audio, 3D, and structured data, achieving extensive success in applications such as robotics, healthcare, autonomous driving, and multimodal generation. However, MCoT still presents distinct challenges and opportunities that require further attention for the field to keep advancing, and an up-to-date review of this domain is lacking. To bridge this gap, we present the first systematic survey of MCoT reasoning, elucidating the relevant foundational concepts and definitions. We offer a comprehensive taxonomy and an in-depth analysis of current methodologies from diverse perspectives across various application scenarios. Furthermore, we provide insights into existing challenges and future research directions, aiming to foster innovation toward multimodal AGI.