Abstract
In this paper, we address the challenging task of multimodal mathematical reasoning by incorporating the ability of ``slow thinking'' into multimodal large language models (MLLMs). In contrast to existing methods that rely on direct or fast thinking, our key idea is to construct long chains of thought (CoT) consisting of atomic actions in a step-by-step manner, guiding MLLMs to perform complex reasoning. To this end, we design a novel AtomThink framework composed of three key modules: (i) a CoT annotation engine that automatically generates high-quality CoT annotations to address the lack of high-quality visual mathematical data; (ii) an atomic step fine-tuning strategy that jointly optimizes an MLLM and a policy reward model (PRM) for step-wise reasoning; and (iii) four different search strategies that can be applied with the PRM to complete reasoning. Additionally, we propose AtomMATH, a large-scale multimodal dataset of long CoTs, and an atomic capability evaluation metric for mathematical tasks. Extensive experimental results show that the proposed AtomThink significantly improves the performance of baseline MLLMs, achieving approximately 50\% relative accuracy gains on MathVista and 120\% on MathVerse. To support the advancement of multimodal slow-thinking models, we will make our code and dataset publicly available at https://github.com/Quinn777/AtomThink.