MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation

Abstract

Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. While recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation, adapting these capabilities to 3D scenes remains underexplored. In this paper, we introduce MLLM-For3D, a simple yet effective framework that transfers knowledge from 2D MLLMs to 3D scene understanding. Specifically, we utilize MLLMs to generate multi-view pseudo segmentation masks and corresponding text embeddings, then unproject 2D masks into 3D space and align them with the text embeddings. The primary challenge lies in the absence of 3D context and spatial consistency across multiple views, causing the model to hallucinate objects that do not exist and fail to target objects consistently. Training the 3D model with such irrelevant objects leads to performance degradation. To address this, we introduce a spatial consistency strategy to enforce that segmentation masks remain coherent in the 3D space, effectively capturing the geometry of the scene. Moreover, we develop a Token-for-Query approach for multimodal semantic alignment, enabling consistent identification of the same object across different views. Extensive evaluations on various challenging indoor scene benchmarks demonstrate that, even without any labeled 3D training data, MLLM-For3D outperforms existing 3D reasoning segmentation methods, effectively interpreting user intent, understanding 3D scenes, and reasoning about spatial relationships.
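
The step of unprojecting 2D masks into 3D space can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a pinhole camera with a known intrinsic matrix `K`, a per-pixel depth map aligned with the mask, and a camera-to-world pose, none of which are specified in the abstract. The function name `unproject_mask` and all parameter names are hypothetical.

```python
import numpy as np

def unproject_mask(mask, depth, K, cam_to_world):
    """Lift masked pixels into world-space 3D points.

    Assumptions (not from the paper): pinhole intrinsics K (3x3),
    metric depth per pixel, and a 4x4 camera-to-world pose.
    Returns an (N, 3) array of points.
    """
    # Keep only masked pixels that have a valid (positive) depth.
    valid = mask.astype(bool) & (depth > 0)
    v, u = np.nonzero(valid)          # v: row index, u: column index
    z = depth[v, u]

    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # Back-project through the pinhole model into the camera frame.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    # Homogeneous coordinates, then transform into the world frame.
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    pts_world = pts_cam @ cam_to_world.T
    return pts_world[:, :3]
```

Applying such a function to the pseudo mask from each view yields one point set per view in a shared world frame, which is the setting in which cross-view consistency of the masks can be checked.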