Abstract
Recent advancements in image editing have utilized large-scale multimodal models to enable intuitive, natural instruction-driven interactions. However, conventional methods still face significant challenges, particularly in spatial reasoning, precise region segmentation, and maintaining semantic consistency, especially in complex scenes. To overcome these challenges, we introduce SmartFreeEdit, a novel end-to-end framework that integrates a multimodal large language model (MLLM) with a hypergraph-enhanced inpainting architecture, enabling precise, mask-free image editing guided exclusively by natural language instructions. The key innovations of SmartFreeEdit include: (1) the introduction of region-aware tokens and a mask embedding paradigm that enhance the spatial understanding of complex scenes; (2) a reasoning segmentation pipeline designed to optimize the generation of editing masks based on natural language instructions; and (3) a hypergraph-augmented inpainting module that ensures the preservation of both structural integrity and semantic coherence during complex edits, overcoming the limitations of locally focused image generation. Extensive experiments on the Reason-Edit benchmark demonstrate that SmartFreeEdit surpasses current state-of-the-art methods across multiple evaluation metrics, including segmentation accuracy, instruction adherence, and visual quality preservation, while addressing the issue of local information focus and improving global consistency in the edited image. Our project will be available at https://github.com/smileformylove/SmartFreeEdit.