OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning

Abstract

Image Difference Captioning (IDC) aims to generate natural language descriptions of subtle differences between image pairs, requiring both precise visual change localization and coherent semantic expression. Despite recent advancements, existing datasets often lack breadth and depth, limiting their applicability in complex and dynamic environments: (1) from a breadth perspective, current datasets are constrained to limited variations of objects in specific scenes, and (2) from a depth perspective, prior benchmarks often provide overly simplistic descriptions. To address these challenges, we introduce OmniDiff, a comprehensive dataset comprising 324 diverse scenarios, spanning real-world complex environments and 3D synthetic settings, with fine-grained human annotations averaging 60 words in length and covering 12 distinct change types. Building on this foundation, we propose M$^3$Diff, a MultiModal large language model enhanced by a plug-and-play Multi-scale Differential Perception (MDP) module. This module improves the model's ability to accurately identify and describe inter-image differences while maintaining the foundational model's generalization capabilities. With the addition of the OmniDiff dataset, M$^3$Diff achieves state-of-the-art performance across multiple benchmarks, including Spot-the-Diff, IEdit, CLEVR-Change, CLEVR-DC, and OmniDiff, demonstrating significant improvements in cross-scenario difference recognition accuracy compared to existing methods. The dataset, code, and models will be made publicly available to support further research.