Abstract
Image fusion integrates complementary information from multi-source images to generate more informative results. Recently, the diffusion model, which demonstrates unprecedented generative potential, has been explored in image fusion. However, these approaches typically incorporate predefined multimodal guidance into diffusion, failing to capture the dynamically changing significance of each modality, and they lack theoretical guarantees. To address this issue, we reveal a significant spatio-temporal imbalance in image denoising; specifically, the diffusion model produces dynamic information gains in different image regions across denoising steps. Based on this observation, we Dig into the Diffusion Information Gains (Dig2DIG) and theoretically derive a diffusion-based dynamic image fusion framework that provably reduces the upper bound of the generalization error. Accordingly, we introduce diffusion information gains (DIG) to quantify the information contribution of each modality at different denoising steps, thereby providing dynamic guidance during the fusion process. Extensive experiments on multiple fusion scenarios confirm that our method outperforms existing diffusion-based approaches in terms of both fusion quality and inference efficiency.