Paint by Inpaint: Learning to Add Image Objects by Removing Them First

Abstract

Image editing has advanced significantly with the introduction of text-conditioned diffusion models. Despite this progress, seamlessly adding objects to images based on textual instructions, without requiring user-provided input masks, remains a challenge. We address this by leveraging the insight that removing objects (Inpaint) is significantly simpler than its inverse process of adding them (Paint), because inpainting models benefit from segmentation mask guidance. Capitalizing on this realization, we implement an automated and extensive pipeline to curate a filtered large-scale image dataset containing pairs of images and their corresponding object-removed versions. Using these pairs, we train a diffusion model to invert the inpainting process, effectively adding objects into images. Unlike other editing datasets, ours features natural target images instead of synthetic ones while ensuring source-target consistency by construction. Additionally, we utilize a large Vision-Language Model to provide detailed descriptions of the removed objects and a Large Language Model to convert these descriptions into diverse, natural-language instructions. Our quantitative and qualitative results show that the trained model surpasses existing models in both object addition and general editing tasks. Visit our project page for the released dataset and trained models: https://rotsteinnoam.github.io/Paint-by-Inpaint.
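
The pair-construction idea can be illustrated with a minimal sketch. This is not the authors' pipeline: the names `EditPair`, `mean_fill_inpaint`, and `build_pair` are hypothetical, the inpainting model is replaced by a toy mean-fill function, and the instruction is a fixed template rather than the output of a Vision-Language Model and a Large Language Model. It shows only the direction reversal: the object-removed image becomes the source and the original natural image becomes the target.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[float]]  # grayscale image as rows of pixel values
Mask = List[List[bool]]    # True where the object to remove is located


@dataclass
class EditPair:
    source: Image     # object-removed image (input to the addition model)
    target: Image     # original natural image (desired output)
    instruction: str  # text instruction describing the object to add


def mean_fill_inpaint(image: Image, mask: Mask) -> Image:
    """Toy stand-in (assumption) for a mask-guided inpainting model.

    Fills every masked pixel with the mean of the unmasked pixels.
    """
    background = [
        pixel
        for row, mask_row in zip(image, mask)
        for pixel, masked in zip(row, mask_row)
        if not masked
    ]
    fill = sum(background) / len(background) if background else 0.0
    return [
        [fill if masked else pixel for pixel, masked in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


def build_pair(
    image: Image,
    mask: Mask,
    object_description: str,
    inpaint: Callable[[Image, Mask], Image] = mean_fill_inpaint,
) -> EditPair:
    """Remove the masked object, then reverse the direction of the edit."""
    removed = inpaint(image, mask)
    # Hypothetical template; the paper generates instructions with an LLM.
    instruction = f"add {object_description}"
    return EditPair(source=removed, target=image, instruction=instruction)


image = [[1.0, 1.0], [1.0, 9.0]]
mask = [[False, False], [False, True]]
pair = build_pair(image, mask, "a bright spot")
```

Because the target is the untouched original image, source-target consistency outside the masked region holds by construction, which is the property the abstract highlights.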