Abstract
We present a novel approach to training specialized instruction-based image-editing diffusion models, addressing key challenges in structural preservation of input images and semantic alignment with user prompts. We introduce an online reinforcement learning framework that aligns the diffusion model with human preferences without relying on extensive human annotations or curating a large dataset. Our method significantly improves realism and alignment with instructions in two ways. First, the proposed models achieve precise and structurally coherent modifications in complex scenes while maintaining high fidelity in instruction-irrelevant areas. Second, they capture fine nuances in the desired edit by leveraging a visual prompt, enabling detailed control over visual edits without lengthy textual prompts. This approach simplifies users' efforts to achieve highly specific edits, requiring only 5 reference images depicting a certain concept for training. Experimental results demonstrate that our models can perform intricate edits in complex scenes after just 10 training steps. Finally, we showcase the versatility of our method by applying it to robotics, where enhancing the visual realism of simulated environments through targeted sim-to-real image edits improves their utility as proxies for real-world settings.