Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

Abstract

Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored feature of flow matching is that, unlike diffusion models, it does not require the source distribution to be noise. Hence, in this paper, we propose a paradigm shift, and ask whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and the conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable classifier-free guidance. Surprisingly, for text-to-image, CrossFlow with a vanilla transformer without cross attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic which results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state-of-the-art for various cross-modal / intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.
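As a rough illustration of the core idea, the sketch below builds one flow matching training pair in which the source sample is a latent drawn from a variational encoder of the text rather than Gaussian noise. It is a minimal sketch under stated assumptions, not the CrossFlow implementation: the linear interpolation path, the toy linear encoder, and all names (`variational_encode`, `flow_matching_pair`, `flow_matching_loss`, `w_mu`, `w_logvar`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def variational_encode(features, w_mu, w_logvar, rng):
    """Toy variational encoder (assumed linear): sample a latent per input."""
    mu = features @ w_mu
    logvar = features @ w_logvar
    eps = rng.standard_normal(mu.shape)
    # Reparameterised sample: mean plus scaled standard-normal noise.
    return mu + np.exp(0.5 * logvar) * eps


def flow_matching_pair(z_src, z_tgt, t):
    """Point on an assumed linear path from source to target, and its velocity."""
    t = t[:, None]  # broadcast one time value per batch element
    z_t = (1.0 - t) * z_src + t * z_tgt
    v_target = z_tgt - z_src  # constant velocity along a straight path
    return z_t, v_target


def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((v_pred - v_target) ** 2))


rng = np.random.default_rng(0)
batch, d_text, d_latent = 4, 8, 6

# Stand-ins for text features and image latents; both are random here.
text_features = rng.standard_normal((batch, d_text))
image_latents = rng.standard_normal((batch, d_latent))
w_mu = rng.standard_normal((d_text, d_latent)) * 0.1
w_logvar = rng.standard_normal((d_text, d_latent)) * 0.1

# The source of the flow is the encoded text, not Gaussian noise.
z_src = variational_encode(text_features, w_mu, w_logvar, rng)
t = rng.uniform(0.0, 1.0, size=batch)
z_t, v_target = flow_matching_pair(z_src, image_latents, t)
```

In a training loop, a velocity network would take `z_t` and `t` and be optimised with `flow_matching_loss` against `v_target`. Note that the source and target latents must share the same shape, and that no separate text-conditioning input appears anywhere in the pair.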