Causal Diffusion Transformers for Generative Modeling

  • 2024-12-16 18:59:29
  • Chaorui Deng, Deyao Zhu, Kunchang Li, Shi Guang, Haoqi Fan

Abstract

We introduce Causal Diffusion as the autoregressive (AR) counterpart of Diffusion models. It is a next-token(s) forecasting framework that is friendly to both discrete and continuous modalities and compatible with existing next-token prediction models like LLaMA and GPT. While recent works attempt to combine diffusion with AR models, we show that introducing sequential factorization to a diffusion model can substantially improve its performance and enable a smooth transition between AR and diffusion generation modes. Hence, we propose CausalFusion, a decoder-only transformer that dual-factorizes data across sequential tokens and diffusion noise levels, leading to state-of-the-art results on the ImageNet generation benchmark while also enjoying the AR advantage of generating an arbitrary number of tokens for in-context reasoning. We further demonstrate CausalFusion's multimodal capabilities through a joint image generation and captioning model, and showcase CausalFusion's ability for zero-shot in-context image manipulations. We hope that this work could provide the community with a fresh perspective on training multimodal models over discrete and continuous data.