Abstract
We introduce Causal Diffusion as the autoregressive (AR) counterpart of Diffusion models. It is a next-token(s) forecasting framework that is friendly to both discrete and continuous modalities and compatible with existing next-token prediction models like LLaMA and GPT. While recent works attempt to combine diffusion with AR models, we show that introducing sequential factorization to a diffusion model can substantially improve its performance and enables a smooth transition between AR and diffusion generation modes. Hence, we propose CausalFusion - a decoder-only transformer that dual-factorizes data across sequential tokens and diffusion noise levels, leading to state-of-the-art results on the ImageNet generation benchmark while also enjoying the AR advantage of generating an arbitrary number of tokens for in-context reasoning. We further demonstrate CausalFusion's multimodal capabilities through a joint image generation and captioning model, and showcase CausalFusion's ability for zero-shot in-context image manipulations. We hope that this work could provide the community with a fresh perspective on training multimodal models over discrete and continuous data.