LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation

Abstract

We present LlamaFusion, a framework for empowering pretrained text-only large language models (LLMs) with multimodal generative capabilities, enabling them to understand and generate both text and images in arbitrary sequences. LlamaFusion reuses the existing Llama-3 weights to process text autoregressively and introduces additional, parallel transformer modules to process images with diffusion. During training, the data from each modality is routed to its dedicated modules: modality-specific feedforward layers, query-key-value projections, and normalization layers process each modality independently, while the shared self-attention layers allow interactions across text and image features. By freezing the text-specific modules and training only the image-specific modules, LlamaFusion preserves the language capabilities of text-only LLMs while developing strong visual understanding and generation abilities. Compared to methods that pretrain multimodal generative models from scratch, our experiments demonstrate that LlamaFusion improves image understanding by 20% and image generation by 3.6% using only 50% of the FLOPs, while maintaining Llama-3's language capabilities. We also demonstrate that this framework can equip existing vision-language models with multimodal generation ability. Overall, this framework not only leverages existing computational investments in text-only LLMs but also enables the parallel development of language and vision capabilities, presenting a promising direction for efficient multimodal model development.
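
The routing described above can be illustrated with a minimal NumPy sketch of a single layer. It is not the authors' implementation, and several details are assumptions made for the example: the names `init_modules` and `fused_layer`, the weight layout, the ReLU feedforward block, the modality-specific output projection `wo`, and the plain causal mask over the whole sequence. The diffusion objective and the training loop are omitted; `text` stands in for the frozen pretrained weights and `image` for the newly trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hidden size, illustrative only


def init_modules(d):
    """Create one set of modality-specific weights (hypothetical layout)."""
    s = 1.0 / np.sqrt(d)
    return {
        "attn_gain": np.ones(d),
        "ffn_gain": np.ones(d),
        "wqkv": rng.normal(0.0, s, (d, 3 * d)),
        "wo": rng.normal(0.0, s, (d, d)),  # assumption: also per modality
        "w1": rng.normal(0.0, s, (d, 4 * d)),
        "w2": rng.normal(0.0, s / 2, (4 * d, d)),
    }


def rms_norm(x, gain):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + 1e-6) * gain


def fused_layer(x, is_image, text, image):
    """x: (seq, D) hidden states; is_image: (seq,) bool mask of image positions."""

    def routed(h, fn):
        # For clarity both branches run on every position and np.where selects
        # per position; a real implementation would gather each modality's
        # positions and run only the matching module on them.
        return np.where(is_image[:, None], fn(h, image), fn(h, text))

    # Modality-specific normalization and query-key-value projection.
    qkv = routed(x, lambda h, p: rms_norm(h, p["attn_gain"]) @ p["wqkv"])
    q, k, v = np.split(qkv, 3, axis=-1)

    # Shared self-attention over the joint text and image sequence.
    # Assumption: a simple causal mask, so no position attends to later ones.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    x = x + routed(weights @ v, lambda h, p: h @ p["wo"])

    # Modality-specific normalization and feedforward block.
    def ffn(h, p):
        return np.maximum(rms_norm(h, p["ffn_gain"]) @ p["w1"], 0.0) @ p["w2"]

    return x + routed(x, ffn)


text = init_modules(D)   # stands in for the frozen pretrained text weights
image = init_modules(D)  # the only weights that would be trained
x = rng.normal(size=(6, D))
is_image = np.array([False, False, True, True, True, False])
out = fused_layer(x, is_image, text, image)
```

In this sketch, a sequence with no image positions never touches the `image` weights, so its output depends only on `text`. That is the property behind the claim that freezing the text-specific modules leaves text-only behavior unchanged.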