Liquid: Language Models are Scalable Multi-modal Generators

Abstract

We present Liquid, an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation by tokenizing images into discrete codes and learning these code embeddings alongside text tokens within a shared feature space for both vision and language. Unlike previous multimodal large language models (MLLMs), Liquid achieves this integration using a single large language model (LLM), eliminating the need for external pretrained visual embeddings such as CLIP. For the first time, Liquid uncovers a scaling law: the performance drop unavoidably caused by unified training of visual and language tasks diminishes as model size increases. Furthermore, the unified token space enables visual generation and comprehension tasks to mutually enhance each other, effectively removing the typical interference seen in earlier models. We show that existing LLMs can serve as strong foundations for Liquid, saving 100x in training costs while outperforming Chameleon in multimodal capabilities and maintaining language performance comparable to mainstream LLMs like LLAMA2. Liquid also outperforms models like SD v2.1 and SD-XL (FID of 5.47 on MJHQ-30K), excelling in both vision-language and text-only tasks. This work demonstrates that LLMs such as LLAMA3.2 and GEMMA2 are powerful multimodal generators, offering a scalable solution for enhancing both vision-language understanding and generation. The code and models will be released at https://github.com/FoundationVision/Liquid.
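
To make the idea of a shared token space concrete, the following is a minimal sketch of one common way to place discrete image codes and text tokens in a single vocabulary: image code indices are offset past the text vocabulary so that one embedding table and one next-token objective can cover both modalities. It is not taken from the Liquid codebase. The vocabulary size, codebook size, boundary tokens, and all function names are hypothetical values chosen for illustration only.

```python
# Illustrative sketch only: sizes and special tokens are assumptions,
# not values reported for Liquid.
TEXT_VOCAB_SIZE = 32000       # hypothetical text vocabulary size
IMAGE_CODEBOOK_SIZE = 8192    # hypothetical number of discrete image codes

# Hypothetical boundary tokens marking where an image span starts and ends.
BOI_ID = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE
EOI_ID = BOI_ID + 1
UNIFIED_VOCAB_SIZE = EOI_ID + 1


def image_code_to_token(code: int) -> int:
    """Map an image codebook index to an id in the unified vocabulary."""
    if not 0 <= code < IMAGE_CODEBOOK_SIZE:
        raise ValueError(f"image code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def token_to_image_code(token: int) -> int:
    """Inverse mapping, used when decoding generated image tokens."""
    if not TEXT_VOCAB_SIZE <= token < TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE:
        raise ValueError(f"not an image token: {token}")
    return token - TEXT_VOCAB_SIZE


def build_sequence(text_ids: list[int], image_codes: list[int]) -> list[int]:
    """Concatenate a text prompt and an image into one token sequence."""
    image_tokens = [image_code_to_token(c) for c in image_codes]
    return list(text_ids) + [BOI_ID] + image_tokens + [EOI_ID]


sequence = build_sequence([5, 17], [0, 8191])
```

Under these assumptions, an existing LLM would only need its embedding table and output head extended from `TEXT_VOCAB_SIZE` to `UNIFIED_VOCAB_SIZE` rows; the image tokenizer that produces the codes is outside the scope of this sketch.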