SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation

Abstract

We present SemHiTok, a unified image Tokenizer via Semantic-Guided Hierarchical codebook that provides consistent discrete feature representations for multimodal understanding and generation tasks. Recently, unified multimodal large models (MLLMs) for understanding and generation have sparked exploration within the research community. Previous works attempt to train a unified image tokenizer by combining loss functions for semantic feature reconstruction and pixel reconstruction. However, because multimodal understanding and generation tasks prioritize different levels of features, joint training methods face significant challenges in achieving a good trade-off. SemHiTok addresses this challenge through a Semantic-Guided Hierarchical codebook, which builds texture sub-codebooks on a pre-trained semantic codebook. This design decouples the training of semantic reconstruction and pixel reconstruction, and equips the tokenizer with low-level texture feature extraction capability without degrading its high-level semantic feature extraction ability. Our experiments demonstrate that SemHiTok achieves an excellent rFID score at 256×256 resolution compared to other unified tokenizers, and exhibits competitive performance on multimodal understanding and generation tasks.
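To make the hierarchical codebook idea concrete, the following is a minimal sketch of a two-level lookup, not the authors' implementation. It assumes that each entry of the semantic codebook owns one texture sub-codebook, that both levels use nearest-neighbor quantization under squared L2 distance, and that the pair of indices is flattened into a single token id. The function names, codebook sizes, feature dimensions, and the flattening rule are all hypothetical choices made for illustration.

```python
import random


def nearest(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def hierarchical_quantize(sem_feat, tex_feat, sem_codebook, tex_subcodebooks):
    """Pick a semantic code, then a texture code inside that code's sub-codebook."""
    # Level 1: the semantic codebook is treated as fixed (pre-trained).
    k = nearest(sem_feat, sem_codebook)
    # Level 2: only the sub-codebook attached to semantic code k is searched.
    j = nearest(tex_feat, tex_subcodebooks[k])
    # Assumption: all sub-codebooks share one size, so (k, j) flattens to one id.
    sub_size = len(tex_subcodebooks[k])
    return k, j, k * sub_size + j


# Hypothetical toy sizes: 4 semantic codes, 3 texture codes per semantic code.
rng = random.Random(0)
K, M, D_SEM, D_TEX = 4, 3, 8, 5
sem_codebook = [[rng.gauss(0, 1) for _ in range(D_SEM)] for _ in range(K)]
tex_subcodebooks = [
    [[rng.gauss(0, 1) for _ in range(D_TEX)] for _ in range(M)] for _ in range(K)
]

# Features copied from known entries, so the lookup must recover their indices.
sem_feat = list(sem_codebook[2])
tex_feat = list(tex_subcodebooks[2][1])
k, j, token = hierarchical_quantize(sem_feat, tex_feat, sem_codebook, tex_subcodebooks)
print(k, j, token)
```

In this sketch the semantic index depends only on the semantic feature, so adding or retraining texture sub-codebooks cannot change which semantic code is selected. That mirrors the decoupling the abstract describes, under the assumptions stated above.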