Re-Imagining Multimodal Instruction Tuning: A Representation View

Abstract

Multimodal instruction tuning has proven to be an effective strategy for achieving zero-shot generalization by fine-tuning pre-trained Large Multimodal Models (LMMs) with instruction-following data. However, as the scale of LMMs continues to grow, fully fine-tuning these models has become highly parameter-intensive. Although Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced to reduce the number of tunable parameters, a significant performance gap remains compared to full fine-tuning. Furthermore, existing PEFT approaches are often highly parameterized, making them difficult to interpret and control. In light of this, we introduce Multimodal Representation Tuning (MRT), a novel approach that focuses on directly editing semantically rich multimodal representations to achieve strong performance and provide intuitive control over LMMs. Empirical results show that our method surpasses current state-of-the-art baselines with significant performance gains (e.g., 1580.40 MME score) while requiring substantially fewer tunable parameters (e.g., 0.03% parameters). Additionally, we conduct experiments on editing instrumental tokens within multimodal representations, demonstrating that direct manipulation of these representations enables simple yet effective control over network behavior.