V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Abstract

Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios, particularly in tasks involving videos, high-resolution images, or lengthy image-text documents. In our work, we first conduct an empirical analysis of the long-context capabilities of VLMs using our augmented long-context multimodal datasets. Our findings reveal that directly applying the positional encoding mechanism used for textual tokens to visual tokens is suboptimal, and that VLM performance degrades sharply when the position encoding exceeds the model's context window. To address this, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens, enabling more efficient management of long multimodal sequences. Our experiments demonstrate that V2PE enhances VLMs' ability to understand and reason over long multimodal contexts. We further integrate V2PE with our augmented long-context multimodal datasets to fine-tune the open-source VLM InternVL2. The fine-tuned model achieves strong performance on both standard and long-context multimodal tasks. Notably, when the sequence length of the training dataset is increased to 256K tokens, the model is capable of processing multimodal sequences of up to 1M tokens, highlighting its potential for real-world long-context applications.
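
The core idea of using smaller position increments for visual tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `assign_positions`, the parameter `visual_step`, the value 0.25, and the choice to assign a token's position before advancing the counter are all assumptions made for illustration. The abstract states only that the increments for visual tokens are variable and smaller than those for text tokens.

```python
from typing import List


def assign_positions(is_visual: List[bool], visual_step: float = 0.25) -> List[float]:
    """Assign one position index per token (illustrative sketch only).

    Text tokens advance the position counter by 1, as in standard
    positional encoding. Visual tokens advance it by `visual_step`,
    a hypothetical increment in (0, 1], so a long run of visual tokens
    consumes less of the model's positional range.
    """
    if not 0 < visual_step <= 1:
        raise ValueError("visual_step must be in (0, 1]")

    positions: List[float] = []
    current = 0.0
    for visual in is_visual:
        positions.append(current)
        # Smaller step for visual tokens, unit step for text tokens.
        current += visual_step if visual else 1.0
    return positions


# One text token, four visual tokens, one text token.
tokens = [False, True, True, True, True, False]
print(assign_positions(tokens, visual_step=0.25))
# [0.0, 1.0, 1.25, 1.5, 1.75, 2.0]
```

In this sketch the four visual tokens together span one unit of position rather than four, so the final text token sits at position 2.0 instead of 5.0. Setting `visual_step=1.0` recovers the usual scheme in which every token advances the position by 1.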