HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models

Abstract

Recently, there has been growing interest in the ability of multimodal large language models (MLLMs) to process high-resolution images. A common approach is to dynamically crop the original high-resolution image into smaller sub-images, which are then fed into a vision encoder that was pre-trained on lower-resolution images. However, this cropping approach often truncates objects and connected areas in the original image, causing semantic breaks. To address this limitation, we introduce HyViLM, designed to process images of any resolution while retaining the overall context during encoding. Specifically, we: (i) design a new visual encoder called Hybrid Encoder that not only encodes individual sub-images but also interacts with detailed global visual features, significantly improving the model's ability to encode high-resolution images; (ii) propose an optimal feature fusion strategy for the dynamic cropping approach, effectively leveraging information from different layers of the vision encoder. Compared with state-of-the-art MLLMs under the same setting, HyViLM outperforms existing MLLMs in nine out of ten tasks. In particular, HyViLM achieves a 9.6% improvement on the TextVQA task and a 6.9% improvement on the DocVQA task.
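
To make the truncation problem concrete, the following is a minimal sketch of grid-based cropping, not the HyViLM pipeline itself. The tile size of 336 pixels, the helper names `tile_boxes` and `tiles_touching`, and the example object box are all assumptions chosen for illustration; practical dynamic-cropping schemes usually also resize the image to a chosen grid before tiling.

```python
from math import ceil


def tile_boxes(width, height, tile=336):
    """Return (left, top, right, bottom) boxes that cover a width x height
    image with a grid of tile x tile crops. Edge tiles are clipped to the
    image bounds. The default tile size is an assumed encoder input size."""
    cols = ceil(width / tile)
    rows = ceil(height / tile)
    return [
        (c * tile, r * tile, min((c + 1) * tile, width), min((r + 1) * tile, height))
        for r in range(rows)
        for c in range(cols)
    ]


def tiles_touching(box, tiles):
    """Return the tiles that overlap an object bounding box."""
    left, top, right, bottom = box
    return [
        t for t in tiles
        if t[0] < right and left < t[2] and t[1] < bottom and top < t[3]
    ]


# A 1008 x 672 image is cut into a 3 x 2 grid of 336 x 336 sub-images.
tiles = tile_boxes(1008, 672, tile=336)

# Hypothetical object that straddles the border between the first two tiles.
obj = (300, 100, 400, 200)
split = tiles_touching(obj, tiles)
```

Here `split` contains more than one tile, so an encoder that sees each sub-image in isolation receives only a fragment of the object in each crop. This is the kind of semantic break that motivates letting sub-image features interact with global features.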