PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation

  • 2024-12-19 18:59:44
  • Muntasir Wahed, Kiet A. Nguyen, Adheesh Sunil Juvekar, Xinzhuo Li, Xiaona Zhou, Vedant Shah, Tianjiao Yu, Pinar Yanardag, Ismini Lourentzou

Abstract

Despite significant advancements in Large Vision-Language Models (LVLMs), existing pixel-grounding models operate on single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning segmentation, and PRIMA, a novel LVLM that integrates pixel-level grounding with robust multi-image reasoning capabilities to produce contextually rich, pixel-grounded explanations. Central to PRIMA is an efficient vision module that queries fine-grained visual representations across multiple images, reducing TFLOPs by $25.3\%$. To support training and evaluation, we curate $M^4Seg$, a new reasoning segmentation benchmark consisting of $\sim$224K question-answer pairs that require fine-grained visual understanding across multiple images. Experimental results demonstrate PRIMA outperforms state-of-the-art baselines.