Abstract
We propose VLR-Bench, a visual question answering (VQA) benchmark for evaluating vision language models (VLMs) based on retrieval-augmented generation (RAG). Unlike existing evaluation datasets for external knowledge-based VQA, the proposed VLR-Bench includes five input passages. This allows testing of the ability to determine which passage is useful for answering a given query, a capability lacking in previous research. In this context, we constructed a dataset of 32,000 automatically generated instruction-following examples, which we denote as VLR-IF. This dataset is specifically designed to enhance the RAG capabilities of VLMs by enabling them to learn how to generate appropriate answers based on input passages. We evaluated the validity of the proposed benchmark and training data and verified its performance using the state-of-the-art Llama3-based VLM, the Llava-Llama-3 model. The proposed VLR-Bench and VLR-IF datasets are publicly available online.