Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension

Abstract

Existing large video-language models (LVLMs) struggle to comprehend long videos correctly due to limited context. To address this problem, fine-tuning long-context LVLMs and employing GPT-based agents have emerged as promising solutions. However, fine-tuning LVLMs requires extensive high-quality data and substantial GPU resources, while GPT-based agents rely on proprietary models (e.g., GPT-4o). In this paper, we propose Video Retrieval-Augmented Generation (Video-RAG), a training-free and cost-effective pipeline that employs visually-aligned auxiliary texts to facilitate cross-modality alignment while providing additional information beyond the visual content. Specifically, we leverage open-source external tools to extract visually-aligned information from pure video data (e.g., audio, optical character, and object detection), and incorporate the extracted information into an existing LVLM as auxiliary texts, alongside video frames and queries, in a plug-and-play manner. Our Video-RAG offers several key advantages: (i) lightweight with low computing overhead due to single-turn retrieval; (ii) easy implementation and compatibility with any LVLM; and (iii) significant, consistent performance gains across long video understanding benchmarks, including Video-MME, MLVU, and LongVideoBench. Notably, our model demonstrates superior performance over proprietary models like Gemini-1.5-Pro and GPT-4o when utilized with a 72B model.
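
As a rough illustration of the plug-and-play idea described above, the following minimal sketch assembles auxiliary texts into a single prompt for an LVLM. It is only a sketch under stated assumptions, not the implementation used in this work: the `AuxiliaryTexts` container, the function names, and the prompt layout are hypothetical, and the word-overlap score is a simple stand-in for whatever retrieval method is actually used. The texts are assumed to have already been produced by external tools (speech recognition, optical character recognition, object detection), and the video frames would be passed to the LVLM separately.

```python
from dataclasses import dataclass, field


@dataclass
class AuxiliaryTexts:
    """Texts extracted from a video by external tools (hypothetical container)."""
    asr: list = field(default_factory=list)  # speech transcripts
    ocr: list = field(default_factory=list)  # on-screen text
    det: list = field(default_factory=list)  # object-detection descriptions


def overlap_score(query: str, text: str) -> int:
    """Toy relevance score: number of lowercase word tokens shared by both strings."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, texts: list, top_k: int) -> list:
    """Single-turn retrieval: keep up to top_k texts with a nonzero score."""
    scored = [(overlap_score(query, t), t) for t in texts]
    relevant = [pair for pair in scored if pair[0] > 0]
    # sorted-in-place is stable, so ties keep their original order
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in relevant[:top_k]]


def build_prompt(query: str, aux: AuxiliaryTexts, top_k: int = 2) -> str:
    """Prepend the retrieved auxiliary texts to the question (assumed layout)."""
    sections = []
    for label, texts in (("ASR", aux.asr), ("OCR", aux.ocr), ("DET", aux.det)):
        hits = retrieve(query, texts, top_k)
        if hits:  # skip modalities with nothing relevant
            sections.append(f"{label}: " + " | ".join(hits))
    sections.append(f"Question: {query}")
    return "\n".join(sections)


if __name__ == "__main__":
    aux = AuxiliaryTexts(
        asr=["the chef adds salt to the soup", "welcome to the show"],
        ocr=["Soup of the day"],
        det=["a pot on a stove"],
    )
    print(build_prompt("what does the chef add to the soup", aux, top_k=1))
```

Because retrieval runs once per query and the result is plain text, a prompt built this way could in principle be handed to any LVLM that accepts text alongside frames, which is the sense in which the pipeline is training-free.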