Abstract
Recent advances in Multi-Modal Large Language Models (M-LLMs) show promising results in video reasoning. Popular M-LLM frameworks usually apply naive uniform sampling to reduce the number of video frames that are fed into an M-LLM, particularly for long-context videos. However, uniform sampling can miss crucial context in certain periods of a video, leaving the downstream M-LLM without sufficient visual information to answer a question. To address this limitation, we propose a lightweight M-LLM-based frame selection method that adaptively selects the frames most relevant to users' queries. To train the proposed frame selector, we introduce two supervision signals: (i) a spatial signal, in which an importance score for each single frame is obtained by prompting an M-LLM; and (ii) a temporal signal, in which multiple frames are selected by prompting a Large Language Model (LLM) with the captions of all frame candidates. The selected frames are then digested by a frozen downstream video M-LLM for visual reasoning and question answering. Empirical results show that the proposed M-LLM video frame selector improves the performance of various downstream video Large Language Models (video-LLMs) across medium-context (ActivityNet, NExT-QA) and long-context (EgoSchema, LongVideoBench) video question answering benchmarks.