Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1

Abstract

Recent advancements in Chain of Thought (COT) generation have significantlyimproved the reasoning capabilities of Large Language Models (LLMs), withreinforcement learning (RL) emerging as an effective post-training approach.Multimodal Large Language Models (MLLMs) inherit this reasoning potential butremain underexplored in tasks requiring both perception and logical reasoning.To address this, we introduce SEED-Bench-R1, a benchmark designed tosystematically evaluate post-training methods for MLLMs in video understanding.It includes intricate real-world videos and complex everyday planning tasks inthe format of multiple-choice questions, requiring sophisticated perception andreasoning. SEED-Bench-R1 assesses generalization through a three-levelhierarchy: in-distribution, cross-environment, and cross-environment-taskscenarios, equipped with a large-scale training dataset with easily verifiableground-truth answers. Using Qwen2-VL-Instruct-7B as a base model, we compare RLwith supervised fine-tuning (SFT), demonstrating RL's data efficiency andsuperior performance on both in-distribution and out-of-distribution tasks,even outperforming SFT on general video understanding benchmarks likeLongVideoBench. Our detailed analysis reveals that RL enhances visualperception but often produces less logically coherent reasoning chains. Weidentify key limitations such as inconsistent reasoning and overlooked visualcues, and suggest future improvements in base model reasoning, reward modeling,and RL robustness against noisy signals.