Abstract
Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module to transform arbitrary well-trained image-based LLMs into video-LLMs (after being trained on video data). To better adapt image-LLMs for processing videos, we introduce two design principles: linear transformation to preserve the original visual-language alignment and representative information condensation from redundant video content. Guided by these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL, showcasing the high compatibility of LinVT. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, illustrating the effectiveness of LinVT in multi-modal video understanding.