Abstract
We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.