Kimi-VL Technical Report

  • 2025-04-15 17:14:37
  • Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabin Zheng, Jiaming Li, Jianlin Su, Jianzhou Wang, Jiaqi Deng, Jiezhong Qiu, Jin Xie, Jinhong Wang, Jingyuan Liu, Junjie Yan, Kun Ouyang, Liang Chen, Lin Sui, Longhui Yu, Mengfan Dong, Mengnan Dong, Nuo Xu, Pengyu Cheng, Qizheng Gu, Runjie Zhou, Shaowei Liu, Sihan Cao, Tao Yu, Tianhui Song, Tongtong Bai, Wei Song, Weiran He, Weixiao Huang, Weixin Xu, Xiaokun Yuan, Xingcheng Yao, Xingzhe Wu, Xinxing Zu, Xinyu Zhou, Xinyuan Wang, Y. Charles, Ya

Abstract

We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.