Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

  • 2024-12-16 17:39:39
  • Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, Paul Voigtlaender, Rohan Jain, Gabriela Surita, Kareem Mohamed, Rory Blevins, Junwhan Ahn, Tao Zhu, Kornraphop Kawintiranon, Orhan Firat, Yiming Gu, Yujing Zhang, Matthew Rahtz, Manaal Faruqui, Natalie Clay, Justin Gilmer, JD Co-Reyes, Ivo Penchev, Rui Zhu, Nobuyuki Morioka, Kevin Hui, Krishna Haridasan, Victor Campos, Mahdis Mahdieh, Mandy Guo, Samer Hassan, Kevin Kilgour, Arpi Vezer, Heng-Tze Cheng, Raoul de Liedekerke, Si

Abstract

In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.