Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Abstract

In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.