Abstract
Frontier models have either been language-only or have primarily focused on vision and language modalities. Although recent models with vision and audio understanding capabilities have shown substantial progress, the field lacks a standardized evaluation framework for thoroughly assessing their cross-modality perception performance. We introduce MAVERIX (Multimodal Audio-Visual Evaluation Reasoning IndeX), a novel benchmark with 700 videos and 2,556 questions explicitly designed to evaluate multimodal models through tasks that necessitate close integration of video and audio information. MAVERIX uniquely provides models with audiovisual tasks, closely mimicking the multimodal perceptual experiences available to humans during inference and decision-making processes. To our knowledge, MAVERIX is the first benchmark aimed explicitly at assessing comprehensive audiovisual integration. Experiments with state-of-the-art models, including Gemini 1.5 Pro and o1, show accuracy of around 70%, whereas human experts reach near-ceiling performance (95.1%). With standardized evaluation protocols, a rigorous annotation pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence.