Abstract
How can we enable models to comprehend video anomalies occurring over varying temporal scales and contexts? Traditional Video Anomaly Understanding (VAU) methods focus on frame-level anomaly prediction, often missing the interpretability of complex and diverse real-world anomalies. Recent multimodal approaches leverage visual and textual data but lack hierarchical annotations that capture both short-term and long-term anomalies. To address this challenge, we introduce HIVAU-70k, a large-scale benchmark for hierarchical video anomaly understanding across any granularity. We develop a semi-automated annotation engine that efficiently scales high-quality annotations by combining manual video segmentation with recursive free-text annotation using large language models (LLMs). This results in over 70,000 multi-granular annotations organized at clip-level, event-level, and video-level segments. For efficient anomaly detection in long videos, we propose the Anomaly-focused Temporal Sampler (ATS). ATS integrates an anomaly scorer with a density-aware sampler to adaptively select frames based on anomaly scores, ensuring that the multimodal LLM concentrates on anomaly-rich regions, which significantly enhances both efficiency and accuracy. Extensive experiments demonstrate that our hierarchical instruction data markedly improves anomaly comprehension. The integrated ATS and visual-language model outperform traditional methods in processing long videos. Our benchmark and model are publicly available at https://github.com/pipixin321/HolmesVAU.
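
To make the idea of density-aware frame selection concrete, the following is a minimal sketch and is not taken from the released implementation. It assumes per-frame anomaly scores in [0, 1] are already available from some upstream scorer, and it uses inverse-CDF sampling over the cumulative scores as one plausible way to concentrate frames in anomaly-rich regions. The function name `density_aware_sample` and the smoothing constant `tau` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left
from typing import List, Sequence


def density_aware_sample(
    scores: Sequence[float], num_frames: int, tau: float = 0.1
) -> List[int]:
    """Pick up to `num_frames` frame indices, denser where scores are high.

    `scores` are per-frame anomaly scores from a hypothetical upstream scorer.
    `tau` is an assumed smoothing constant so that low-score regions still
    carry some probability mass.
    """
    if not scores or num_frames <= 0:
        return []

    # Treat (score + tau) as an unnormalised density and accumulate it.
    cumulative: List[float] = []
    total = 0.0
    for s in scores:
        total += s + tau
        cumulative.append(total)

    # Split the total mass into equal parts and take the frame in which the
    # midpoint of each part falls (inverse-CDF lookup on a uniform grid).
    picks = []
    for k in range(num_frames):
        target = (k + 0.5) / num_frames * total
        idx = bisect_left(cumulative, target)
        picks.append(min(idx, len(scores) - 1))

    # Drop duplicates and return indices in temporal order.
    return sorted(set(picks))


if __name__ == "__main__":
    # Toy score curve: 10 normal frames, 10 anomalous frames, 10 normal frames.
    toy_scores = [0.0] * 10 + [1.0] * 10 + [0.0] * 10
    print(density_aware_sample(toy_scores, num_frames=6))
```

With a larger `tau` the selection approaches uniform sampling, and with a smaller `tau` it focuses almost entirely on high-score frames. The selected indices would then determine which frames are passed to the multimodal LLM.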