CUE-M: Contextual Understanding and Enhanced Search with Multimodal Large Language Model

Abstract

The integration of Retrieval-Augmented Generation (RAG) with Multimodal Large Language Models (MLLMs) has revolutionized information retrieval and expanded the practical applications of AI. However, current systems struggle in accurately interpreting user intent, employing diverse retrieval strategies, and effectively filtering unintended or inappropriate responses, limiting their effectiveness. This paper introduces Contextual Understanding and Enhanced Search with MLLM (CUE-M), a novel multimodal search framework that addresses these challenges through a multi-stage pipeline comprising image context enrichment, intent refinement, contextual query generation, external API integration, and relevance-based filtering. CUE-M incorporates a robust filtering pipeline combining image-based, text-based, and multimodal classifiers, dynamically adapting to instance- and category-specific concerns defined by organizational policies. Extensive experiments on real-world datasets and public benchmarks for knowledge-based VQA and safety demonstrate that CUE-M outperforms baselines and establishes new state-of-the-art results, advancing the capabilities of multimodal retrieval systems.