Abstract
Information Retrieval (IR) methods aim to identify documents relevant to a query and have been widely applied in various natural language tasks. However, existing approaches typically consider only the textual content within documents, overlooking the fact that documents can contain multiple modalities, including images and tables. They also often segment each long document into multiple discrete passages for embedding, which prevents them from capturing the overall document context and the interactions between paragraphs. To address these two challenges, we propose a method that holistically embeds documents interleaved with multiple modalities by leveraging recent vision-language models, which can process and integrate text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we merge the representations of segmented passages into a single document representation, and we additionally introduce a reranking strategy to decouple and identify the relevant passage within the document when necessary. Through extensive experiments on diverse IR scenarios with both textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to its consideration of the multimodal information within documents.
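As a rough illustration of the two-stage idea described above (document-level retrieval over merged passage representations, followed by passage-level reranking), the following is a minimal sketch under stated assumptions. The abstract does not specify how passage representations are merged or how the reranker scores passages, so mean pooling and cosine similarity are used here purely as stand-ins. The function names `mean_pool`, `cosine`, and `retrieve_then_rerank` are hypothetical, and the toy vectors take the place of embeddings that would in practice come from a vision-language model.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def mean_pool(passage_vecs: Sequence[Sequence[float]]) -> Vector:
    """Average passage embeddings into one document embedding.

    Mean pooling is an assumed merge rule, chosen only for illustration.
    """
    if not passage_vecs:
        raise ValueError("need at least one passage embedding")
    dim = len(passage_vecs[0])
    return [sum(v[i] for v in passage_vecs) / len(passage_vecs) for i in range(dim)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_then_rerank(
    query: Sequence[float],
    corpus: Dict[str, List[Vector]],
    top_k: int = 1,
) -> List[Tuple[str, int]]:
    """Return (doc_id, best_passage_index) pairs for the top_k documents."""
    # Stage 1: score each document by its merged (document-level) embedding.
    ranked = sorted(
        ((cosine(query, mean_pool(passages)), doc_id)
         for doc_id, passages in corpus.items()),
        reverse=True,
    )
    results = []
    for _, doc_id in ranked[:top_k]:
        passages = corpus[doc_id]
        # Stage 2: within a retrieved document, pick the best-matching passage.
        # A plain similarity score stands in for the reranking strategy here.
        best = max(range(len(passages)), key=lambda i: cosine(query, passages[i]))
        results.append((doc_id, best))
    return results


# Toy corpus: each document is a list of passage embeddings (made-up values).
corpus = {
    "doc_a": [[1.0, 0.0], [1.0, 0.5]],
    "doc_b": [[0.0, 1.0], [0.1, 0.9]],
}
top = retrieve_then_rerank([1.0, 0.5], corpus)
print(top)
```

In this sketch the document is the unit of retrieval, and passage identification is deferred to the second stage, which mirrors the ordering described in the abstract; the specific scoring functions are placeholders rather than the proposed method.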