Abstract
Multimodal multihop question answering is a complex task that requiresreasoning over multiple sources of information, such as images and text, toanswer questions. While there has been significant progress in visual questionanswering, the multihop setting remains unexplored due to the lack ofhigh-quality datasets. Current methods focus on single-hop question answeringor a single modality, which makes them unsuitable for real-world scenarios suchas analyzing multimodal educational materials, summarizing lengthy academicarticles, or interpreting scientific studies that combine charts, images, andtext. To address this gap, we propose a novel methodology, introducing thefirst framework for creating a high-quality dataset that enables trainingmodels for multimodal multihop question answering. Our approach consists of a5-stage pipeline that involves acquiring relevant multimodal documents fromWikipedia, synthetically generating high-level questions and answers, andvalidating them through rigorous criteria to ensure quality data. We evaluateour methodology by training models on our synthesized dataset and testing ontwo benchmarks, our results demonstrate that, with an equal sample size, modelstrained on our synthesized data outperform those trained on human-collecteddata by 1.9 in exact match (EM) on average. We believe our data synthesismethod will serve as a strong foundation for training and evaluating multimodalmultihop question answering models.