Abstract
Existing claim verification datasets often do not require systems to perform complex reasoning or effectively interpret multimodal evidence. To address this, we introduce a new task: multi-hop multimodal claim verification. This task challenges models to reason over multiple pieces of evidence from diverse sources, including text, images, and tables, and determine whether the combined multimodal evidence supports or refutes a given claim. To study this task, we construct MMCV, a large-scale dataset comprising 15k multi-hop claims paired with multimodal evidence, generated and refined using large language models, with additional input from human feedback. We show that MMCV is challenging even for the latest state-of-the-art multimodal large language models, especially as the number of reasoning hops increases. Additionally, we establish a human performance benchmark on a subset of MMCV. We hope this dataset and its evaluation task will encourage future research in multimodal multi-hop claim verification.