Abstract
With the rise of large-scale language models (LLMs), it has become popular and effective to convert multimodal information into text descriptions for multimodal multi-hop question answering. However, we argue that current methods for multimodal multi-hop question answering still face two main challenges: 1) The retrieved evidence contains a large amount of redundant information, which inevitably leads to a significant drop in performance because irrelevant information misleads the prediction. 2) The reasoning process lacks interpretable reasoning steps, which makes it difficult for the model to discover logical errors when handling complex questions. To solve these problems, we propose a unified LLM-based approach that avoids relying heavily on LLMs, given their potential errors, and innovatively treat multimodal multi-hop question answering as a joint entailment tree generation and question answering problem. Specifically, we design a multi-task learning framework that facilitates common knowledge sharing across the interpretability and prediction tasks while using a mixture of experts to prevent task-specific errors from interfering with each other. Afterward, we design an iterative feedback mechanism that further enhances both tasks by feeding the results of the joint training back to the LLM to regenerate entailment trees, aiming to iteratively refine the potential answer. Notably, our method has ranked first on the official WebQA leaderboard (since April 10, 2024) and achieves competitive results on MultimodalQA.