Abstract
Large language model (LLM) agents increasingly employ retrieval-augmented generation (RAG) to improve the factuality of their responses. In practice, however, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources while also suppressing inaccurate information from noisy or irrelevant documents. Prior work has generally studied and addressed these challenges in isolation, considering only one aspect at a time, such as handling ambiguity or robustness to noise and misinformation. We instead consider multiple factors simultaneously, proposing (i) RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios of conflicting evidence for a user query, including ambiguity, misinformation, and noise; and (ii) MADAM-RAG, a multi-agent approach in which LLM agents debate the merits of an answer over multiple rounds, allowing an aggregator to collate responses corresponding to disambiguated entities while discarding misinformation and noise, thereby handling diverse sources of conflict jointly. We demonstrate the effectiveness of MADAM-RAG using both closed and open-source models on AmbigDocs, which requires presenting all valid answers for ambiguous queries, where it improves over strong RAG baselines by up to 11.40%, and on FaithEval, which requires suppressing misinformation, where it improves by up to 15.80% (absolute) with Llama3.3-70B-Instruct. Furthermore, we find that RAMDocs poses a challenge for existing RAG baselines (Llama3.3-70B-Instruct obtains an exact match score of only 32.60). While MADAM-RAG begins to address these conflicting factors, our analysis indicates that a substantial gap remains, especially when increasing the level of imbalance in supporting evidence and misinformation.
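
The debate-then-aggregate loop summarized above could be sketched roughly as follows. This is an illustrative outline only, not the MADAM-RAG implementation: the abstract does not specify how agents are assigned, what they exchange between rounds, or when the debate stops. The one-agent-per-document assignment, the `debate` function, the `agent` and `aggregator` callables, the "stop when answers no longer change" rule, and the toy stand-ins for LLM calls are all assumptions made for this sketch.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions, not from the paper):
# an agent maps (query, document, previous aggregate summary) to an answer;
# an aggregator maps (query, all agent answers) to one combined response.
Agent = Callable[[str, str, str], str]
Aggregator = Callable[[str, List[str]], str]


def debate(query: str, documents: List[str], agent: Agent,
           aggregator: Aggregator, max_rounds: int = 3) -> str:
    """Run up to max_rounds of debate, one agent call per document per round."""
    summary = ""      # aggregate shown to agents in the next round
    previous = None   # answers from the prior round, used for early stopping
    for _ in range(max_rounds):
        answers = [agent(query, doc, summary) for doc in documents]
        summary = aggregator(query, answers)
        if answers == previous:  # assumed stopping rule: no agent changed its answer
            break
        previous = answers
    return summary


def toy_agent(query: str, document: str, summary: str) -> str:
    # Stand-in for an LLM call: read the answer after "=>", else abstain.
    return document.split("=>", 1)[1].strip() if "=>" in document else "unknown"


def toy_aggregator(query: str, answers: List[str]) -> str:
    # Stand-in for an LLM aggregator: keep distinct non-abstaining answers.
    kept = sorted({a for a in answers if a != "unknown"})
    return "; ".join(kept)
```

With toy documents such as `"entity A => 1990"`, `"entity B => 1975"`, and an irrelevant passage, the sketch keeps one answer per disambiguated entity and drops the abstaining agent. A real aggregator would also need to weigh the agents' arguments in order to discard misinformation, which these stand-ins do not attempt.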