Abstract
Indexing is an important step towards strong performance in retrieval-augmented generation (RAG) systems. However, existing methods organize data based on either semantic similarity (similarity) or related information (relatedness), but do not cover both perspectives comprehensively. Our analysis reveals that modeling only one perspective results in insufficient knowledge synthesis, leading to suboptimal performance on complex tasks requiring multihop reasoning. In this paper, we propose SiReRAG, a novel RAG indexing approach that explicitly considers both similar and related information. On the similarity side, we follow existing work and explore some variants to construct a similarity tree based on recursive summarization. On the relatedness side, SiReRAG extracts propositions and entities from texts, groups propositions via shared entities, and generates recursive summaries to construct a relatedness tree. We index and flatten both similarity and relatedness trees into a unified retrieval pool. Our experiments demonstrate that SiReRAG consistently outperforms state-of-the-art indexing methods on three multihop datasets (MuSiQue, 2WikiMultiHopQA, and HotpotQA), with an average 1.9% improvement in F1 scores. As a reasonably efficient solution, SiReRAG enhances existing reranking methods significantly, with up to 7.8% improvement in average F1 scores.
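To make the relatedness side more concrete, the following is a minimal illustrative sketch of grouping propositions via shared entities and flattening nodes from both trees into one retrieval pool. It is not the SiReRAG implementation: the function names, the example propositions, and the rule that an entity must appear in at least two propositions to form a group are all assumptions made for illustration, and the recursive summarization step (which would require a language model) is omitted.

```python
from collections import defaultdict


def group_by_shared_entities(propositions, min_size=2):
    """Group proposition texts by the entities they mention.

    `propositions` is a list of (text, entities) pairs. An entity forms a
    group only if at least `min_size` propositions mention it (the
    threshold is an assumption of this sketch).
    """
    groups = defaultdict(list)
    for text, entities in propositions:
        # sorted(set(...)) removes duplicate entities and keeps order stable
        for entity in sorted(set(entities)):
            groups[entity].append(text)
    return {e: texts for e, texts in groups.items() if len(texts) >= min_size}


def flatten_pool(similarity_nodes, relatedness_nodes):
    """Merge nodes from both trees into one list, dropping exact duplicates."""
    seen = set()
    pool = []
    for node in list(similarity_nodes) + list(relatedness_nodes):
        if node not in seen:
            seen.add(node)
            pool.append(node)
    return pool


# Hypothetical propositions with their extracted entities.
props = [
    ("Marie Curie was born in Warsaw.", ["Marie Curie", "Warsaw"]),
    ("Marie Curie won the Nobel Prize in Physics in 1903.",
     ["Marie Curie", "Nobel Prize in Physics"]),
    ("Warsaw is the capital of Poland.", ["Warsaw", "Poland"]),
]
groups = group_by_shared_entities(props)

# Placeholder similarity-tree nodes (chunks plus one summary).
similarity_nodes = ["chunk 1", "chunk 2", "summary of chunks 1-2"]
relatedness_nodes = [text for texts in groups.values() for text in texts]
pool = flatten_pool(similarity_nodes, relatedness_nodes)
```

In this toy example, the first and third propositions land in the same group through the shared entity "Warsaw", which is the kind of connection a multihop question such as "In which country was Marie Curie born?" depends on. In a fuller pipeline, each group would be summarized recursively to form higher tree levels before flattening.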