WIKIGENBENCH: Exploring Full-length Wikipedia Generation under Real-World Scenario

Abstract

Generating comprehensive and accurate Wikipedia articles for newly emerging events under a real-world scenario presents significant challenges. Existing attempts fall short either by focusing only on short snippets or by using metrics that are insufficient to evaluate real-world scenarios. In this paper, we construct WIKIGENBENCH, a new benchmark consisting of 1,320 entries, designed to align with real-world scenarios in both generation and evaluation. For generation, we explore a real-world scenario where structured, full-length Wikipedia articles with citations are generated for new events using input documents from web sources. For evaluation, we integrate systematic metrics and LLM-based metrics to assess the verifiability, organization, and other aspects aligned with real-world scenarios. Based on this benchmark, we conduct extensive experiments using various models within three commonly used frameworks: direct RAG, hierarchical structure-based RAG, and RAG with a fine-tuned generation model. Experimental results show that hierarchical structure-based methods can generate more comprehensive content, while fine-tuned methods achieve better verifiability. However, even the best methods still show a significant gap compared to existing Wikipedia content, indicating that further research is necessary.