Abstract
The increasing prevalence of large language models (LLMs) has significantly advanced text generation, but the human-like quality of LLM outputs presents major challenges in reliably distinguishing between human-authored and LLM-generated texts. Existing detection benchmarks are constrained by their reliance on static datasets, scenario-specific tasks (e.g., question answering and text refinement), and a primary focus on English, overlooking the diverse linguistic and operational subtleties of LLMs. To address these gaps, we propose CUDRT, a comprehensive evaluation framework and bilingual benchmark in Chinese and English, categorizing LLM activities into five key operations: Create, Update, Delete, Rewrite, and Translate. CUDRT provides extensive datasets tailored to each operation, featuring outputs from state-of-the-art LLMs to assess the reliability of LLM-generated text detectors. This framework supports scalable, reproducible experiments and enables in-depth analysis of how operational diversity, multilingual training sets, and LLM architectures influence detection performance. Our extensive experiments demonstrate the framework's capacity to optimize detection systems, providing critical insights to enhance reliability, cross-linguistic adaptability, and detection accuracy. By advancing robust methodologies for identifying LLM-generated texts, this work contributes to the development of intelligent systems capable of meeting real-world multilingual detection challenges. Source code and dataset are available at GitHub.