LEGION: Learning to Ground and Explain for Synthetic Image Detection

Abstract

The rapid advancements in generative technology have emerged as a double-edged sword. While offering powerful tools that enhance convenience, they also pose significant social concerns. As defenders, current synthetic image detection methods often lack artifact-level textual interpretability and are overly focused on image manipulation detection, and current datasets usually suffer from outdated generators and a lack of fine-grained annotations. In this paper, we introduce SynthScars, a high-quality and diverse dataset consisting of 12,236 fully synthetic images with human-expert annotations. It features 4 distinct image content types, 3 categories of artifacts, and fine-grained annotations covering pixel-level segmentation, detailed textual explanations, and artifact category labels. Furthermore, we propose LEGION (LEarning to Ground and explain for Synthetic Image detectiON), a multimodal large language model (MLLM)-based image forgery analysis framework that integrates artifact detection, segmentation, and explanation. Building upon this capability, we further explore LEGION as a controller, integrating it into image refinement pipelines to guide the generation of higher-quality and more realistic images. Extensive experiments show that LEGION outperforms existing methods across multiple benchmarks, particularly surpassing the second-best traditional expert on SynthScars by 3.31% in mIoU and 7.75% in F1 score. Moreover, the refined images generated under its guidance exhibit stronger alignment with human preferences. The code, model, and dataset will be released.