Abstract
Large Language Models (LLMs) have demonstrated remarkable proficiency in generating code. However, the misuse of LLM-generated (synthetic) code has raised concerns in both educational and industrial contexts, underscoring the urgent need for synthetic code detectors. Existing methods for detecting synthetic content are primarily designed for general text and struggle with code due to the unique grammatical structure of programming languages and the presence of numerous "low-entropy" tokens. Building on this, our work proposes a novel zero-shot synthetic code detector based on the similarity between the original code and its LLM-rewritten variants. Our method is based on the observation that differences between LLM-rewritten and original code tend to be smaller when the original code is synthetic. We utilize self-supervised contrastive learning to train a code similarity model and evaluate our approach on two synthetic code detection benchmarks. Our results demonstrate a significant improvement over existing SOTA synthetic content detectors, with AUROC scores increasing by 20.5% on the APPS benchmark and 29.1% on the MBPP benchmark.
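The rewrite-and-compare idea can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this work: `rewrite` is a hypothetical callable standing in for an LLM rewriting call, `n_rewrites` is an assumed parameter, and `token_similarity` is a simple token-overlap measure from the standard library swapped in for the contrastively trained code similarity model. A higher score means the rewrites stay closer to the input, which under the observation above suggests the input is more likely synthetic.

```python
from difflib import SequenceMatcher
from typing import Callable


def token_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] over whitespace-split tokens.

    Assumption: this replaces the learned code similarity model purely
    for illustration.
    """
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def detection_score(
    code: str,
    rewrite: Callable[[str], str],  # hypothetical LLM rewriting function
    n_rewrites: int = 4,            # assumed number of rewritten variants
    similarity: Callable[[str, str], float] = token_similarity,
) -> float:
    """Average similarity between `code` and its rewritten variants."""
    if n_rewrites < 1:
        raise ValueError("n_rewrites must be at least 1")
    variants = [rewrite(code) for _ in range(n_rewrites)]
    return sum(similarity(code, v) for v in variants) / len(variants)


if __name__ == "__main__":
    sample = "def f(x): return x + 1"
    # Toy rewriters: one leaves the code unchanged, one replaces it entirely.
    unchanged = detection_score(sample, lambda c: c)
    replaced = detection_score(sample, lambda c: "lambda q: q")
    print(unchanged, replaced)
```

In practice the score would be thresholded, or used directly as the ranking statistic when computing AUROC; the threshold and the number of rewrites are choices this sketch does not take from the source.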