Detecting LLM-Written Peer Reviews

Abstract

Editors of academic journals and program chairs of conferences require peer reviewers to write their own reviews. However, there is growing concern about the rise of lazy reviewing practices, where reviewers use large language models (LLMs) to generate reviews instead of writing them independently. Existing tools for detecting LLM-generated content are not designed to differentiate between fully LLM-generated reviews and those merely polished by an LLM. In this work, we employ a straightforward approach to identify LLM-generated reviews: an indirect prompt injection via the paper PDF that asks the LLM to embed a watermark. Our focus is on presenting watermarking schemes and statistical tests that maintain a bounded family-wise error rate when a venue evaluates multiple reviews, with higher power than standard methods such as Bonferroni correction. These guarantees hold without relying on any assumptions about human-written reviews. We also consider various methods for prompt injection, including font embedding and jailbreaking. We evaluate the effectiveness and tradeoffs of these methods, including against different reviewer defenses. We find a high success rate in embedding our watermarks in LLM-generated reviews across models. We also find that our approach is resilient to common reviewer defenses, and that the bounds on error rates in our statistical tests hold in practice while retaining the power to flag LLM-generated reviews, whereas Bonferroni correction is infeasible.