CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models

Abstract

The remarkable performance of large language models (LLMs) in generationtasks has enabled practitioners to leverage publicly available models to powercustom applications, such as chatbots and virtual assistants. However, the dataused to train or fine-tune these LLMs is often undisclosed, allowing anattacker to compromise the data and inject backdoors into the models. In thispaper, we develop a novel inference time defense, named CLEANGEN, to mitigatebackdoor attacks for generation tasks in LLMs. CLEANGEN is a lightweight andeffective decoding strategy that is compatible with the state-of-the-art (SOTA)LLMs. Our insight behind CLEANGEN is that compared to other LLMs, backdooredLLMs assign significantly higher probabilities to tokens representing theattacker-desired contents. These discrepancies in token probabilities enableCLEANGEN to identify suspicious tokens favored by the attacker and replace themwith tokens generated by another LLM that is not compromised by the sameattacker, thereby avoiding generation of attacker-desired content. We evaluateCLEANGEN against five SOTA backdoor attacks. Our results show that CLEANGENachieves lower attack success rates (ASR) compared to five SOTA baselinedefenses for all five backdoor attacks. Moreover, LLMs deploying CLEANGENmaintain helpfulness in their responses when serving benign user queries withminimal added computational overhead.