SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage

Abstract

Large language models (LLMs) have made significant advancements across various tasks, but their safety alignment remains a major concern. Exploring jailbreak prompts can expose LLMs' vulnerabilities and guide efforts to secure them. Existing methods primarily design sophisticated instructions for the LLM to follow, or rely on multiple iterations, which could hinder the performance and efficiency of jailbreaks. In this work, we propose a novel jailbreak paradigm, Simple Assistive Task Linkage (SATA), which can effectively circumvent LLM safeguards and elicit harmful responses. Specifically, SATA first masks harmful keywords within a malicious query to generate a relatively benign query containing one or multiple [MASK] special tokens. It then employs a simple assistive task, such as a masked language model task or an element lookup by position task, to encode the semantics of the masked keywords. Finally, SATA links the assistive task with the masked query to jointly perform the jailbreak. Extensive experiments show that SATA achieves state-of-the-art performance and outperforms baselines by a large margin. Specifically, on the AdvBench dataset, with the masked language model (MLM) assistive task, SATA achieves an overall attack success rate (ASR) of 85% and a harmful score (HS) of 4.57, and with the element lookup by position (ELP) assistive task, SATA attains an overall ASR of 76% and an HS of 4.43.