Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?

Abstract

Large Language Models (LLMs) are known to be susceptible to crafted adversarial attacks or jailbreaks that lead to the generation of objectionable content despite being aligned to human preferences using safety fine-tuning methods. While the large dimensionality of the input token space makes it inevitable that adversarial prompts capable of jailbreaking these models can be found, we aim to evaluate whether safety fine-tuned LLMs are safe against natural prompts that are semantically related to toxic seed prompts which elicit safe responses after alignment. Surprisingly, we find that popular aligned LLMs such as GPT-4 can be compromised using naive prompts that are NOT even crafted with the objective of jailbreaking the model. Furthermore, we empirically show that given a seed prompt that elicits a toxic response from an unaligned model, one can systematically generate several semantically related natural prompts that can jailbreak aligned LLMs. To this end, we propose Response Guided Question Augmentation (ReG-QA), a method for evaluating the generalization of safety-aligned LLMs to natural prompts. It first generates several toxic answers to a seed question using an unaligned LLM (Q to A), and then leverages an LLM to generate questions that are likely to produce these answers (A to Q). Interestingly, we find that safety fine-tuned LLMs such as GPT-4o are vulnerable to producing natural jailbreak questions from unsafe content (without denial) and can thus be used for the latter (A to Q) step. We obtain attack success rates that are comparable to or better than leading adversarial attack methods on the JailbreakBench leaderboard, while being significantly more stable against defenses such as Smooth-LLM and Synonym Substitution, which are effective against all existing attacks on the leaderboard.
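
As an illustration only, the two-step structure described in the abstract (Q to A, then A to Q) can be sketched as a composition of two sampling functions. This is a minimal sketch under stated assumptions, not the authors' implementation: the name `reg_qa_augment`, the two callables `sample_answers` and `sample_questions`, the default counts, and the de-duplication step are all hypothetical and do not come from the paper. No model, prompt template, or API is assumed; each callable stands in for whatever LLM wrapper an evaluator supplies.

```python
from typing import Callable, List

# Hypothetical interface (an assumption, not from the paper): a sampler takes
# an input text and a count n, and returns up to n generated texts.
Sampler = Callable[[str, int], List[str]]


def reg_qa_augment(
    seed_question: str,
    sample_answers: Sampler,    # Q to A step: stands in for the unaligned LLM
    sample_questions: Sampler,  # A to Q step: stands in for the question-generating LLM
    n_answers: int = 3,
    n_questions_per_answer: int = 2,
) -> List[str]:
    """Return candidate questions derived from a single seed question."""
    augmented: List[str] = []
    seen = set()
    # Step 1 (Q to A): sample several answers to the seed question.
    for answer in sample_answers(seed_question, n_answers):
        # Step 2 (A to Q): sample questions likely to produce each answer.
        for question in sample_questions(answer, n_questions_per_answer):
            cleaned = question.strip()
            # De-duplication is an added convenience, not part of the description.
            if cleaned and cleaned not in seen:
                seen.add(cleaned)
                augmented.append(cleaned)
    return augmented
```

The returned questions would then be posed to the safety-aligned model under evaluation; how responses are judged and how the attack success rate is aggregated is left out here because the abstract does not specify it.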