"Moralized" Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks

Abstract

As large language models are applied in a growing range of fields, identifying harmful generated content and keeping guardrail mechanisms effective become more challenging. This research evaluates the guardrail effectiveness of GPT-4o, Grok-2 Beta, Llama 3.1 (405B), Gemini 1.5, and Claude 3.5 Sonnet through black-box testing with seemingly ethical multi-step jailbreak prompts. It conducts ethically framed attacks by designing an identical set of multi-step prompts that simulates the scenario of "corporate middle managers competing for promotions." The results show that the guardrails of the above-mentioned LLMs were bypassed and verbal-attack content was generated. Claude 3.5 Sonnet showed more evident resistance to the multi-step jailbreak prompts than the other models. To ensure objectivity, the experimental process, black-box test code, and enhanced guardrail code are uploaded to the GitHub repository: https://github.com/brucewang123456789/GeniusTrail.git.