Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails

Abstract

Large Language Model (LLM) guardrail systems are designed to protect against prompt injection and jailbreak attacks. However, they remain vulnerable to evasion techniques. We demonstrate two approaches for bypassing LLM prompt injection and jailbreak detection systems: traditional character injection methods and algorithmic Adversarial Machine Learning (AML) evasion techniques. Through testing against six prominent protection systems, including Microsoft's Azure Prompt Shield and Meta's Prompt Guard, we show that both methods can be used to evade detection while maintaining adversarial utility, achieving in some instances up to 100% evasion success. Furthermore, we demonstrate that adversaries can enhance Attack Success Rates (ASR) against black-box targets by leveraging word importance rankings computed by offline white-box models. Our findings reveal vulnerabilities within current LLM protection mechanisms and highlight the need for more robust guardrail systems.