Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey

Abstract

The rapid evolution of multimodal foundation models has led to significant advancements in cross-modal understanding and generation across diverse modalities, including text, images, audio, and video. However, these models remain susceptible to jailbreak attacks, which can bypass built-in safety mechanisms and induce the production of potentially harmful content. Consequently, understanding the methods of jailbreak attacks and existing defense mechanisms is essential to ensure the safe deployment of multimodal generative models in real-world scenarios, particularly in security-sensitive applications. To provide comprehensive insight into this topic, this survey reviews jailbreak and defense in multimodal generative models. First, given the generalized lifecycle of multimodal jailbreak, we systematically explore attacks and corresponding defense strategies across four levels: input, encoder, generator, and output. Based on this analysis, we present a detailed taxonomy of attack methods, defense mechanisms, and evaluation frameworks specific to multimodal generative models. Additionally, we cover a wide range of input-output configurations, including modalities such as Any-to-Text, Any-to-Vision, and Any-to-Any within generative systems. Finally, we highlight current research challenges and propose potential directions for future research. The open-source repository corresponding to this work can be found at https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak.