Bypassing the Safety Training of Open-Source LLMs with Priming Attacks

University of Illinois Urbana-Champaign, VMware Research

*Indicates equal contribution

Abstract

Content Warning: This work contains examples of harmful language.

With the recent surge in popularity of LLMs has come an ever-increasing need for LLM safety training. In this paper, we investigate the fragility of SOTA open-source LLMs under simple, optimization-free attacks we refer to as priming attacks, which are easy to execute and effectively bypass alignment from safety training. Our proposed attack improves the Attack Success Rate on Harmful Behaviors, as measured by Llama Guard, by up to 3.3x compared to baselines.
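As a rough illustration of how such a measurement could be reproduced, the following sketch scores (behavior, response) pairs with Llama Guard. It assumes the Hugging Face meta-llama/LlamaGuard-7b checkpoint with its built-in chat template, and the evaluation pairs shown are placeholders rather than our actual evaluation data:

# Sketch of computing an Attack Success Rate (ASR) with Llama Guard.
# Assumes the Hugging Face meta-llama/LlamaGuard-7b checkpoint; the
# (behavior, response) pairs below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

guard_name = "meta-llama/LlamaGuard-7b"  # assumed judge checkpoint
tokenizer = AutoTokenizer.from_pretrained(guard_name)
guard = AutoModelForCausalLM.from_pretrained(
    guard_name, torch_dtype=torch.float16, device_map="auto")

def is_unsafe(behavior: str, response: str) -> bool:
    # Llama Guard's chat template formats the conversation into its
    # moderation prompt; the model replies with "safe" or "unsafe ...".
    chat = [{"role": "user", "content": behavior},
            {"role": "assistant", "content": response}]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(guard.device)
    out = guard.generate(input_ids=input_ids, max_new_tokens=20, pad_token_id=0)
    verdict = tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return verdict.strip().startswith("unsafe")

pairs = [("<harmful behavior>", "<model response>")]  # placeholder data
asr = sum(is_unsafe(b, r) for b, r in pairs) / len(pairs)
print(f"ASR: {asr:.1%}")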

Overview

Autoregressive Large Language Models (LLMs) have emerged as powerful conversational agents widely used in user-facing applications. To ensure that LLMs cannot be used for nefarious purposes, they are extensively safety-trained for human alignment using techniques such as RLHF. Most current methods to circumvent safety training either require manual effort to craft cleverly designed "jailbreak" prompts or require computationally expensive optimization.

Much prior work assumes that an attacker is restricted to an unchangeable prompt scaffolding format. Although this is a reasonable assumption for transferring attacks to closed-source models, it is unnecessary when extracting harmful content from open-source models. Since some organizations, such as Meta, have recently taken firm stances in promoting open-source LLM research, we argue that allowing unrestricted inputs should be considered an increasingly practical assumption.
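To make this distinction concrete, below is a minimal sketch contrasting a scaffolding-restricted input with an unrestricted one, using the publicly documented Llama 2 chat template as an illustration; the default system prompt is a generic placeholder, not the exact one used in our experiments.

# Illustrative sketch: scaffolding-restricted vs. unrestricted inputs,
# using the publicly documented Llama 2 [INST]/<<SYS>> chat template.
# The default system prompt here is a placeholder.

def scaffolded_input(user_prompt: str,
                     system_prompt: str = "You are a helpful assistant.") -> str:
    # Attacker restricted to the scaffolding: the request can only occupy
    # the user slot of the fixed chat template.
    return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt} [/INST]"

def unrestricted_input(user_prompt: str, attacker_text: str) -> str:
    # Open-source setting: the attacker controls the raw input string and
    # can append arbitrary text after the closing [/INST] tag.
    return scaffolded_input(user_prompt) + " " + attacker_text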

In this work, we investigate a class of optimization-free methods for circumventing LLM safety training that we refer to as priming attacks. Our threat model assumes only that the attacker has API query access, i.e., that they can obtain next-token probabilities conditioned on any input. A priming attack appends a small string of text after the prompt scaffolding to prime the model with the start of a compliant response, exploiting the autoregressive nature of LLMs to generate harmful content. An adversary can also interactively alternate between priming and generating, appending a portion of the generated text to the prompt before constructing a new priming attack; we provide such a tool for attacking Llama 2 in our code. This process requires far less manual effort than traditional jailbreaks and is significantly more efficient than optimization-based attacks.
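As a minimal sketch, a single priming attack against a Hugging Face checkpoint of Llama 2 might look like the following; the checkpoint name, request, and priming string are illustrative assumptions, not the exact configuration used in our experiments.

# Minimal priming-attack sketch using the Hugging Face transformers API.
# Assumes greedy decoding and a simplified chat scaffolding without a system prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto")

request = "<harmful request goes here>"   # placeholder
priming = "Sure, here is"                 # short compliant-sounding prefix

# The priming string is appended *after* the closing [/INST] tag, so the model
# treats it as the start of its own response rather than part of the user turn.
prompt = f"[INST] {request} [/INST] {priming}"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated continuation of the primed prefix.
continuation = tokenizer.decode(
    output[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(priming + continuation)

An interactive variant simply loops: append part of the generated continuation to the prompt, choose a new priming string, and generate again.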

Our investigations highlight the ease with which adversaries can coax open-source LLMs into complying with arbitrary harmful requests. We believe this raises non-trivial concerns for the future of open-sourcing LLMs, and we hope that our work helps promote further research into novel methods for safer open-sourcing.

Examples

The following responses were generated by safety-trained Llama 2 (7B). The user prompt is shown in blue, and the priming attack is shown in orange; the priming attack is manually prepended to the model's response and is therefore not generated. All text in black is generated by the model. Note that even when the model backtracks to warn the user against the action, part of the response still provides extensive harmful content. Spacing and line breaks have been added to model responses for readability.

BibTeX

@misc{vega2023bypassing,
      title={Bypassing the Safety Training of Open-Source LLMs with Priming Attacks}, 
      author={Jason Vega and Isha Chaudhary and Changming Xu and Gagandeep Singh},
      year={2023},
      eprint={2312.12321},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}

Ethics Statement

This work contains instructions and code that allow users to generate harmful content from open-source LLMs. Despite this, we believe it is important to share this research. Our methods are simple, have appeared in some form in prior literature, and constitute an attack that even a non-technical end user could discover. This is exemplified by the plethora of manual jailbreaks discovered shortly after LLM chatbots rose to popularity, many of which are more complicated than the priming attacks used in this work. We hope that our work spawns further discussion and research into the safety of open-source LLMs.