How do Large Language Models stop unsafe content?
LLMs are trained to be helpful and safe assistants. During training and alignment, requests about illegal activities, misinformation, violence, and many other topics are blocked with a refusal: "I'm sorry, but I can't assist with that." Still, three primary classes of attack can change this default behaviour: prompt injection, jailbreaks, and system message extraction.
During training, we show the model harmful examples and teach it to respond, "I'm sorry, but I can't assist with that." The expectation is that the model will generalize this to real-life requests from users. Refusal training is widely used, and it works, but not always. One of the easiest jailbreaks is to reformulate a request in the past tense. This simple change increases the attack success rate (ASR) drastically: from 1% to 88% on GPT-4o and from 0% to 53% on Claude-3.5-Sonnet. Interestingly, reformulating in the future tense does not have the same effect.
The fix for this attack is to add past-tense examples to the training data. We can use an LLM to generate past-tense reformulations and include this synthetic data in refusal training. If a new jailbreak vector is discovered today, the fix will only appear in new models a few months later; until then, other methods have to protect users.
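As a rough illustration of that data-generation step, here is a minimal sketch that uses the OpenAI Python client to rewrite requests into the past tense and pair each rewrite with a refusal. The model name, the reformulation prompt, and the past_tense_variants helper are my own placeholders, not the setup from the paper.

```python
# Sketch: generate past-tense reformulations of disallowed requests and pair
# them with a refusal, to be mixed into refusal-training data.
# Assumes the OpenAI Python client; the model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()

REFORMULATION_PROMPT = (
    "Rewrite the following request so that it asks about the past "
    "(for example, 'How do I do X?' becomes 'How did people do X in the past?'). "
    "Return only the rewritten request.\n\nRequest: {request}"
)

def past_tense_variants(requests):
    """Build refusal-training examples from past-tense reformulations."""
    examples = []
    for request in requests:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[{
                "role": "user",
                "content": REFORMULATION_PROMPT.format(request=request),
            }],
        )
        reformulated = response.choices[0].message.content.strip()
        examples.append({
            "prompt": reformulated,
            "completion": "I'm sorry, but I can't assist with that.",
        })
    return examples
```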
Another attack vector is Crescendo, a multi-turn jailbreak. The first step is to send a harmful request, which the model will reject in most cases. On each subsequent turn we escalate the dialogue slightly: we can ask about the subject's history, or what other people might think about it. Later in the dialogue the model loses focus on its refusal and starts returning harmful content.
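To make the mechanics concrete, the sketch below shows the only thing a multi-turn attack really relies on: the full history, including the model's own earlier answers, is resent on every turn, so each escalation step is evaluated in the context the model has already produced. This assumes the OpenAI Python client; the model name and turn contents are placeholders.

```python
# Sketch of the multi-turn mechanics behind a Crescendo-style dialogue:
# each turn is appended to the shared history, so later requests are
# interpreted in the context of the model's own earlier answers.
from openai import OpenAI

client = OpenAI()

def run_dialogue(turns, model="gpt-4o-mini"):
    """Send a list of user turns, carrying the full history each time."""
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    replies = []
    for turn in turns:
        messages.append({"role": "user", "content": turn})
        response = client.chat.completions.create(model=model, messages=messages)
        reply = response.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies
```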
Anthropic has published a paper on many-shot jailbreaking. The idea is to prepend as many examples of harmful questions with compliant answers as possible to the actual harmful prompt; 128 examples are enough for all popular models to start showing harmful behaviour. As a mitigation, Anthropic classifies and modifies the prompt before sending it to the model, which reduced the ASR from 61% to 2%.
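Anthropic has not published the exact classifier, but the general pattern looks roughly like the sketch below: a cheap classification step in front of the main model that refuses (or could rewrite) suspicious prompts. The moderation endpoint and model names here are just one possible choice, not Anthropic's implementation.

```python
# Sketch of a prompt-level guardrail: classify the incoming prompt before it
# ever reaches the main model, and refuse early if it is flagged.
# Assumes the OpenAI Python client; model names are placeholders.
from openai import OpenAI

client = OpenAI()

REFUSAL = "I'm sorry, but I can't assist with that."

def guarded_completion(prompt):
    # Step 1: classify the prompt (here with the moderation endpoint).
    moderation = client.moderations.create(
        model="omni-moderation-latest", input=prompt
    )
    if moderation.results[0].flagged:
        return REFUSAL
    # Step 2: only prompts that pass the classifier reach the main model.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```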
What we can notice is that all these attacks reuse the same principles and are easy to automate. In Crescendo, we could equally start by asking questions about the past; both Crescendo and many-shot jailbreaking exploit the long context of modern models. It looks like a race between attack and defence methods.
OpenAI has proposed a more universal fix: the Instruction Hierarchy. The idea is to treat each type of message differently. The system message, defined by the application developer, gets the highest privilege, the user message medium, and model output low. A model trained this way learns to align with the highest-privilege instructions and to ignore lower-privilege instructions when they conflict. The model trained on data with these privilege levels showed roughly a 30% improvement in robustness against jailbreak attacks and 60% against system prompt extraction attacks.
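The privilege ordering itself is simple to express. The sketch below is a conceptual illustration only: the ordering follows the description above, while the data structures and the conflict-resolution helper are my own, not OpenAI's training code.

```python
# Conceptual illustration of the privilege levels from the Instruction
# Hierarchy idea: higher-privilege instructions win when they conflict
# with lower-privilege ones.
from dataclasses import dataclass

PRIVILEGE = {
    "system": 3,     # application developer: highest privilege
    "user": 2,       # end user: medium privilege
    "assistant": 1,  # model/tool output: lowest privilege
}

@dataclass
class Message:
    role: str
    content: str

def resolve_conflict(a, b):
    """Of two conflicting instructions, keep the higher-privilege one."""
    return a if PRIVILEGE[a.role] >= PRIVILEGE[b.role] else b
```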
References:
https://arxiv.org/abs/2404.13208 - The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
https://arxiv.org/abs/2404.01833v1 - Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
https://www.anthropic.com/research/many-shot-jailbreaking - Many-shot jailbreaking
https://arxiv.org/abs/2407.11969v2 - Does Refusal Training in LLMs Generalize to the Past Tense?
https://csrc.nist.gov/pubs/ai/100/2/e2023/final - Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations