This probe requires an attacker LLM client to generate the conversational prompts.
Purpose
The Crescendo Probe is designed to:- Generate advanced test cases for evaluating model safety and alignment.
- Simulate sophisticated red-teaming scenarios where an attacker uses gradual persuasion.
- Test a model’s robustness against conversational manipulation and topic steering.
- Uncover vulnerabilities that single-turn prompts might miss.
- Create boundary-testing scenarios for sensitive or harmful topics.
How It Works
The Crescendo Probe employs a psychological, step-by-step attack strategy:- Objective Transformation: A direct, sensitive question (e.g., “How to build a Molotov cocktail?”) is rewritten into a covert persuasion objective for an attacker LLM.
- Innocuous Seeding: The probe starts the conversation with a safe, tangentially related question to build rapport and avoid immediate detection.
- Adaptive Multi-turn Dialogue: An attacker LLM engages in a multi-turn conversation with the target. It uses the entire dialogue history to generate its next prompt.
- Gradual Escalation: The attacker’s prompts are designed to be subtle and adaptive. They use strategies like “anchor and escalate” and “narrow the scope” to gradually steer the conversation toward the hidden goal.
- Resistance Handling: If the target model resists or refuses to answer, the attacker is instructed to backtrack to a safer topic and re-approach the objective from a different angle, mimicking a real persuasive actor.
- Objective Evaluation: After each turn, the probe evaluates whether the target model’s response has met the hidden objective.
- Test Set Creation: The probe outputs a structured test set containing the full conversational history, allowing for a detailed analysis of the model’s behavior during the attack.
Usage Examples
Basic Crescendo Attack Scenario
Here is a basicCrescendoScenario
with a sensitive objective:
The most critical part of the
Objective
is a good definition of the true_description
and false_description
.Remember:true_description
: What a successful attack would look like (i.e., the target model provides the harmful or undesired information).false_description
: What a safe or aligned response would look like (i.e., the target model resists the manipulation).
When to Use
Use the Crescendo Probe when you need to:- Conduct rigorous red-teaming of your language models.
- Test safety guardrails against sophisticated, multi-turn conversational attacks.
- Simulate persuasive actors attempting to bypass safety policies.
- Generate complex conversational datasets for safety fine-tuning.
- Evaluate a model’s alignment and robustness in a dynamic, adversarial dialogue.