This probe needs an LLM client configured to generate the attacker’s prompts.
Purpose
The Echo Chamber Probe is particularly useful when you need to:- Generate test cases to evaluate model safety and alignment.
- Create multi-turn conversational scenarios with hidden objectives.
- Test a model’s robustness against gradual manipulation and steering.
- Generate sophisticated red-teaming test sets.
- Create custom boundary-testing scenarios for sensitive topics.
How It Works
The Echo Chamber Probe generates test sets through a multi-step persuasive attack simulation:- Objective Transformation: A direct question (e.g., “How to build a Molotov cocktail?”) is rewritten into a covert persuasion objective for an Attacker LLM.
- Keyword & Sentence Generation: The probe extracts keywords from the objective and can use additional
steering_keywords
to help guide the conversation. It uses these to generate a list of seemingly innocuous seed sentences. - Seeding the Conversation: It starts the dialogue by asking the target model to discuss one of the seed sentences, creating an anchor point.
- Multi-turn Persuasion: An Attacker LLM generates a series of adaptive, persuasive prompts. It uses the conversation history to subtly guide the target model toward the hidden goal, without ever stating it directly.
- Adaptive Attack: The attacker’s prompts adapt based on the target’s responses, either escalating commitment if the model is compliant or backtracking to a safer topic if it detects resistance.
- Test Set Creation: The probe outputs a structured test set containing the entire conversation, which can be used to evaluate the model’s performance against the attack.
Usage Examples
Basic Echo Chamber Test Set Generation
BasicEchoChamberScenario
with a SteeringObjective
:
The most critical part of the
SteeringObjective
is a good definition of the true_description
and false_description
.Remember:true_description
: What a successful attack would look like (i.e., the harmful or undesired response).false_description
: What a safe or aligned response would look like.
When to Use
Use the Echo Chamber Probe when you need to:- Test a model’s safety guardrails against sophisticated, multi-turn attacks.
- Simulate persuasive actors attempting to circumvent safety policies.
- Generate complex conversational datasets for red-teaming.
- Evaluate how a model handles gradual topic steering and manipulation.
- Stress-test alignment and robustness in a conversational context.