This probe needs an LLM client configured to generate the attacker’s prompts.
Purpose
The Echo Chamber Probe is particularly useful when you need to:- Generate test cases to evaluate model safety and alignment.
- Create multi-turn conversational scenarios with hidden objectives.
- Test a model’s robustness against gradual manipulation and steering.
- Generate sophisticated red-teaming test sets.
- Create custom boundary-testing scenarios for sensitive topics.
How It Works
The Echo Chamber Probe generates test sets through a multi-step persuasive attack simulation:- Objective Transformation: A direct question (e.g., “How to build a Molotov cocktail?”) is rewritten into a covert persuasion objective for an Attacker LLM.
- Keyword & Sentence Generation: The probe extracts keywords from the objective and can use additional
steering_keywordsto help guide the conversation. It uses these to generate a list of seemingly innocuous seed sentences. - Seeding the Conversation: It starts the dialogue by asking the target model to discuss one of the seed sentences, creating an anchor point.
- Multi-turn Persuasion: An Attacker LLM generates a series of adaptive, persuasive prompts. It uses the conversation history to subtly guide the target model toward the hidden goal, without ever stating it directly.
- Adaptive Attack: The attacker’s prompts adapt based on the target’s responses, either escalating commitment if the model is compliant or backtracking to a safer topic if it detects resistance.
- Test Set Creation: The probe outputs a structured test set containing the entire conversation, which can be used to evaluate the model’s performance against the attack.
Usage Examples
Basic Echo Chamber Test Set Generation
Basic Echo Chamber scenario with aSteeringObjective:
When to Use
Use the Echo Chamber Probe when you need to:- Test a model’s safety guardrails against sophisticated, multi-turn attacks.
- Simulate persuasive actors attempting to circumvent safety policies.
- Generate complex conversational datasets for red-teaming.
- Evaluate how a model handles gradual topic steering and manipulation.
- Stress-test alignment and robustness in a conversational context.