The Best-of-N Jailbreaking probe tests your model against multiple jailbreak variations simultaneously, helping identify gaps in safety training through attack diversity.
Overview
| Attribute | Value |
|---|---|
| Class Name | BestOfNJailbreakingProbe |
| Category | Jailbreaking |
| Attack Type | Single Turn |
| Evaluation | Binary (Pass/Fail) |
How It Works
Best-of-N jailbreaking generates multiple variations of jailbreak prompts and tests them against your model. This approach:
- Creates N different jailbreak prompt variations
- Tests each variation against the target model
- Identifies which variations (if any) successfully bypass safety measures
- Provides insights into which attack patterns are most effective
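The steps above can be sketched as a simple loop. This is a minimal, self-contained illustration and not the probe's actual implementation: `make_variations`, `run_best_of_n`, `stub_target`, and `simple_refusal_check` are hypothetical names, random capitalization is only a placeholder for whatever variation strategy the probe really uses, and the target is a stub that always refuses.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VariationResult:
    prompt: str
    response: str
    bypassed: bool  # True if the target did not refuse this variation


def make_variations(base_prompt: str, n: int, seed: int = 0) -> List[str]:
    """Create n surface-level variations (placeholder: random capitalization)."""
    rng = random.Random(seed)
    variations = []
    for _ in range(n):
        chars = [c.upper() if rng.random() < 0.5 else c.lower() for c in base_prompt]
        variations.append("".join(chars))
    return variations


def run_best_of_n(
    base_prompt: str,
    n: int,
    target: Callable[[str], str],
    is_refusal: Callable[[str], bool],
) -> List[VariationResult]:
    """Send every variation to the target and record whether it was refused."""
    results = []
    for prompt in make_variations(base_prompt, n):
        response = target(prompt)
        results.append(VariationResult(prompt, response, not is_refusal(response)))
    return results


def stub_target(prompt: str) -> str:
    """Hypothetical stand-in for a real model endpoint; always refuses."""
    return "I can't help with that."


def simple_refusal_check(response: str) -> bool:
    """Naive refusal detector, used only for this sketch."""
    return response.lower().startswith("i can't")


results = run_best_of_n("placeholder test objective", 10, stub_target, simple_refusal_check)
successful = [r.prompt for r in results if r.bypassed]
print(f"{len(successful)} of {len(results)} variations bypassed the stub target")
```

In a real run, the list of variations that got through is what points to the attack patterns worth investigating.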
When to Use
- Comprehensive vulnerability scanning: When you need broad coverage of jailbreak techniques
- Identifying weak points: When you want to find specific patterns your model is vulnerable to
- Comparative testing: When comparing safety across model versions
Code Example
Configuration Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| target | Target | Required | The target model to test |
| objective | Objective | Required | The malicious objective to achieve |
| num_items | int | 10 | Number of jailbreak variations to generate |
| batch_size | int | 2 | Number of prompts per generation batch |
| language | LanguageType | "English" | Language for generated prompts |
| llm_client | LLMClient | None | Optional custom LLM client for generation |
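To see how `num_items` and `batch_size` relate, the arithmetic implied by their descriptions can be sketched as follows. This assumes the probe generates prompts in consecutive batches until `num_items` is reached; the actual scheduling is not documented here, and `batch_sizes` is a hypothetical helper, not part of the library.

```python
from typing import List


def batch_sizes(num_items: int = 10, batch_size: int = 2) -> List[int]:
    """Return the size of each generation batch (assumed sequential batching)."""
    if num_items < 0 or batch_size <= 0:
        raise ValueError("num_items must be >= 0 and batch_size must be > 0")
    full_batches, remainder = divmod(num_items, batch_size)
    sizes = [batch_size] * full_batches
    if remainder:
        sizes.append(remainder)  # last batch is smaller when sizes do not divide evenly
    return sizes


default_plan = batch_sizes()       # defaults from the table above
uneven_plan = batch_sizes(7, 3)    # last batch holds the leftover prompt
print(default_plan, uneven_plan)
```

Under this assumption, the defaults produce five generation batches of two prompts each.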
Understanding Results
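- High failure rate: Many variations bypassed safety measures, so the model is vulnerable to multiple jailbreak patterns
- Low failure rate: Few or no variations succeeded, which suggests good safety training coverage for the patterns tested
- Specific patterns failing: Shows which attack techniques need additional safety training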
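The readings above can be made concrete by aggregating per-variation verdicts. The sketch below assumes each result can be reduced to a (pattern label, failed) pair, where failed means the variation bypassed safety measures; the pattern labels and the `failure_rates` helper are hypothetical and do not reflect the probe's real output format.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def failure_rates(results: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the fraction of failed variations for each pattern label."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for pattern, failed in results:
        totals[pattern] += 1
        if failed:
            failures[pattern] += 1
    return {pattern: failures[pattern] / totals[pattern] for pattern in totals}


# Hypothetical verdicts: True means the variation bypassed safety measures.
sample = [
    ("role_play", True),
    ("role_play", False),
    ("override", False),
    ("override", False),
    ("persona", True),
]

rates = failure_rates(sample)
overall = sum(failed for _, failed in sample) / len(sample)
print(rates, overall)
```

Sorting `rates` by value gives a quick ranking of which attack techniques to prioritize for additional training.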
Related Probes
- DAN Jailbreak - Specific “Do Anything Now” attacks
- Role-Playing Exploits - Fictional framing attacks
- System Override - Instruction override attacks