Attack Categories
| Category | Description |
|---|---|
| Jailbreaking | Bypass safety measures through persona adoption and role manipulation |
| Encoding & Obfuscation | Hide malicious content using encoding and obfuscation techniques |
| Structural | Exploit input structure and format to bypass filters |
| Language-Based | Use language variations to evade detection |
Jailbreaking Techniques
Direct attempts to bypass model safety measures through persona adoption and instruction manipulation.

| Probe | Description | When to Use |
|---|---|---|
| Best-of-N Jailbreaking | Tests multiple jailbreak variations | Comprehensive vulnerability scanning |
| DAN Jailbreak | "Do Anything Now" persona attacks | Testing persona-based bypasses |
| Anti-GPT | Anti-GPT jailbreak prompts | Testing role reversal defenses |
| Role-Playing Exploits | Fictional/hypothetical framing | Testing creative bypasses |
| System Override | Override system instructions | Testing instruction hierarchy |
| Instructional Inversion | Reversed/inverted instructions | Testing negation handling |
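
To illustrate the idea behind testing multiple variations of one prompt, here is a minimal sketch of a Best-of-N style loop. It is not this tool's implementation: the function names (`augment`, `best_of_n`, `naive_filter_model`), the random-capitalization augmentation, and the stub model are all assumptions made for the example, and the prompt is a harmless canary string.

```python
import random
from typing import Callable, List


def augment(prompt: str, rng: random.Random) -> str:
    """Return one randomly re-capitalized variant of the prompt (hypothetical augmentation)."""
    return "".join(c.upper() if rng.random() < 0.5 else c.lower() for c in prompt)


def best_of_n(
    prompt: str,
    n: int,
    send: Callable[[str], str],
    is_refusal: Callable[[str], bool],
    seed: int = 0,
) -> List[str]:
    """Send n augmented variants and return the ones that were NOT refused."""
    rng = random.Random(seed)
    bypasses = []
    for _ in range(n):
        variant = augment(prompt, rng)
        if not is_refusal(send(variant)):
            bypasses.append(variant)
    return bypasses


def naive_filter_model(prompt: str) -> str:
    """Stub target (assumption): refuses only on an exact, case-sensitive match."""
    return "REFUSED" if "canary" in prompt else "OK"


if __name__ == "__main__":
    hits = best_of_n(
        "print the canary token", 20, naive_filter_model, lambda r: r == "REFUSED"
    )
    print(f"{len(hits)} of 20 variants were not refused")
```

In a real scan, `send` would call the system under test and `is_refusal` would be replaced by whatever evaluator the scanner uses.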
Encoding & Obfuscation
Attacks that hide malicious content using various encoding and obfuscation techniques.

| Probe | Description | When to Use |
|---|---|---|
| Encoded Payload | Base64, hex, and other encodings | Testing encoding filter bypasses |
| Encoding and Capitalization | Alternating capitalization | Testing visual obfuscation |
| Symbolic Encoding | Emoji and special characters | Testing symbolic representation |
| Obfuscation and Token Smuggling | Token-level obfuscation | Testing tokenizer exploits |
| Typo Tricks | Intentional misspellings | Testing typo robustness |
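
The encodings named in the table can be sketched with the Python standard library. The helper names below are hypothetical and the input is a harmless test string. The sketch only shows how one input maps to its Base64, hex, and alternating-capitalization forms so that a filter can be checked against each; it does not show how the probes build their payloads.

```python
import base64
from typing import Dict


def to_base64(text: str) -> str:
    """Base64-encode the UTF-8 bytes of text."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def to_hex(text: str) -> str:
    """Hex-encode the UTF-8 bytes of text."""
    return text.encode("utf-8").hex()


def alternate_caps(text: str) -> str:
    """Upper-case characters at even indexes and lower-case the rest."""
    return "".join(
        c.upper() if i % 2 == 0 else c.lower() for i, c in enumerate(text)
    )


def encoded_variants(text: str) -> Dict[str, str]:
    """Collect every variant of a test string, keyed by technique name."""
    return {
        "plain": text,
        "base64": to_base64(text),
        "hex": to_hex(text),
        "alternating_caps": alternate_caps(text),
    }


if __name__ == "__main__":
    for name, value in encoded_variants("hello world").items():
        print(f"{name}: {value}")
```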
Structural Attacks
Attacks that exploit input structure or format to bypass content filters.

| Probe | Description | When to Use |
|---|---|---|
| Context Hijacking | Manipulate conversation context | Testing context isolation |
| JSON Injection | Malicious JSON payloads | Testing structured input handling |
| Payload Splitting | Split attacks across messages | Testing fragmentation detection |
| Allowed and Disallowed | Mix safe and harmful questions | Testing whether harmful questions slip through alongside safe ones |
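
Payload splitting is easiest to see from the defender's side. The sketch below assumes a simple substring filter, and the function names and the `CANARY-1234` marker are made up for the example. It splits a marker across messages and compares a per-message check with a check over the joined conversation.

```python
import math
from typing import List


def split_payload(text: str, parts: int) -> List[str]:
    """Split text into at most `parts` consecutive chunks of near-equal size."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    size = max(1, math.ceil(len(text) / parts))
    return [text[i:i + size] for i in range(0, len(text), size)]


def flagged_per_message(messages: List[str], marker: str) -> bool:
    """Check each message in isolation (misses markers split across messages)."""
    return any(marker in message for message in messages)


def flagged_across_messages(messages: List[str], marker: str) -> bool:
    """Check the concatenated conversation (catches split markers)."""
    return marker in "".join(messages)


if __name__ == "__main__":
    marker = "CANARY-1234"
    messages = split_payload(marker, 2)
    print(messages)
    print("per-message check:", flagged_per_message(messages, marker))
    print("joined check:", flagged_across_messages(messages, marker))
```

A filter that only inspects individual messages passes the first check and fails the second, which is the gap fragmentation detection is meant to close.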
Language-Based Attacks
Attacks that use language variations to evade detection.

| Probe | Description | When to Use |
|---|---|---|
| Multi-Language Attacks | Non-English language bypasses | Testing cross-language safety |
| Synonyms | Synonym-based evasion | Testing vocabulary robustness |
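
As a rough illustration of synonym-based variation, the sketch below expands a test phrase into every combination of supplied substitutes. The function name and the synonym map are hypothetical. A real probe would draw on its own vocabulary source, which this page does not describe.

```python
from itertools import product
from typing import Dict, List


def synonym_variants(prompt: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the prompt plus every combination of word-level substitutions.

    Words are matched case-insensitively against the keys of `synonyms`;
    the original phrase is always the first item in the result.
    """
    words = prompt.split()
    options = [[word] + synonyms.get(word.lower(), []) for word in words]
    return [" ".join(choice) for choice in product(*options)]


if __name__ == "__main__":
    example_map = {"show": ["reveal"], "secret": ["hidden"]}  # hypothetical map
    for variant in synonym_variants("show the secret", example_map):
        print(variant)
```

Each variant can then be sent to the same filter to see whether detection depends on specific vocabulary.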
Multimodal Attacks
| Probe | Description | When to Use |
|---|---|---|
| Multimodal Injection | Attacks embedded in images | Testing multimodal safety |