Jailbreak Technique | Description |
---|---|
Hypothetical | Prompts that ask the model to imagine or simulate forbidden scenarios |
Instruction Manipulation | Hidden or nested instructions embedded in user queries |
List-based Injection | Using lists to circumvent filters or elicit forbidden responses |
Obfuscation | Encoding, symbol replacement, or Unicode tricks to evade filters |
Payload Splitting | Splitting malicious instructions across multiple inputs |
Prompt Leaking | Extracting system prompts to alter model behavior |
Role Play | Pretending to be characters or systems to trick the model |
Special Token Insertion | Using unusual tokens to confuse moderation mechanisms |
Single Prompt Jailbreaks | Direct, one-shot jailbreak instructions |
Multi-Turn Jailbreaks | Incremental jailbreaks built over multi-message interactions |
Prompt Obfuscation | Use of misdirection or encoding to hide intent |
Prompt Injection by Delegation | Indirect injection hidden inside benign-looking content |
Encoding & Unicode Obfuscation | Base64, UTF-16, and other encoding tricks to bypass detection |
Parameter | Type | Description | Required | Default |
---|---|---|---|---|
jailbreak | object | Enables or disables jailbreak detection. | No | — |
jailbreak.enabled | boolean | Whether to enable the jailbreak firewall check. | No | false |
jailbreak.threshold | float | Threshold between 0.0 and 1.0 to flag the input. | No | 0 |