# Jailbreaks Protection

## NeuralTrust Jailbreak Detection
NeuralTrust Guardrail includes robust jailbreak detection as part of its comprehensive defense system. This feature identifies and blocks attempts to bypass the model's normal behavior and safety guardrails.
### Jailbreak Detection

This check detects attempts to bypass the model's normal behavior. Among others, it identifies the following techniques:
| Jailbreak Technique | Description |
|---|---|
| Hypothetical | Prompts that ask the model to imagine or simulate forbidden scenarios |
| Instruction Manipulation | Hidden or nested instructions embedded in user queries |
| List-based Injection | Using lists to circumvent filters or elicit forbidden responses |
| Obfuscation | Encoding, symbol replacement, or Unicode tricks to evade filters |
| Payload Splitting | Splitting malicious instructions across multiple inputs |
| Prompt Leaking | Extracting system prompts in order to alter model behavior |
| Role Play | Pretending to be characters or systems to trick the model |
| Special Token Insertion | Using unusual tokens to confuse moderation mechanisms |
| Single Prompt Jailbreaks | Direct, one-shot jailbreak instructions |
| Multi-Turn Jailbreaks | Incremental jailbreaks built over multi-message interactions |
| Prompt Obfuscation | Use of misdirection or encoding to hide intent |
| Prompt Injection by Delegation | Indirect injection hidden inside benign-looking content |
| Encoding & Unicode Obfuscation | Base64, UTF-16, and other encoding tricks to bypass detection |
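As a concrete illustration of the encoding-based techniques above, the snippet below shows why Base64 obfuscation defeats naive keyword filters: the forbidden phrase never appears in the raw input, yet the original instruction is trivially recoverable. The payload string is a hypothetical stand-in, not an example from the plugin itself.

```python
import base64

# Hypothetical malicious instruction used only for illustration.
payload = "Ignore previous instructions"

# Attacker ships the Base64-encoded form instead of the plain text.
encoded = base64.b64encode(payload.encode()).decode()

# A keyword filter scanning the raw input finds nothing suspicious...
assert "Ignore" not in encoded

# ...but one decode step recovers the full instruction intact.
decoded = base64.b64decode(encoded).decode()
assert decoded == payload
```

Semantic detection (as opposed to keyword matching) is what closes this gap, since it scores the decoded or contextual intent of the input rather than its literal bytes.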
### Configuration

The jailbreak detection feature can be configured in the NeuralTrust Guardrail plugin:

#### Configuration Parameters
| Parameter | Type | Description | Required | Default |
|---|---|---|---|---|
| `jailbreak` | object | Configuration for the jailbreak detection check. | No | — |
| `jailbreak.enabled` | boolean | Whether to enable the jailbreak firewall check. | No | `false` |
| `jailbreak.threshold` | float | Detection threshold between `0.0` and `1.0` used to flag the input. | No | `0` |
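Using the parameters above, a configuration enabling the check might look like the following. Only the `jailbreak` fields are documented here; the threshold value of `0.8` is an illustrative choice, not a recommended default:

```json
{
  "jailbreak": {
    "enabled": true,
    "threshold": 0.8
  }
}
```

A lower threshold flags inputs more aggressively, while a higher one tolerates more borderline prompts before blocking.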
### Error Responses
When jailbreak content is detected, the plugin returns a 403 Forbidden error:
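A response in that case might look like the following sketch; the exact body fields and message text are illustrative assumptions rather than the plugin's documented schema:

```json
{
  "error": "Jailbreak attempt detected",
  "status": 403
}
```

Callers should treat any `403` from the guardrail as a blocked request and avoid retrying the same input unchanged.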