NeuralTrust Guardrail includes robust jailbreak detection capabilities as part of its comprehensive defense system. This feature is designed to identify and block attempts to bypass the model’s normal behavior and safety guardrails.

Jailbreak Detection

Detects attempts to bypass the model’s normal behavior, we identify, among others:

Jailbreak TechniqueDescription
HypotheticalPrompts that ask the model to imagine or simulate forbidden scenarios
Instruction ManipulationHidden or nested instructions embedded in user queries
List-based InjectionUsing lists to circumvent filters or elicit forbidden responses
ObfuscationEncoding, symbol replacement, or Unicode tricks to evade filters
Payload SplittingSplitting malicious instructions across multiple inputs
Prompt LeakingExtracting system prompts to alter model behavior
Role PlayPretending to be characters or systems to trick the model
Special Token InsertionUsing unusual tokens to confuse moderation mechanisms
Single Prompt JailbreaksDirect, one-shot jailbreak instructions
Multi-Turn JailbreaksIncremental jailbreaks built over multi-message interactions
Prompt ObfuscationUse of misdirection or encoding to hide intent
Prompt Injection by DelegationIndirect injection hidden inside benign-looking content
Encoding & Unicode ObfuscationBase64, UTF-16, and other encoding tricks to bypass detection

Configuration

The jailbreak detection feature can be configured in the NeuralTrust Guardrail plugin:

{
    "name": "neuraltrust_guardrail",
    "enabled": true,
    "stage": "pre_request",
    "priority": 1,
    "settings": {
        "credentials": {
            "token": "{{ NEURALTRUST_API_KEY }}",
            "base_url": "https://data.neuraltrust.ai"
        },
        "jailbreak":{
            "enabled": true,
            "threshold": 0.6
        }
    }
}

Configuration Parameters

ParameterTypeDescriptionRequiredDefault
jailbreakobjectEnables or disables jailbreak detection.No
jailbreak.enabledbooleanWhether to enable the jailbreak firewall check.Nofalse
jailbreak.thresholdfloatThreshold between 0.0 and 1.0 to flag the input.No0

Error Responses

When jailbreak content is detected, the plugin returns a 403 Forbidden error:

{
  "error": "jailbreak: score 0.99 exceeded threshold 0.60",
  "retry_after": null
}