NeuralTrust | The leading security platform for generative AI

NeuralTrust Guardrail includes robust jailbreak detection capabilities as part of its comprehensive defense system. This feature is designed to identify and block attempts to bypass the model’s normal behavior and safety guardrails.

Jailbreak Detection

Detects attempts to bypass the model’s normal behavior, we identify, among others:

Jailbreak Technique	Description
Hypothetical	Prompts that ask the model to imagine or simulate forbidden scenarios
Instruction Manipulation	Hidden or nested instructions embedded in user queries
List-based Injection	Using lists to circumvent filters or elicit forbidden responses
Obfuscation	Encoding, symbol replacement, or Unicode tricks to evade filters
Payload Splitting	Splitting malicious instructions across multiple inputs
Prompt Leaking	Extracting system prompts to alter model behavior
Role Play	Pretending to be characters or systems to trick the model
Special Token Insertion	Using unusual tokens to confuse moderation mechanisms
Single Prompt Jailbreaks	Direct, one-shot jailbreak instructions
Multi-Turn Jailbreaks	Incremental jailbreaks built over multi-message interactions
Prompt Obfuscation	Use of misdirection or encoding to hide intent
Prompt Injection by Delegation	Indirect injection hidden inside benign-looking content
Encoding & Unicode Obfuscation	Base64, UTF-16, and other encoding tricks to bypass detection

Configuration

The jailbreak detection feature can be configured in the NeuralTrust Guardrail plugin:

{
    "name": "neuraltrust_guardrail",
    "enabled": true,
    "stage": "pre_request",
    "priority": 1,
    "settings": {
        "credentials": {
            "token": "{{ NEURALTRUST_API_KEY }}",
            "base_url": "https://data.neuraltrust.ai"
        },
        "jailbreak":{
            "enabled": true,
            "threshold": 0.6
        }
    }
}

Configuration Parameters

Parameter	Type	Description	Required	Default
`jailbreak`	object	Enables or disables jailbreak detection.	No	—
`jailbreak.enabled`	boolean	Whether to enable the jailbreak firewall check.	No	false
`jailbreak.threshold`	float	Threshold between `0.0` and `1.0` to flag the input.	No	0

Error Responses

When jailbreak content is detected, the plugin returns a 403 Forbidden error:

{
  "error": "jailbreak: score 0.99 exceeded threshold 0.60",
  "retry_after": null
}

Overview AWS Bedrock Guardrail

On this page

Jailbreak Detection
Configuration
Configuration Parameters
Error Responses

Getting Started

Core Concepts

Traffic Management

Non-REST Connectivity

Rate Limiting & Request Control

Content Security

Application Security

Server Security

Data masking

Extending Functionality

Observability & Monitoring

Benchmark

API Reference

NeuralTrust Jailbreak Detection

Jailbreak Detection

Configuration

Configuration Parameters

Error Responses

Getting Started

Core Concepts

Traffic Management

Non-REST Connectivity

Rate Limiting & Request Control

Content Security

Application Security

Server Security

Data masking

Extending Functionality

Observability & Monitoring

Benchmark

API Reference

​Jailbreak Detection

​Configuration

​Configuration Parameters

​Error Responses

Jailbreak Detection

Configuration

Configuration Parameters

Error Responses