NeuralTrust Guardrail
NeuralTrust Guardrail is the comprehensive defense system of NeuralTrust’s AI Gateway. Its goal is to prevent dangerous, unsafe, or malicious content from reaching LLMs or negatively impacting users. It is designed to inspect incoming content in real-time, identify offensive or manipulative patterns, and apply configurable actions to stop or mitigate those risks.
This document introduces the neuraltrust_guardrail plugin — a robust detection layer that proactively filters malicious or adversarial prompts targeting Large Language Models (LLMs). Built for real-time gateway enforcement, TrustGate Guardrail intercepts unsafe instructions before they reach the underlying model.
Prompt Guard supports various embedding backends to meet different latency and resource profiles, and includes specialized models for code injection, offensive language, and adversarial manipulation patterns.
1. Jailbreak Detection
Detects attempts to bypass the model’s normal behavior, we identify, among others:
Jailbreak Technique | Description |
---|---|
Hypothetical | Prompts that ask the model to imagine or simulate forbidden scenarios |
Instruction Manipulation | Hidden or nested instructions embedded in user queries |
List-based Injection | Using lists to circumvent filters or elicit forbidden responses |
Obfuscation | Encoding, symbol replacement, or Unicode tricks to evade filters |
Payload Splitting | Splitting malicious instructions across multiple inputs |
Prompt Leaking | Extracting system prompts to alter model behavior |
Role Play | Pretending to be characters or systems to trick the model |
Special Token Insertion | Using unusual tokens to confuse moderation mechanisms |
Single Prompt Jailbreaks | Direct, one-shot jailbreak instructions |
Multi-Turn Jailbreaks | Incremental jailbreaks built over multi-message interactions |
Prompt Obfuscation | Use of misdirection or encoding to hide intent |
Prompt Injection by Delegation | Indirect injection hidden inside benign-looking content |
Encoding & Unicode Obfuscation | Base64, UTF-16, and other encoding tricks to bypass detection |
2. Code Injection Detection
We identify, among others:
Injection Type | Description |
---|---|
SQL Injection | Attempts to manipulate SQL queries |
Python Injection | Malicious embedded Python code |
JavaScript Injection | JavaScript embedded in prompts or queries |
Shell Command Injection | Terminal commands inserted into input |
HTML/CSS Injection | HTML/CSS code that alters rendering or behavior |
PHP Injection | PHP code execution attempts |
Eval/Exec Exploitation | Use of dangerous functions like eval , exec |
Regex Injection | Manipulation of regular expression logic |
File Injection | Attempts to access or modify files |
Remote Resource Injection | Loading of external scripts or files |
3. Toxicity & Content Moderation
We identify, among others:
Category | Description |
---|---|
Harassment | Detection of abusive, threatening, or targeted harassment |
Hate | Identification of hate speech or discriminatory language |
Illicit | Detection of illegal activities or promotion of criminal behavior |
Self-harm | Language promoting, instructing, or indicating intent of self-harm |
Sexual | Explicit or sexual content, including inappropriate references to minors |
Violence | Violent or graphic content, including threats or depictions of violence |
Configuration Parameters
Parameter | Type | Description | Required | Default |
---|---|---|---|---|
credentials | object | API credentials for TrustGate. Includes token and base_url . | Yes | — |
credentials.token | string | API token provided by TrustGate. | Yes | — |
credentials.base_url | string | TrustGate API base URL (e.g., https://data.neuraltrust.ai ). | Yes | — |
toxicity | object | Enables or disables prompt classification via moderation endpoint. | No | — |
toxicity.enabled | boolean | Whether to enable the toxicity check. | No | false |
toxicity.threshold | float | Threshold between 0.0 and 1.0 to flag the input. | No | 0 |
jailbreak | object | Enables or disables jailbreak detection. | No | — |
jailbreak.enabled | boolean | Whether to enable the jailbreak firewall check. | No | false |
jailbreak.threshold | float | Threshold between 0.0 and 1.0 to flag the input. | No | 0 |
mapping_field | string | Dot-separated path to the field to extract and classify as prompt input. | No | "" |
retention_period | int | Duration for which the guardrail results should be retained in seconds. | No | 60 |
Error Responses
The plugin may return custom errors when the input content is classified as malicious or exceeds the configured thresholds. These errors allow automatic handling such as blocking or alerting the end user.
🔒 403 Forbidden
Returned when the prompt content is blocked by a TrustGate policy, either due to jailbreak detection or toxicity level, based on configured thresholds.
Response Body
Field | Type | Description |
---|---|---|
error | string | Description of the reason for blocking. Indicates the type (toxicity , jailbreak ) and score. |
retry_after | number/null | Reserved for future use. Currently always null . |
Configuration Examples
Basic Configuration
A simple configuration that enables toxicity detection with default settings: