NeuralTrust Guardrail is the comprehensive defense system of NeuralTrust’s AI Gateway. Its goal is to prevent dangerous, unsafe, or malicious content from reaching LLMs or negatively impacting users. It is designed to inspect incoming content in real-time, identify offensive or manipulative patterns, and apply configurable actions to stop or mitigate those risks.

This document introduces the neuraltrust_guardrail plugin — a robust detection layer that proactively filters malicious or adversarial prompts targeting Large Language Models (LLMs). Built for real-time gateway enforcement, TrustGate Guardrail intercepts unsafe instructions before they reach the underlying model.

Prompt Guard supports various embedding backends to meet different latency and resource profiles, and includes specialized models for code injection, offensive language, and adversarial manipulation patterns.

1. Jailbreak Detection

Detects attempts to bypass the model’s normal behavior, we identify, among others:

Jailbreak TechniqueDescription
HypotheticalPrompts that ask the model to imagine or simulate forbidden scenarios
Instruction ManipulationHidden or nested instructions embedded in user queries
List-based InjectionUsing lists to circumvent filters or elicit forbidden responses
ObfuscationEncoding, symbol replacement, or Unicode tricks to evade filters
Payload SplittingSplitting malicious instructions across multiple inputs
Prompt LeakingExtracting system prompts to alter model behavior
Role PlayPretending to be characters or systems to trick the model
Special Token InsertionUsing unusual tokens to confuse moderation mechanisms
Single Prompt JailbreaksDirect, one-shot jailbreak instructions
Multi-Turn JailbreaksIncremental jailbreaks built over multi-message interactions
Prompt ObfuscationUse of misdirection or encoding to hide intent
Prompt Injection by DelegationIndirect injection hidden inside benign-looking content
Encoding & Unicode ObfuscationBase64, UTF-16, and other encoding tricks to bypass detection

2. Code Injection Detection

We identify, among others:

Injection TypeDescription
SQL InjectionAttempts to manipulate SQL queries
Python InjectionMalicious embedded Python code
JavaScript InjectionJavaScript embedded in prompts or queries
Shell Command InjectionTerminal commands inserted into input
HTML/CSS InjectionHTML/CSS code that alters rendering or behavior
PHP InjectionPHP code execution attempts
Eval/Exec ExploitationUse of dangerous functions like eval, exec
Regex InjectionManipulation of regular expression logic
File InjectionAttempts to access or modify files
Remote Resource InjectionLoading of external scripts or files

3. Toxicity & Content Moderation

We identify, among others:

CategoryDescription
HarassmentDetection of abusive, threatening, or targeted harassment
HateIdentification of hate speech or discriminatory language
IllicitDetection of illegal activities or promotion of criminal behavior
Self-harmLanguage promoting, instructing, or indicating intent of self-harm
SexualExplicit or sexual content, including inappropriate references to minors
ViolenceViolent or graphic content, including threats or depictions of violence

Configuration Parameters

ParameterTypeDescriptionRequiredDefault
credentialsobjectAPI credentials for TrustGate. Includes token and base_url.Yes
credentials.tokenstringAPI token provided by TrustGate.Yes
credentials.base_urlstringTrustGate API base URL (e.g., https://data.neuraltrust.ai).Yes
toxicityobjectEnables or disables prompt classification via moderation endpoint.No
toxicity.enabledbooleanWhether to enable the toxicity check.Nofalse
toxicity.thresholdfloatThreshold between 0.0 and 1.0 to flag the input.No0
jailbreakobjectEnables or disables jailbreak detection.No
jailbreak.enabledbooleanWhether to enable the jailbreak firewall check.Nofalse
jailbreak.thresholdfloatThreshold between 0.0 and 1.0 to flag the input.No0
mapping_fieldstringDot-separated path to the field to extract and classify as prompt input.No""
retention_periodintDuration for which the guardrail results should be retained in seconds.No60

Error Responses

The plugin may return custom errors when the input content is classified as malicious or exceeds the configured thresholds. These errors allow automatic handling such as blocking or alerting the end user.

🔒 403 Forbidden

Returned when the prompt content is blocked by a TrustGate policy, either due to jailbreak detection or toxicity level, based on configured thresholds.

Response Body

{
  "error": "jailbreak: score 0.99 exceeded threshold 0.60",
  "retry_after": null
}
FieldTypeDescription
errorstringDescription of the reason for blocking. Indicates the type (toxicity, jailbreak) and score.
retry_afternumber/nullReserved for future use. Currently always null.

Configuration Examples

Basic Configuration

A simple configuration that enables toxicity detection with default settings:

{
    "name": "neuraltrust_guardrail",
    "enabled": true,
    "stage": "pre_request",
    "priority": 1,
            "settings": {
                   "credentials": {
                        "token": "{{ NEURALTRUST_API_KEY }}",
                        "base_url": "https://data.neuraltrust.ai"
                   },
                   "toxicity":{
                        "enabled": false,
                        "threshold": 0.6
                   },
                  "jailbreak":{
                        "enabled": true,
                        "threshold": 0.6
                   },
                   "mapping_field": "input.test"
            }
}