Skip to main content
Prompt security protects the input side of every LLM call. It focuses on two attack families that look innocent at the HTTP layer but are the leading causes of model misbehavior: jailbreaks and prompt injections. In TrustGate, prompt security is delivered by the Prompt Guard detection.

Prompt Guard

Prompt Guard analyzes prompts before they reach the LLM and flags attempts to bypass the model’s safety rules or inject instructions. It is powered by NeuralTrust’s firewall.

What it detects

Prompt Guard recognizes a broad catalog of jailbreak and prompt-injection techniques, including:
TechniqueWhat it looks like
HypotheticalPrompts that ask the model to imagine or simulate forbidden scenarios.
Instruction ManipulationHidden or nested instructions embedded in user queries.
List-based InjectionUsing lists to circumvent filters or elicit forbidden responses.
ObfuscationEncoding, symbol replacement, or Unicode tricks to evade filters.
Payload SplittingSplitting malicious instructions across multiple inputs.
Prompt LeakingExtracting system prompts to alter model behavior.
Role PlayPretending to be characters or systems (DAN-style) to trick the model.
Special Token InsertionUsing unusual tokens to confuse moderation mechanisms.
Single Prompt JailbreaksDirect, one-shot jailbreak instructions.
Multi-Turn JailbreaksIncremental jailbreaks built over multi-message interactions.
Prompt ObfuscationMisdirection or encoding used to hide intent.
Prompt Injection by DelegationIndirect injection hidden inside benign-looking content (RAG chunks, tool outputs, emails, web pages).
Encoding & Unicode ObfuscationBase64, UTF-16, and other encoding tricks to bypass detection.

Configuring Prompt Guard

Prompt Guard is configured inline, on the policy that uses it. There is no separate plugin page — you wire it up as part of a policy’s When condition and set its sensitivity right there.

1. Pick the detection

In Create Policy → When, add a condition and open the detection picker. Prompt Guard lives under the Content Security category, alongside other prompt- and response-level detections (Prompt Moderation, Response Moderation, Toxicity Protection, Azure Content Safety, OpenAI Toxicity, URL Analyzer, Document Analyzer). Select Prompt Guard and confirm. The condition row now reads:
Input · triggers · 1 detection selected   →   Prompt Guard

2. Set the sensitivity

Once Prompt Guard is selected, a configuration panel appears under the condition with a Sensitivity control. The scale has four levels, with L2 labeled Balanced and used as the recommended default:
LevelProfileGood for
L1Minimal filtering — only the most obvious threats.Internal or already-hardened traffic.
L2 — BalancedRecommended default — tuned mix of recall and false-positive rate.Most production routes.
L3Higher sensitivity — catches more borderline attempts.Regulated or higher-risk workloads.
L4Maximum sensitivity — strictest filtering, more false-positives.Kids / health / finance / government.
Start at L2 — Balanced with a Log policy, review the hits on real traffic, then raise the level and/or promote the policy’s action to Block once the false-positive rate is acceptable.

3. Finish the policy

The rest of the policy is a standard Where / When / Then:
  • Where — typically the Gateway surface, optionally filtered by Routes (e.g. /openai/*) or Upstreams.
  • WhenInput · triggers · Prompt Guard (with the sensitivity set in step 2).
  • ThenLog while tuning, Block once the policy is ready for enforcement.
A single policy can combine Prompt Guard with other Content Security detections — add more rows and they’ll be AND-combined, per the policy model.