Most AI-security tooling watches the user's input. Tool Guard watches the agent's own definition: the system prompt and the tool / function descriptions the application feeds to the LLM. These fields are often templated, stitched together from multiple sources, or ingested from MCP servers you don't fully control, which makes them a natural place for attackers to hide instructions that hijack the agent.

Unlike Prompt Guard (which analyzes user-message content), Tool Guard focuses on the parts of the request that define the agent's behaviour: the system prompt and the descriptions that tell the model what each tool does and how to use it. That makes it especially effective at catching attacks where a malicious actor has planted jailbreak text inside a tool definition or the system prompt.

What it scans

Tool Guard intercepts requests directed to the LLM and inspects the following fields for jailbreak attempts:
  • The system instructions and system prompt.
  • The description of each declared tool.
  • The description of functions defined within the tools.
Contents are sent to the NeuralTrust firewall in batches. Every fragment is evaluated independently; if the maximum score across all batches crosses the configured threshold, Tool Guard reports a detection and the policy engine takes it from there.
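The scan-and-decide rule above can be sketched in a few lines. This is an illustrative sketch, not the NeuralTrust SDK: it assumes an OpenAI-style request body (`messages`, `tools`) and a pluggable `score` function standing in for the firewall's jailbreak classifier.

```python
# Illustrative sketch of Tool Guard's scan rule (hypothetical code, not the
# real implementation): collect the fields the guard inspects, then flag the
# request when the maximum fragment score crosses the threshold.
from typing import Callable

def collect_fragments(body: dict) -> list[str]:
    """Pull the system prompt plus every tool and function description."""
    fragments = []
    for msg in body.get("messages", []):
        if msg.get("role") == "system":
            fragments.append(msg.get("content", ""))
    for tool in body.get("tools", []):
        if "description" in tool:
            fragments.append(tool["description"])
        fn = tool.get("function", {})
        if "description" in fn:
            fragments.append(fn["description"])
    return [f for f in fragments if f]

def is_jailbreak(body: dict, score: Callable[[str], float], threshold: float) -> bool:
    """Each fragment is scored independently; detection fires on the max."""
    fragments = collect_fragments(body)
    return bool(fragments) and max(map(score, fragments)) >= threshold
```

Because the decision uses the maximum score, a single poisoned tool description is enough to trip the guard even if every other fragment is clean.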

Where it lives in the picker

Tool Guard sits under the Agent Security category in Create Policy → When, alongside Tool Permission and Tool Selection. Add it to a policy, pick the Detection Threshold (sensitivity), and set the outcome in the Then step:
  • Log — observe detections without blocking.
  • Block — reject the request with a 403 when a fragment crosses the threshold.
Use the policy’s Where filters (application, endpoint) to scope Tool Guard to the routes that expose agentic or MCP-style tool definitions.
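Putting the three steps together, a Tool Guard policy might look like the following. The field names and shape are hypothetical, chosen only to mirror the When / Then / Where steps described above; they are not the actual NeuralTrust policy schema.

```python
# Hypothetical policy shape (illustrative field names, not the real schema):
# "when" picks the guard and its sensitivity, "then" picks the outcome,
# "where" scopes the policy to the routes that expose tool definitions.
policy = {
    "when": {"guard": "tool_guard", "detection_threshold": "L2"},  # Balanced (default)
    "then": {"action": "block", "status_code": 403},               # or "log" to observe only
    "where": {"application": "support-agent", "endpoint": "/v1/chat/completions"},
}
```

Starting with `"action": "log"` and switching to `"block"` once you have reviewed a few days of detections is a common rollout pattern.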

Why it’s distinct from Prompt Guard

|                          | Prompt Guard                             | Tool Guard                                                         |
| ------------------------ | ---------------------------------------- | ------------------------------------------------------------------ |
| Watches                  | User-message content                     | System prompt + tool / function descriptions                       |
| Catches                  | Jailbreaks planted in the user's turn    | Jailbreaks planted in the agent's own definition                   |
| Typical source of attack | Untrusted end-user input                 | Compromised tool catalog, poisoned MCP server, bad prompt template |
| Runs on                  | Every request that includes a user turn  | Every request whose body defines tools or a system prompt          |
Running both in the same policy set is normal — they cover different fields and different threat models.

Configuration

Tool Guard exposes a single field in the policy's When step:

| Field               | Purpose                                                                                                                              |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| Detection Threshold | Sensitivity level for the jailbreak classifier applied across all batches of scanned fragments. Uses the shared 4-level scale (see below). |

Detection Threshold — sensitivity levels

| Level | Label    | Behaviour                                        |
| ----- | -------- | ------------------------------------------------ |
| L1    | Lenient  | Minimal filtering; only the most obvious threats. |
| L2    | Balanced | Recommended for most use cases. Default.         |
| L3    | Enhanced | Higher sensitivity; may flag borderline content. |
| L4    | Strict   | Maximum protection, strictest filtering.         |

Pairs well with

  • Tool Permission — strip unauthorized tools from the request before Tool Guard scans what's left.
  • Tool Selection — validate the tool call the model actually emits in the response.
  • Prompt Guard — the matching control on the user-message side.