Most AI-security tooling watches the user's input. Tool Guard watches the agent's own definition: the system prompt and the tool / function descriptions the application feeds to the LLM. These fields are often templated, stitched together from multiple sources, or ingested from MCP servers you don't fully control, which makes them a natural place for attackers to hide instructions that hijack the agent.

Unlike Prompt Guard (which analyzes user-message content), Tool Guard focuses on the parts of the request that define the agent's behaviour: the system prompt and the descriptions that tell the model what each tool does and how to use it. That makes it especially effective at catching attacks where a malicious actor has planted jailbreak text inside a tool definition or the system prompt.

What it scans

Tool Guard intercepts requests directed to the LLM and inspects the following fields for jailbreak attempts:
  • The system instructions and system prompt.
  • The description of each declared tool.
  • The description of functions defined within the tools.
Contents are sent to the NeuralTrust firewall in batches. Every fragment is evaluated independently; if the maximum score across all batches crosses the configured threshold, Tool Guard reports a detection and the policy engine takes it from there.
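The scan-and-decide rule above can be sketched in a few lines. This is an illustrative sketch, not the NeuralTrust SDK: it assumes an OpenAI-style request body (`messages`, `tools`) and a pluggable `score` function standing in for the firewall's jailbreak classifier.

```python
# Illustrative sketch of Tool Guard's scan rule (hypothetical code, not the
# real implementation): collect the fields the guard inspects, then flag the
# request when the maximum fragment score crosses the threshold.
from typing import Callable

def collect_fragments(body: dict) -> list[str]:
    """Pull the system prompt plus every tool and function description."""
    fragments = []
    for msg in body.get("messages", []):
        if msg.get("role") == "system":
            fragments.append(msg.get("content", ""))
    for tool in body.get("tools", []):
        if "description" in tool:
            fragments.append(tool["description"])
        fn = tool.get("function", {})
        if "description" in fn:
            fragments.append(fn["description"])
    return [f for f in fragments if f]

def is_jailbreak(body: dict, score: Callable[[str], float], threshold: float) -> bool:
    """Each fragment is scored independently; detection fires on the max."""
    fragments = collect_fragments(body)
    return bool(fragments) and max(map(score, fragments)) >= threshold
```

Because the decision uses the maximum score, a single poisoned tool description is enough to trip the guard even if every other fragment is clean.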

Where it lives in the picker

Tool Guard sits under the Agent Security category in Create Policy → When, alongside Tool Permission and Tool Selection. Add it to a policy, pick the Detection Threshold (sensitivity), and set the outcome in the Then step:
  • Log — observe detections without blocking.
  • Block — reject the request with a 403 when a fragment crosses the threshold.
Use the policy’s Where filters (application, endpoint) to scope Tool Guard to the routes that expose agentic or MCP-style tool definitions.
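Putting the three steps together, a Tool Guard policy might look like the following. The field names and shape are hypothetical, chosen only to mirror the When / Then / Where steps described above; they are not the actual NeuralTrust policy schema.

```python
# Hypothetical policy shape (illustrative field names, not the real schema):
# "when" picks the guard and its sensitivity, "then" picks the outcome,
# "where" scopes the policy to the routes that expose tool definitions.
policy = {
    "when": {"guard": "tool_guard", "detection_threshold": "L2"},  # Balanced (default)
    "then": {"action": "block", "status_code": 403},               # or "log" to observe only
    "where": {"application": "support-agent", "endpoint": "/v1/chat/completions"},
}
```

Starting with `"action": "log"` and switching to `"block"` once you have reviewed a few days of detections is a common rollout pattern.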

Why it’s distinct from Prompt Guard

|                          | Prompt Guard                             | Tool Guard                                                         |
| ------------------------ | ---------------------------------------- | ------------------------------------------------------------------ |
| Watches                  | User-message content                     | System prompt + tool / function descriptions                       |
| Catches                  | Jailbreaks planted in the user's turn    | Jailbreaks planted in the agent's own definition                   |
| Typical source of attack | Untrusted end-user input                 | Compromised tool catalog, poisoned MCP server, bad prompt template |
| Runs on                  | Every request that includes a user turn  | Every request whose body defines tools or a system prompt          |
Running both in the same policy set is normal — they cover different fields and different threat models.

Configuration

Tool Guard exposes a single field in the policy's When step:

| Field               | Purpose                                                                                                                              |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| Detection Threshold | Sensitivity level for the jailbreak classifier applied across all batches of scanned fragments. Uses the shared 4-level scale (see below). |

Detection Threshold — sensitivity levels

| Level | Label    | Behaviour                                        |
| ----- | -------- | ------------------------------------------------ |
| L1    | Lenient  | Minimal filtering; only the most obvious threats. |
| L2    | Balanced | Recommended for most use cases. Default.         |
| L3    | Enhanced | Higher sensitivity; may flag borderline content. |
| L4    | Strict   | Maximum protection, strictest filtering.         |

Pairs well with

  • Tool Permission — strip unauthorized tools from the request before Tool Guard scans what's left.
  • Tool Selection — validate the tool call the model actually emits in the response.
  • Prompt Guard — the matching control on the user-message side.