Most AI-security tooling watches the user’s input. Tool Guard watches the agent’s own definition — the system prompt and the tool / function descriptions the application feeds to the LLM. These fields are often templated, stitched together from multiple sources, or ingested from MCP servers you don’t fully control, which makes them a natural place for attackers to hide instructions that hijack the agent.
Unlike Prompt Guard (which analyzes user-message content), Tool Guard focuses on the parts of the request that define the agent’s behaviour: the system prompt and the descriptions that tell the model what each tool does and how to use it. That makes it especially effective at catching attacks where a malicious actor has planted jailbreak text inside a tool definition or the system prompt.
What it scans
Tool Guard intercepts requests directed to the LLM and inspects the following fields for jailbreak attempts:
- The system instructions and system prompt.
- The description of each declared tool.
- The description of functions defined within the tools.
Contents are sent to the NeuralTrust firewall in batches. Every fragment is evaluated independently; if the maximum score across all batches crosses the configured threshold, Tool Guard reports a detection and the policy engine takes it from there.
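The scan-and-aggregate flow above can be sketched roughly as follows. This is an illustration only, not NeuralTrust's implementation: the request shape (an OpenAI-style `messages` / `tools` body, plus an optional top-level `system` string), the helper names, the batch size, and the `score_batch` callable standing in for the firewall call are all assumptions. Treating "crosses the threshold" as `>=` is also a guess.

```python
from typing import Callable, Iterator, List

# Hypothetical stand-in for the firewall call: takes a batch of text
# fragments and returns one jailbreak score per fragment.
ScoreBatch = Callable[[List[str]], List[float]]


def collect_fragments(body: dict) -> List[str]:
    """Gather system prompt text plus tool and function descriptions."""
    fragments: List[str] = []
    # Top-level system instructions (assumed field name).
    if isinstance(body.get("system"), str):
        fragments.append(body["system"])
    # System-role messages; user and assistant turns are deliberately skipped.
    for message in body.get("messages", []):
        if message.get("role") == "system" and isinstance(message.get("content"), str):
            fragments.append(message["content"])
    # Description of each declared tool and of the function nested inside it.
    for tool in body.get("tools", []):
        if tool.get("description"):
            fragments.append(tool["description"])
        function = tool.get("function") or {}
        if function.get("description"):
            fragments.append(function["description"])
    return fragments


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def max_score(fragments: List[str], score_batch: ScoreBatch, batch_size: int = 8) -> float:
    """Score every fragment independently and keep the highest score seen."""
    best = 0.0
    for batch in batched(fragments, batch_size):
        best = max(best, max(score_batch(batch)))
    return best


def is_detection(fragments: List[str], score_batch: ScoreBatch,
                 threshold: float, batch_size: int = 8) -> bool:
    """Report a detection when the maximum score reaches the threshold (assumed >=)."""
    return max_score(fragments, score_batch, batch_size) >= threshold
```

Because only the maximum matters, a single poisoned tool description is enough to trigger a detection, however many benign fragments accompany it.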
Where it lives in the picker
Tool Guard sits under the Agent Security category in Create Policy → When, alongside Tool Permission and Tool Selection.
Add it to a policy, pick the Detection Threshold (sensitivity), and set the outcome in the Then step:
- Log: observe detections without blocking.
- Block: reject the request with a 403 when a fragment crosses the threshold.
Use the policy’s Where filters (application, endpoint) to scope Tool Guard to the routes that expose agentic or MCP-style tool definitions.
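The two outcomes can be pictured as a small decision function. The sketch below is hypothetical: `Outcome`, `Decision`, and `apply_outcome` are illustrative names rather than part of the product, and recording blocked requests as well as logged ones is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    LOG = "log"      # observe detections without blocking
    BLOCK = "block"  # reject the request with a 403


@dataclass(frozen=True)
class Decision:
    forward: bool           # whether the request continues to the LLM
    status: Optional[int]   # HTTP status returned to the caller, if rejected
    recorded: bool          # whether a detection event is recorded


def apply_outcome(detected: bool, outcome: Outcome) -> Decision:
    """Map a Tool Guard detection and the policy's Then step to a decision."""
    if not detected:
        return Decision(forward=True, status=None, recorded=False)
    if outcome is Outcome.BLOCK:
        # Assumption: a blocked request is also recorded as a detection.
        return Decision(forward=False, status=403, recorded=True)
    return Decision(forward=True, status=None, recorded=True)
```

Starting a new policy in Log mode makes it possible to review what would have been blocked before switching the outcome to Block.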
Why it’s distinct from Prompt Guard
| | Prompt Guard | Tool Guard |
|---|---|---|
| Watches | User-message content | System prompt + tool / function descriptions |
| Catches | Jailbreaks planted in the user’s turn | Jailbreaks planted in the agent’s own definition |
| Typical source of attack | Untrusted end-user input | Compromised tool catalog, poisoned MCP server, bad prompt template |
| Runs on | Every request that includes a user turn | Every request whose body defines tools or a system prompt |
Running both in the same policy set is normal — they cover different fields and different threat models.
Configuration
Tool Guard exposes a single field in the policy’s When step:
| Field | Purpose |
|---|---|
| Detection Threshold | Sensitivity level for the jailbreak classifier applied across all batches of scanned fragments. Uses the shared 4-level scale — see below. |
Detection Threshold — sensitivity levels
| Level | Label | Behaviour |
|---|---|---|
| L1 | Lenient | Minimal filtering, only the most obvious threats. |
| L2 | Balanced | Recommended for most use cases. Default. |
| L3 | Enhanced | Higher sensitivity, may flag borderline content. |
| L4 | Strict | Maximum protection, strictest filtering. |
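If the level is stored in your own configuration, the four-level scale could be modelled as shown here. This is a minimal sketch: `DetectionThreshold` and `parse_threshold` are hypothetical names, and no numeric cut-offs are included because the documentation does not state them.

```python
from enum import Enum
from typing import Optional


class DetectionThreshold(Enum):
    """The shared four-level sensitivity scale, keyed by level with its label."""
    L1 = "Lenient"
    L2 = "Balanced"
    L3 = "Enhanced"
    L4 = "Strict"


# L2 (Balanced) is the documented default.
DEFAULT_THRESHOLD = DetectionThreshold.L2


def parse_threshold(value: Optional[str]) -> DetectionThreshold:
    """Accept a level ("L3") or a label ("Enhanced"), case-insensitively."""
    if value is None or not value.strip():
        return DEFAULT_THRESHOLD
    key = value.strip().upper()
    for level in DetectionThreshold:
        if key == level.name or key == level.value.upper():
            return level
    raise ValueError(f"Unknown detection threshold: {value!r}")
```

Rejecting unknown values, rather than silently falling back to the default, keeps a typo from weakening a policy that was meant to be Strict.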
Pairs well with
- Tool Permission: strip unauthorized tools from the request before Tool Guard scans what’s left.
- Tool Selection: validate the tool call the model actually emits in the response.
- Prompt security: the matching control on the user-message side.