What it scans
Tool Guard intercepts requests directed to the LLM and inspects the following fields for jailbreak attempts:- The system instructions and system prompt.
- The description of each declared tool.
- The description of functions defined within the tools.
Where it lives in the picker
Tool Guard sits under the Agent Security category inCreate Policy → When, alongside Tool Permission and Tool Selection.
Add it to a policy, pick the Detection Threshold (sensitivity), and set the outcome in the Then step:
Log— observe detections without blocking.Block— reject the request with a403when a fragment crosses the threshold.
Why it’s distinct from Prompt Guard
| Prompt Guard | Tool Guard | |
|---|---|---|
| Watches | User-message content | System prompt + tool / function descriptions |
| Catches | Jailbreaks planted in the user’s turn | Jailbreaks planted in the agent’s own definition |
| Typical source of attack | Untrusted end-user input | Compromised tool catalog, poisoned MCP server, bad prompt template |
| Runs on | Every request that includes a user turn | Every request whose body defines tools or a system prompt |
Configuration
The Tool Guard exposes a single field in the policy’sWhen step:
| Field | Purpose |
|---|---|
| Detection Threshold | Sensitivity level for the jailbreak classifier applied across all batches of scanned fragments. Uses the shared 4-level scale — see below. |
Detection Threshold — sensitivity levels
| Level | Label | Behaviour |
|---|---|---|
| L1 | Lenient | Minimal filtering, only the most obvious threats. |
| L2 | Balanced | Recommended for most use cases. Default. |
| L3 | Enhanced | Higher sensitivity, may flag borderline content. |
| L4 | Strict | Maximum protection, strictest filtering. |
Pairs well with
- Tool permission — strip unauthorized tools from the request before Tool Guard scans what’s left.
- Tool selection — validate the tool call the model actually emits in the response.
- Prompt security — the matching control on the user-message side.