The Tool Budget Limiter plugin controls the total and per-tool token usage of agent tool calls. It prevents runaway costs by enforcing configurable token budgets and reacting when limits are exceeded (block, throttle, or alert-only).

What it does

  • Parses agent tool usage in requests and responses
  • Enforces an overall token budget and optional per-tool budgets
  • Actions when a limit is exceeded:
    • Block: reject the call with an error (HTTP 429)
    • Throttle: delay for throttle_delay seconds, then allow
    • Alert only: allow the call and emit a log/telemetry event
  • Adds an X-TOKEN-LIMIT response header with the value within or exceeded
  • Works in both PreRequest and PreResponse stages
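The decision between block, throttle, and alert-only can be sketched as a small Python function. This is a hypothetical illustration of the behavior described above; the function name, return values, and signature are not the plugin's internal API.

```python
import time

def enforce_budget(used_tokens, call_tokens, max_tokens, mode, throttle_delay=0):
    """Illustrative sketch: decide what happens to a tool call given
    token usage in the current window and the configured mode."""
    if used_tokens + call_tokens <= max_tokens:
        return "within"        # under budget: X-TOKEN-LIMIT would be "within"
    if mode == "block":
        return "blocked"       # gateway responds with HTTP 429
    if mode == "throttle":
        time.sleep(throttle_delay)  # add the configured delay, then allow
        return "throttled"
    return "alerted"           # alert_only: allow, but log/emit telemetry
```

For example, with a 200-token budget, a call that would bring usage to 240 tokens is rejected in block mode but merely logged in alert_only mode.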

Configuration Parameters

Parameter                 Type    Description                                                   Required  Default
mode                      string  Action when a limit is exceeded: throttle, block, alert_only  Yes       -
max_limit.max_tokens      int     Overall token budget                                          Yes       -
max_limit.window          int     Budget window in seconds                                      Yes       -
max_limit.ttl             int     Cache TTL in seconds                                          Yes       -
tool_limits[].name        string  Tool name                                                     No        -
tool_limits[].max_tokens  int     Per-tool token budget                                         No        -
tool_limits[].window      int     Per-tool window in seconds                                    No        -
tool_limits[].ttl         int     Per-tool TTL in seconds                                       No        -
throttle_delay            int     Delay (seconds) when mode=throttle                            No        0
Behavior notes:
  • Stages: PreRequest (checks budgets before execution), PreResponse (records usage after execution)
  • Usage is keyed internally by gateway and rule, making it safe for multi-tenant setups
  • If parsing fails or no tool calls are found, the request is allowed
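The interplay between a budget window and its counter can be illustrated with a minimal fixed-window tracker. This is a sketch under assumed semantics (usage resets once the window elapses); the plugin's actual storage, keying, and TTL-based expiry are internal to the gateway.

```python
import time
from collections import defaultdict

class WindowedUsage:
    """Illustrative fixed-window token counter keyed by an arbitrary
    tuple (e.g. gateway, rule, tool). Not the plugin's real storage."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        # key -> (tokens used in current window, window start time)
        self.counters = defaultdict(lambda: (0, 0.0))

    def add(self, key, tokens, now=None):
        now = time.time() if now is None else now
        used, start = self.counters[key]
        if now - start >= self.window:   # window elapsed: start a new one
            used, start = 0, now
        self.counters[key] = (used + tokens, start)

    def used(self, key, now=None):
        now = time.time() if now is None else now
        used, start = self.counters[key]
        return used if now - start < self.window else 0
```

With a 1800-second window (matching the per-tool examples below), usage accumulated early in the window counts against the budget, and reads after the window elapses report zero.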

Prerequisites

These agent security plugins require upstreams configured in provider mode. See Upstream Services & Routing for details: /trustgate/core-concepts/upstream-services-overview

Example upstream (provider mode):
{
  "name": "{{upstream_service_name}}",
  "algorithm": "round-robin",
  "targets": [
    {
      "provider": "openai",
      "provider_options": { "api": "responses" },
      "weight": 50,
      "priority": 1,
      "default_model": "gpt-4o-mini",
      "models": ["gpt-4", "gpt-4o-mini"],
      "stream": false,
      "credentials": { "api_key": "" }
    }
  ]
}

Example configuration

{
  "name": "tool_budget_limiter",
  "enabled": true,
  "stage": "pre_request",
  "priority": 1,
  "parallel": false,
  "settings": {
    "mode": "block",
    "max_limit": { "max_tokens": 50000, "window": 3600, "ttl": 3600 },
    "tool_limits": [
      { "name": "web_search", "max_tokens": 10000, "window": 1800, "ttl": 1800 },
      { "name": "db_query", "max_tokens": 5000, "window": 1800, "ttl": 1800 }
    ],
    "throttle_delay": 5
  }
}
To record usage, also add the plugin in the pre_response stage with the same settings:
{
  "name": "tool_budget_limiter",
  "enabled": true,
  "stage": "pre_response",
  "priority": 100,
  "parallel": false,
  "settings": {
    "mode": "block",
    "max_limit": { "max_tokens": 50000, "window": 3600, "ttl": 3600 },
    "tool_limits": [
      { "name": "web_search", "max_tokens": 10000, "window": 1800, "ttl": 1800 },
      { "name": "db_query", "max_tokens": 5000, "window": 1800, "ttl": 1800 }
    ],
    "throttle_delay": 5
  }
}

Best practices

  • Start with alert-only to observe typical budgets, then switch to block/throttle
  • Use per-tool budgets for expensive tools to prevent localized spikes
  • Prefer reasonable windows (e.g., 30–60 minutes) and align TTL with window
  • Surface the X-TOKEN-LIMIT header to clients/ops dashboards for visibility
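Clients or dashboards can read the budget status from the response headers. The sketch below assumes a plain dict of headers; the header name lookup is case-insensitive since HTTP header casing varies by client library.

```python
def budget_status(headers):
    """Return 'within' or 'exceeded' from the X-TOKEN-LIMIT response
    header, or None if the header is absent. Illustrative helper."""
    for name, value in headers.items():
        if name.lower() == "x-token-limit":
            return value
    return None
```

A monitoring hook might alert whenever this returns "exceeded" so that operators notice budget pressure before requests start being blocked.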

Compatibility

Currently supports agents using the OpenAI LLM request/response format only.