The Token Rate Limiter plugin controls AI API usage by tracking the actual number of tokens consumed per client within a configurable time window. Unlike request-based rate limiting, this approach accounts for the real cost of each interaction — a short prompt and a 10,000-token response are treated proportionally.

How It Works

The plugin operates across two stages:
  1. Pre-request: Checks whether the client has already exhausted their token budget for the current window. If the limit is exceeded, the request is rejected with 429 Too Many Requests before it reaches the upstream.
  2. Post-response: After a successful upstream response, the plugin extracts the actual token usage (from both streaming and non-streaming responses) and records it against the client's counter.
Token counting is based on the actual usage reported by the upstream provider (e.g., OpenAI, Anthropic), not estimates. The plugin uses a provider-aware response decoder to extract total_tokens from the response.
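As a sketch, extracting the reported count from a non-streaming OpenAI-style body might look like the following. This is illustrative only, not the plugin's decoder; the `usage.total_tokens` field follows OpenAI's response format, and other providers structure usage differently:

```python
import json

def extract_total_tokens(body: bytes) -> int:
    """Pull usage.total_tokens from an OpenAI-style JSON response.

    Returns 0 if the provider omitted the usage object."""
    payload = json.loads(body)
    return payload.get("usage", {}).get("total_tokens", 0)

# Abridged OpenAI-style completion body
body = b'{"id": "chatcmpl-1", "usage": {"prompt_tokens": 12, "completion_tokens": 34, "total_tokens": 46}}'
print(extract_total_tokens(body))  # 46
```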

Configuration

{
  "name": "token_rate_limiter",
  "enabled": true,
  "stage": "pre_request",
  "settings": {
    "identifier_header": "X-API-Key",
    "window": {
      "unit": "hour",
      "max": 100000
    }
  }
}
The plugin must be configured in both pre_request and post_response stages:
{
  "plugin_chain": [
    {
      "name": "token_rate_limiter",
      "enabled": true,
      "stage": "pre_request",
      "settings": {
        "identifier_header": "X-API-Key",
        "window": {
          "unit": "hour",
          "max": 100000
        }
      }
    },
    {
      "name": "token_rate_limiter",
      "enabled": true,
      "stage": "post_response",
      "settings": {
        "identifier_header": "X-API-Key",
        "window": {
          "unit": "hour",
          "max": 100000
        }
      }
    }
  ]
}

Settings

| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| `identifier_header` | string | No | HTTP header used to identify the client | Falls back to client IP |
| `window.unit` | string | Yes | Time window unit: `second`, `minute`, `hour`, or `day` | `minute` |
| `window.max` | integer | Yes | Maximum tokens allowed within the window | (none) |

Client Identification

The plugin identifies clients using the following priority:
  1. Header value — if identifier_header is set and the header is present in the request
  2. Client IP — falls back to the request’s source IP
  3. Global — if neither is available, a shared _global counter is used

Response Headers

Every response includes rate limit headers:
| Header | Example | Description |
|---|---|---|
| `X-Ratelimit-Limit-Tokens` | `100000` | Maximum tokens allowed in the window |
| `X-Ratelimit-Remaining-Tokens` | `87500` | Tokens remaining in the current window |
| `X-Ratelimit-Reset-Tokens` | `1800s` | Seconds until the window resets |
| `X-Tokens-Consumed` | `1250` | Tokens consumed by this specific request (post-response only) |
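A client can read these headers to track its remaining budget. A minimal helper, assuming the header formats shown above (note `X-Ratelimit-Reset-Tokens` carries an `s` suffix):

```python
def remaining_budget(headers: dict) -> tuple[int, int]:
    """Return (tokens remaining, seconds until reset) from a response's
    rate limit headers."""
    remaining = int(headers.get("X-Ratelimit-Remaining-Tokens", "0"))
    reset_s = int(headers.get("X-Ratelimit-Reset-Tokens", "0s").rstrip("s"))
    return remaining, reset_s

headers = {"X-Ratelimit-Remaining-Tokens": "87500",
           "X-Ratelimit-Reset-Tokens": "1800s"}
print(remaining_budget(headers))  # (87500, 1800)
```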

Rate Limit Response

When the token limit is exceeded, the plugin returns a 429 Too Many Requests error:
{
  "error": "Token rate limit exceeded. Consumed: 100500, Limit: 100000"
}
The response includes the rate limit headers so clients can determine when to retry.

Streaming Support

The plugin handles both streaming and non-streaming upstream responses:
  • Non-streaming: Decodes the full response body and extracts usage.total_tokens from the provider’s response format.
  • Streaming (SSE): Scans the stream chunks from the end to find the final chunk containing usage data, which most providers include in the last data: event.
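The reverse scan over SSE chunks can be sketched as follows. This is an illustrative reimplementation, not the plugin's code, and it assumes OpenAI-style `data:` events with a `usage` object in the final chunk before `[DONE]`:

```python
import json

def usage_from_sse(stream: bytes) -> int:
    """Walk SSE data events from the end of the stream; most providers
    put token usage in the last chunk before the [DONE] sentinel."""
    for line in reversed(stream.decode().splitlines()):
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            continue
        chunk = json.loads(payload)
        usage = chunk.get("usage")
        if usage:  # first usage object found from the end wins
            return usage.get("total_tokens", 0)
    return 0

stream = (b'data: {"choices":[{"delta":{"content":"Hi"}}]}\n\n'
          b'data: {"choices":[],"usage":{"prompt_tokens":12,"completion_tokens":34,"total_tokens":46}}\n\n'
          b'data: [DONE]\n\n')
print(usage_from_sse(stream))  # 46
```

Scanning from the end avoids parsing every content chunk when only the final usage summary matters.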

Implementation Details

Fixed Window Counter

The plugin uses a fixed window counter backed by Redis:
  • A Redis key stores the cumulative token count for each client+window combination.
  • On the first write within a window, the key is created with a TTL equal to the window duration.
  • The key naturally expires when the window elapses, resetting the counter.
  • All counter operations use an atomic Lua script to prevent race conditions.

Key Format

trl:{plugin_id}:{identifier}
Where identifier is resolved from the header, client IP, or _global.

Provider Support

Token counting works with any provider supported by TrustGate’s response adapter layer. The plugin automatically detects the provider format from the request context and uses the appropriate decoder to extract token usage.
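Conceptually, provider-aware decoding is a dispatch on the detected format. The extractors below are hypothetical stand-ins for TrustGate's response adapter layer; the field names follow the public OpenAI (`usage.total_tokens`) and Anthropic (`usage.input_tokens` / `usage.output_tokens`) response shapes:

```python
import json

def _openai_usage(body: dict) -> int:
    # OpenAI reports a precomputed total
    return body.get("usage", {}).get("total_tokens", 0)

def _anthropic_usage(body: dict) -> int:
    # Anthropic reports input and output separately; sum them
    usage = body.get("usage", {})
    return usage.get("input_tokens", 0) + usage.get("output_tokens", 0)

DECODERS = {"openai": _openai_usage, "anthropic": _anthropic_usage}

def total_tokens(provider: str, raw: bytes) -> int:
    """Pick the decoder for the provider detected in the request context."""
    decoder = DECODERS.get(provider, _openai_usage)
    return decoder(json.loads(raw))
```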

Best Practices

  1. Window Selection
    • Use hour or day for cost-control scenarios
    • Use minute or second for burst protection
    • Match the window to your billing granularity
  2. Setting Limits
    • Start with your upstream provider’s rate limits as a ceiling
    • Set per-client limits based on subscription tiers using identifier_header
    • Leave headroom for bursts — the counter is cumulative, not averaged
  3. Client Identification
    • Use identifier_header with API keys or tenant IDs for multi-tenant setups
    • Rely on IP-based identification only for simple deployments
  4. Monitoring
    • Track X-Tokens-Consumed headers to understand usage patterns
    • Monitor 429 responses for capacity planning
    • Implement exponential backoff in clients using X-Ratelimit-Reset-Tokens

Example

Setting up per-API-key limits

curl -X POST http://localhost:8080/api/v1/gateways/{gateway-id}/rules \
  -H "Content-Type: application/json" \
  -d '{
    "path": "/v1/chat/completions",
    "plugin_chain": [
      {
        "name": "token_rate_limiter",
        "enabled": true,
        "stage": "pre_request",
        "settings": {
          "identifier_header": "X-API-Key",
          "window": {
            "unit": "hour",
            "max": 50000
          }
        }
      },
      {
        "name": "token_rate_limiter",
        "enabled": true,
        "stage": "post_response",
        "settings": {
          "identifier_header": "X-API-Key",
          "window": {
            "unit": "hour",
            "max": 50000
          }
        }
      }
    ]
  }'
This limits each API key to 50,000 tokens per hour. When the limit is reached, subsequent requests receive a 429 response until the window expires.