How It Works
The plugin operates across two stages:
- Pre-request: Checks whether the client has already exhausted their token budget for the current window. If the limit is exceeded, the request is rejected with 429 Too Many Requests before it reaches the upstream.
- Post-response: After a successful upstream response, extracts the actual token usage (from both streaming and non-streaming responses) and records it against the client’s counter.
Usage is counted as total_tokens from the response.
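The two stages can be sketched as follows. This is a minimal in-memory illustration rather than TrustGate's implementation: the names TokenBudget, pre_request, and post_response are hypothetical, windowing is left out, and treating "used >= max" as exhausted is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBudget:
    """Hypothetical in-memory stand-in for the plugin's per-client counters."""

    max_tokens: int
    used: dict[str, int] = field(default_factory=dict)

    def pre_request(self, client: str) -> int:
        """Return the HTTP status the pre-request stage would produce."""
        # Only the tokens already recorded are checked; the cost of the
        # current request is unknown until the upstream responds.
        if self.used.get(client, 0) >= self.max_tokens:
            return 429
        return 200

    def post_response(self, client: str, total_tokens: int) -> None:
        """Record the usage reported by the upstream response."""
        self.used[client] = self.used.get(client, 0) + total_tokens


budget = TokenBudget(max_tokens=1000)
first = budget.pre_request("key-a")   # nothing recorded yet, so allowed
budget.post_response("key-a", 1250)   # usage is recorded after the response
second = budget.pre_request("key-a")  # budget is now exhausted
```

Under this model a single response can push a client past the limit, because its cost is only known afterwards; the following request is the one that gets rejected.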
Configuration
Configure the plugin for both the pre_request and post_response stages.
Settings
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| identifier_header | string | No | HTTP header used to identify the client | Falls back to client IP |
| window.unit | string | Yes | Time window unit: second, minute, hour, or day | minute |
| window.max | integer | Yes | Maximum tokens allowed within the window | (none) |
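As a sketch of how these settings fit together, the snippet below validates a settings dictionary and converts the window unit to seconds. The nested shape (a window object with unit and max keys) is inferred from the dotted parameter names, parse_settings is a hypothetical helper, the header name X-API-Key is only an example, and requiring a positive window.max is an assumption.

```python
WINDOW_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def parse_settings(settings: dict) -> dict:
    """Validate plugin settings and return a normalized copy.

    Assumption: 'window.unit' and 'window.max' map to a nested 'window'
    object; the real configuration schema may differ.
    """
    window = settings.get("window") or {}
    unit = window.get("unit", "minute")  # documented default
    if unit not in WINDOW_SECONDS:
        raise ValueError(f"window.unit must be one of {sorted(WINDOW_SECONDS)}")

    max_tokens = window.get("max")
    is_int = isinstance(max_tokens, int) and not isinstance(max_tokens, bool)
    if not is_int or max_tokens <= 0:
        raise ValueError("window.max must be a positive integer")

    return {
        # None means the plugin falls back to the client IP.
        "identifier_header": settings.get("identifier_header"),
        "window_seconds": WINDOW_SECONDS[unit],
        "max_tokens": max_tokens,
    }


example = parse_settings(
    {"identifier_header": "X-API-Key", "window": {"unit": "hour", "max": 100000}}
)
```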
Client Identification
The plugin identifies clients using the following priority:
- Header value: if identifier_header is set and the header is present in the request
- Client IP: falls back to the request’s source IP
- Global: if neither is available, a shared _global counter is used
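The priority order above can be expressed as a small function. This is an illustrative sketch: resolve_identifier is a hypothetical name, and treating an empty header value as absent is an assumption.

```python
from typing import Mapping, Optional


def resolve_identifier(
    headers: Mapping[str, str],
    client_ip: Optional[str],
    identifier_header: Optional[str] = None,
) -> str:
    """Pick the rate-limit identifier: header value, then client IP, then _global."""
    if identifier_header:
        # HTTP header names are case-insensitive, so compare in lowercase.
        wanted = identifier_header.lower()
        for name, value in headers.items():
            if name.lower() == wanted and value:
                return value
    if client_ip:
        return client_ip
    return "_global"


by_header = resolve_identifier({"x-api-key": "tenant-42"}, "10.0.0.7", "X-API-Key")
by_ip = resolve_identifier({}, "10.0.0.7", "X-API-Key")
shared = resolve_identifier({}, None)
```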
Response Headers
Every response includes rate limit headers:
| Header | Example | Description |
|---|---|---|
| X-Ratelimit-Limit-Tokens | 100000 | Maximum tokens allowed in the window |
| X-Ratelimit-Remaining-Tokens | 87500 | Tokens remaining in the current window |
| X-Ratelimit-Reset-Tokens | 1800s | Seconds until the window resets |
| X-Tokens-Consumed | 1250 | Tokens consumed by this specific request (post-response only) |
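To make the relationship between these headers concrete, here is a sketch that derives them from a counter state. The function name rate_limit_headers is hypothetical, and clamping the remaining count at zero is an assumption about how an overshoot is reported.

```python
from typing import Optional


def rate_limit_headers(
    limit: int,
    used: int,
    reset_seconds: int,
    consumed: Optional[int] = None,
) -> dict[str, str]:
    """Build the rate limit headers from the current counter state."""
    headers = {
        "X-Ratelimit-Limit-Tokens": str(limit),
        # Assumption: remaining never goes negative, even after an overshoot.
        "X-Ratelimit-Remaining-Tokens": str(max(limit - used, 0)),
        "X-Ratelimit-Reset-Tokens": f"{reset_seconds}s",
    }
    if consumed is not None:
        # Only known in the post-response stage.
        headers["X-Tokens-Consumed"] = str(consumed)
    return headers


# Values chosen to reproduce the examples in the table.
post_response_headers = rate_limit_headers(
    limit=100000, used=12500, reset_seconds=1800, consumed=1250
)
pre_request_headers = rate_limit_headers(limit=100000, used=12500, reset_seconds=1800)
```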
Rate Limit Response
When the token limit is exceeded, the plugin returns a 429 Too Many Requests error.
Streaming Support
The plugin handles both streaming and non-streaming upstream responses:
- Non-streaming: Decodes the full response body and extracts usage.total_tokens from the provider’s response format.
- Streaming (SSE): Scans the stream chunks from the end to find the final chunk containing usage data, which most providers include in the last data: event.
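Both extraction paths can be sketched in a few lines. The example assumes an OpenAI-style payload (a JSON object with a usage.total_tokens field and a terminating data: [DONE] event); other providers use different shapes, and the function names are hypothetical.

```python
import json
from typing import Optional


def total_tokens_from_body(body: str) -> Optional[int]:
    """Non-streaming: read usage.total_tokens from a JSON response body."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    usage = payload.get("usage")
    if isinstance(usage, dict) and isinstance(usage.get("total_tokens"), int):
        return usage["total_tokens"]
    return None


def total_tokens_from_sse(stream: str) -> Optional[int]:
    """Streaming: scan data: events from the end for the first one with usage."""
    for line in reversed(stream.splitlines()):
        if not line.startswith("data:"):
            continue
        tokens = total_tokens_from_body(line[len("data:"):].strip())
        if tokens is not None:
            return tokens
    return None


stream = (
    'data: {"choices": [{"delta": {"content": "Hi"}}], "usage": null}\n\n'
    'data: {"choices": [], "usage": {"total_tokens": 1250}}\n\n'
    "data: [DONE]\n\n"
)
streamed_tokens = total_tokens_from_sse(stream)
```

Scanning from the end keeps the work small for long streams, since the usage event is normally the last or second-to-last one.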
Implementation Details
Fixed Window Counter
The plugin uses a fixed window counter backed by Redis:
- A Redis key stores the cumulative token count for each client+window combination.
- On the first write within a window, the key is created with a TTL equal to the window duration.
- The key naturally expires when the window elapses, resetting the counter.
- All counter operations use an atomic Lua script to prevent race conditions.
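The counter behaviour described above can be modelled without Redis. The class below is an in-memory sketch for illustration only: FixedWindowCounter is a hypothetical name, and the injectable clock exists so the expiry can be exercised in tests. In Redis, the same steps would typically be an INCRBY followed by an EXPIRE on first write, run inside one Lua script so they apply atomically; the plugin's actual script is not shown in this document.

```python
import time
from typing import Callable


class FixedWindowCounter:
    """In-memory model of a counter key that expires when the window ends."""

    def __init__(
        self,
        window_seconds: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window_seconds = window_seconds
        self.clock = clock
        # key -> (cumulative token count, expiry timestamp)
        self._entries: dict[str, tuple[int, float]] = {}

    def get(self, key: str) -> int:
        """Return the current count, or 0 if the key is missing or expired."""
        count, expires_at = self._entries.get(key, (0, 0.0))
        return count if self.clock() < expires_at else 0

    def add(self, key: str, tokens: int) -> int:
        """Add tokens; the first write in a window sets the expiry (the TTL)."""
        now = self.clock()
        count, expires_at = self._entries.get(key, (0, 0.0))
        if now >= expires_at:
            # Key missing or expired: start a fresh window from this write.
            count, expires_at = 0, now + self.window_seconds
        count += tokens
        self._entries[key] = (count, expires_at)
        return count
```

Note that the window starts at the first write rather than on a wall-clock boundary, which matches a key created with a TTL.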
Key Format
The identifier part of the key is resolved from the header, client IP, or _global.
Provider Support
Token counting works with any provider supported by TrustGate’s response adapter layer. The plugin automatically detects the provider format from the request context and uses the appropriate decoder to extract token usage.
Best Practices
- Window Selection
  - Use hour or day for cost-control scenarios
  - Use minute or second for burst protection
  - Match the window to your billing granularity
- Setting Limits
  - Start with your upstream provider’s rate limits as a ceiling
  - Set per-client limits based on subscription tiers using identifier_header
  - Leave headroom for bursts: the counter is cumulative, not averaged
- Client Identification
  - Use identifier_header with API keys or tenant IDs for multi-tenant setups
  - Rely on IP-based identification only for simple deployments
- Monitoring
  - Track X-Tokens-Consumed headers to understand usage patterns
  - Monitor 429 responses for capacity planning
  - Implement exponential backoff in clients using X-Ratelimit-Reset-Tokens
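For the client-side backoff mentioned under Monitoring, one possible policy is sketched below: exponential backoff with jitter, capped by the reset time from X-Ratelimit-Reset-Tokens when that header is present, since waiting longer than the window reset gains nothing. The function names, the base and cap defaults, and the jitter range are all illustrative choices, not something the plugin prescribes.

```python
import random
from typing import Callable, Mapping, Optional


def parse_reset_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Parse X-Ratelimit-Reset-Tokens values such as '1800s' into seconds."""
    for name, value in headers.items():
        if name.lower() != "x-ratelimit-reset-tokens":
            continue
        raw = value.strip()
        if raw.endswith("s"):
            raw = raw[:-1]
        try:
            return float(raw)
        except ValueError:
            return None
    return None


def retry_delay(
    attempt: int,
    headers: Mapping[str, str],
    base: float = 1.0,
    cap: float = 60.0,
    jitter: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    delay = min(cap, base * (2 ** attempt))
    # Scale into [0.5 * delay, delay) so clients do not retry in lockstep.
    delay *= 0.5 + 0.5 * jitter()
    reset = parse_reset_seconds(headers)
    if reset is not None:
        # No point sleeping past the moment the window resets.
        delay = min(delay, max(reset, 0.0))
    return delay
```

A caller would sleep for retry_delay(attempt, response_headers) after each 429 and increment attempt until the request succeeds or a retry budget is spent.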
Example
Setting up per-API-key limits
Once an API key has used up its token budget, further requests from that key receive a 429 response until the window expires.