Skip to main content

Token Rate Limiting

TrustGate implements a token bucket-based rate limiting system specifically designed for AI model interactions. This system manages both the number of requests and the token usage per API key.

Overviewโ€‹

The token rate limiter operates on two dimensions:

  1. Token Usage: Controls the total number of tokens consumed
  2. Request Count: Limits the number of requests per minute

Configurationโ€‹

{
"required_plugins": [
{
"name": "token_rate_limiter",
"enabled": true,
"settings": {
"tokens_per_request": 1000, // Default tokens reserved per request
"tokens_per_minute": 10000, // Token replenishment rate
"bucket_size": 50000, // Maximum token capacity
"requests_per_minute": 60 // Maximum requests per minute
}
}
]
}

Configuration Parametersโ€‹

  • tokens_per_request: Default number of tokens reserved for each request when actual usage cannot be determined
  • tokens_per_minute: Rate at which tokens are replenished (per minute)
  • bucket_size: Maximum number of tokens that can be accumulated
  • requests_per_minute: Maximum number of requests allowed per minute

How It Worksโ€‹

Token Bucket Algorithmโ€‹

  1. Pre-Request Phase:

    • Checks if enough tokens are available in the bucket
    • Reserves tokens for the request
    • Verifies request count limits
    • For streaming requests, deducts tokens immediately
  2. Post-Response Phase:

    • Calculates actual token usage from the response
    • Updates the bucket with actual consumption
    • Updates request counts
    • Adds rate limit headers to the response

Token Replenishmentโ€‹

  • Tokens are replenished automatically based on the tokens_per_minute rate
  • Replenishment occurs when checking the bucket state
  • Maximum tokens cannot exceed the bucket_size
  • Request counts reset every minute

Response Headersโ€‹

The plugin adds the following headers to track usage:

X-Ratelimit-Limit-Tokens: [maximum tokens]
X-Ratelimit-Remaining-Tokens: [tokens remaining]
X-Ratelimit-Reset-Tokens: [seconds until next refill]
X-Ratelimit-Limit-Requests: [maximum requests per minute]
X-Ratelimit-Remaining-Requests: [requests remaining]
X-Ratelimit-Reset-Requests: [seconds until request count reset]
X-Tokens-Consumed: [tokens used in this request]

Rate Limit Responseโ€‹

When limits are exceeded, the plugin returns a 429 status code with details:

{
"error": "Rate limit exceeded. Not enough tokens available. Required: 1000, Current: 500",
"retry_after": "30s"
}

Streaming Supportโ€‹

The plugin provides special handling for streaming requests:

  • Tokens are reserved and deducted immediately at request time
  • Token usage is tracked throughout the stream
  • Final token consumption is calculated from the accumulated usage

Implementation Detailsโ€‹

Storageโ€‹

  • Uses Redis for persistent storage
  • Bucket state includes:
    • Current token count
    • Remaining requests
    • Last refill timestamp
  • 24-hour TTL on bucket data
  • Thread-safe operations with mutex locks

Key Formatโ€‹

token_bucket:{plugin_id}:{api_key}

Best Practicesโ€‹

  1. Token Configuration

    • Set tokens_per_request based on average request size
    • Configure bucket_size to handle burst traffic
    • Adjust tokens_per_minute based on subscription tiers
  2. Request Limits

    • Set requests_per_minute to prevent rapid small requests
    • Consider both token and request limits for complete protection
  3. Monitoring

    • Track token consumption patterns using response headers
    • Monitor rate limit responses for capacity planning
    • Implement exponential backoff in clients

Example Usageโ€‹

curl -X POST http://localhost:8080/api/v1/gateways/{gateway-id} \
-H "Content-Type: application/json" \
-d '{
"required_plugins": [
{
"name": "token_rate_limiter",
"enabled": true,
"settings": {
"tokens_per_request": 1000,
"tokens_per_minute": 10000,
"bucket_size": 50000,
"requests_per_minute": 60
}
}
]
}'