OpenAI Token-Based Rate Limiting
This guide demonstrates how to implement token-based rate limiting in TrustGate to effectively manage AI model usage and costs. Unlike traditional request-based rate limiting, token-based rate limiting considers the actual token consumption of each request, providing more granular control over API usage.
Prerequisites
- TrustGate installed and running
- An API key for an AI provider (e.g., OpenAI)
- Basic understanding of rate limiting concepts
Step 1: Create a Gateway with Token Rate Limiter
First, create a gateway with the token rate limiter plugin enabled:
curl -X POST "http://localhost:8080/api/v1/gateways" \
-H "Content-Type: application/json" \
-d '{
"name": "OpenAI Token Rate Limiter",
"subdomain": "your-subdomain",
"required_plugins": [
{
"name": "token_rate_limiter",
"enabled": true,
"settings": {
"tokens_per_request": 20, # Maximum tokens per single request
"tokens_per_minute": 100, # Token budget per minute
"bucket_size": 1000, # Maximum token bucket capacity
"requests_per_minute": 60 # Maximum requests per minute
}
}
]
}'
Configuration Parameters
- tokens_per_request: Maximum number of tokens allowed in a single request
- tokens_per_minute: Total token budget allocated per minute
- bucket_size: Maximum capacity of the token bucket
- requests_per_minute: Maximum number of requests allowed per minute
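As a rough sizing illustration (the numbers below are hypothetical, not TrustGate defaults), you can derive a starting tokens_per_minute budget from your expected traffic and typical request size:

# Hypothetical sizing calculation: token budget = average tokens per request x expected requests per minute
avg_tokens_per_request=20
expected_requests_per_minute=5
echo $(( avg_tokens_per_request * expected_requests_per_minute ))   # 100 tokens per minute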
Step 2: Configure an Upstream
Set up an upstream that connects to your AI provider:
curl -X POST "http://localhost:8080/api/v1/gateways/{gateway_id}/upstreams" \
-H "Content-Type: application/json" \
-d '{
"name": "ai-upstream",
"algorithm": "round-robin",
"targets": [{
"path": "/v1/chat/completions",
"provider": "openai",
"weight": 100,
"models": ["gpt-3.5-turbo", "gpt-4"],
"default_model": "gpt-3.5-turbo",
"credentials": {
"header_name": "Authorization",
"header_value": "Bearer your-api-key"
}
}]
}'
Step 3: Create a Service
Create a service that uses the upstream:
curl -X POST "http://localhost:8080/api/v1/gateways/{gateway_id}/services" \
-H "Content-Type: application/json" \
-d '{
"name": "ai-service",
"type": "upstream",
"description": "AI API Service",
"upstream_id": "{upstream_id}",
"retries": 3
}'
Step 4: Add a Rule
Configure a rule to route requests to your service:
curl -X POST "http://localhost:8080/api/v1/gateways/{gateway_id}/rules" \
-H "Content-Type: application/json" \
-d '{
"path": "/v1",
"service_id": "{service_id}",
"methods": ["POST"],
"strip_path": false,
"active": true
}'
Step 5: Generate an API Key
Create an API key for authentication:
curl -X POST "http://localhost:8080/api/v1/gateways/{gateway_id}/keys" \
-H "Content-Type: application/json" \
-d '{
"name": "test-key",
"expires_at": "2024-12-31T23:59:59Z"
}'
Using the Rate-Limited API
When making requests to your rate-limited API, you'll receive headers indicating your current rate limit status:
curl -X POST "http://localhost:8081/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Host: your-subdomain.example.com" \
-H "X-Api-Key: your-api-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "user",
"content": "Hello, how are you?"
}
]
}'
Response Headers
The API returns several headers to help you track your rate limit status:
- X-RateLimit-Limit-Tokens: Total token budget per minute
- X-RateLimit-Remaining-Tokens: Remaining tokens in the current window
- X-RateLimit-Reset-Tokens: Time until token budget resets (in seconds)
- X-RateLimit-Limit-Requests: Maximum requests per minute
- X-RateLimit-Remaining-Requests: Remaining requests in the current window
- X-RateLimit-Reset-Requests: Time until request count resets (in seconds)
- X-Tokens-Consumed: Number of tokens consumed by the current request
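To inspect these headers directly, you can ask curl to dump the response headers and filter for the rate limit fields. The sketch below reuses the request from the previous step:

curl -s -D - -o /dev/null -X POST "http://localhost:8081/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Host: your-subdomain.example.com" \
-H "X-Api-Key: your-api-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "Hello, how are you?"}]
}' | grep -iE 'x-ratelimit|x-tokens-consumed'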
Best Practices
- Set Appropriate Limits
  - Consider your AI model's pricing
  - Account for both input and output tokens
  - Leave headroom for burst traffic
- Monitor Usage
  - Track token consumption patterns
  - Watch for rate limit errors
  - Adjust limits based on actual usage
- Handle Rate Limit Errors (see the sketch after this list)
  - Implement exponential backoff
  - Queue requests when approaching limits
  - Provide clear feedback to users
- Optimize Token Usage
  - Compress prompts where possible
  - Use efficient model configurations
  - Implement client-side token estimation
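The following is a minimal sketch of exponential backoff in shell. It assumes the gateway responds with HTTP 429 when a limit is exceeded (verify the actual status code in your deployment) and retries the request from the earlier example with increasing delays:

# Minimal backoff sketch: retry on HTTP 429 (assumed rate limit status) with increasing delays
for attempt in 1 2 3 4 5; do
  status=$(curl -s -o response.json -w "%{http_code}" \
    -X POST "http://localhost:8081/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Host: your-subdomain.example.com" \
    -H "X-Api-Key: your-api-key" \
    -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello, how are you?"}]}')
  if [ "$status" != "429" ]; then
    cat response.json     # success (or a non-rate-limit error): stop retrying
    break
  fi
  sleep $(( 2 ** attempt ))   # wait 2s, 4s, 8s, 16s, 32s before retrying
done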
Troubleshooting
If you encounter rate limit errors, check:
- Current rate limit status using response headers
- Token consumption patterns in your requests
- Plugin configuration in the gateway
- Time until rate limits reset
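If the plugin configuration is the suspect, you can read the gateway back from the admin API. The sketch below assumes the gateway resource supports a standard GET on the same path used to create it (verify against your TrustGate version):

# Assumed admin API call: retrieve the gateway to verify the token_rate_limiter settings
curl -X GET "http://localhost:8080/api/v1/gateways/{gateway_id}" \
-H "Content-Type: application/json"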
Next Steps
- Implement monitoring for rate limit metrics
- Set up alerts for high token usage
- Configure multiple rate limit tiers
- Add fallback providers for high availability